Optimize gp_clinical_to_stem
On the first run on the real data, gp_clinical_to_stem
was relatively slow (1.5h out of a total ETL runtime of 3h).
We should investigate what part of the function is slowest and how we can optimize that.
Todo: run the IntelliJ profiler on this function to find which part of the script needs optimisation.
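As an alternative to the IDE profiler, the same numbers can be collected from the standard library. A minimal sketch, using dummy stand-ins for the real entry points (`gp_clinical_to_stem_table` and `generate_code_mapping_dictionary` here are placeholders, not the actual implementations):

```python
import cProfile
import io
import pstats

# Dummy stand-ins for the real call chain (main > run > transform > ...).
def generate_code_mapping_dictionary():
    # simulate expensive lookup-table construction
    return {i: str(i) for i in range(100_000)}

def gp_clinical_to_stem_table():
    mapping = generate_code_mapping_dictionary()
    return len(mapping)

profiler = cProfile.Profile()
profiler.enable()
gp_clinical_to_stem_table()
profiler.disable()

# Print cumulative-time stats, filtered to the function of interest.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats(pstats.SortKey.CUMULATIVE)
stats.print_stats("generate_code_mapping_dictionary")
report = stream.getvalue()
print(report)
```

Sorting by cumulative time makes it easy to see which callee dominates a parent function's runtime, which is what the percentages below express.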
small set, 100
main (64.887s) > run > transform > execute_transformation > ...
- ... gp_clinical_to_stem_table [25.0%] > ...
- ... generate_code_mapping_dictionary [45.4%] (higher, because called from other functions too)
synthetic set, 5000
all functions
main (370.0s) > run > transform > execute_transformation > ...
- ... gp_clinical_to_stem_table [5.2%, 19s] > ...
- ... generate_code_mapping_dictionary [8.3%, 31s]
(other heavy transformations: ...> transform > execute_batch_transformation > baseline_to_stem [14.6%])
only gp_clinical_to_stem transformation
main (28.4s) > run > transform > execute_transformation > ...
- ... gp_clinical_to_stem_table [56.8%, 16s] > ...
- ... generate_code_mapping_dictionary [44.8%, 13s]
The function gp_clinical_to_stem_table
spends most of its running time in generate_code_mapping_dictionary.
That function is independent of the number of rows in the source data tables, so with a larger data set its runtime is not expected to increase substantially.
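Since the mapping dictionary does not depend on the source rows, a cheap optimisation, if one were ever needed, would be to compute it once and reuse it across the transformations that call it. A minimal memoization sketch, assuming generate_code_mapping_dictionary takes no row-dependent arguments (the mapping entry shown is hypothetical):

```python
from functools import lru_cache

# Cache the single result so repeated calls from different
# transformations reuse the same dictionary instead of rebuilding it.
# Caveat: callers must treat the returned dict as read-only.
@lru_cache(maxsize=1)
def generate_code_mapping_dictionary():
    # stands in for the real (expensive) mapping construction
    return {"read2:C10.": "OMOP:201826"}  # hypothetical entry

first = generate_code_mapping_dictionary()
second = generate_code_mapping_dictionary()
assert first is second  # second call hits the cache
```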
Close.