Optimize gp_clinical_to_stem

Question

Closed this issue 4 years ago · 3 comments

MaximMoinat commented 4 years ago

On the first run on the real data, gp_clinical_to_stem was relatively slow: it took 1.5h of a total ETL runtime of 3h.

We should investigate what part of the function is slowest and how we can optimize that.

Answer 1 · 2021-02-01T10:17:21.000Z

Todo: run the IntelliJ profiler on this function to find which part of the script needs optimisation.
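
For anyone without the IDE profiler at hand, a similar breakdown can be had from the standard library's cProfile. The following is only a minimal sketch: the two function bodies are hypothetical stand-ins (the real gp_clinical_to_stem_table and generate_code_mapping_dictionary are not shown in this thread), and profile_call is a helper name invented for the example.

```python
import cProfile
import io
import pstats


def generate_code_mapping_dictionary():
    # Hypothetical stand-in: builds a lookup whose size does not depend
    # on the number of source rows.
    return {str(i): i for i in range(50_000)}


def gp_clinical_to_stem_table(rows):
    # Hypothetical stand-in for the real transformation.
    mapping = generate_code_mapping_dictionary()
    return [mapping.get(code) for code in rows]


def profile_call(func, *args):
    """Run func under cProfile; return (result, text report by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 entries only
    return result, stream.getvalue()


result, report = profile_call(gp_clinical_to_stem_table, ["1", "2", "x"])
print(report)
```

Sorting by cumulative time puts callers such as gp_clinical_to_stem_table next to the helpers they spend their time in, which is the same view the call graphs give.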

Answer 2 · 2021-03-08T14:55:23.000Z

small set, 100

call_graph.pdf

main (64.887s) > run > transform > execute_transformation > ...

... gp_clinical_to_stem_table [25.0%] > ...
... generate_code_mapping_dictionary [45.4%] (higher, because called from other functions too)

synthetic set, 5000

all functions

call_graph_large.pdf

main (370.0s) > run > transform > execute_transformation > ...

... gp_clinical_to_stem_table [5.2%, 19s] > ...
... generate_code_mapping_dictionary [8.3%, 31s]

(other heavy transformations: ...> transform > execute_batch_transformation > baseline_to_stem [14.6%])

only gp_clinical_to_stem transformation

call_graph_large_2funcs.pdf

main (28.4s) > run > transform > execute_transformation > ...

... gp_clinical_to_stem_table [56.8%, 16s] > ...
... generate_code_mapping_dictionary [44.8%, 13s]
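
To compare the three runs on the same scale, the percentages can be converted into seconds. This is a small arithmetic sketch using only the totals and shares listed above; the absolute times for the small set are derived here, not reported by the profiler, and the dictionary layout and the seconds helper are made up for the example.

```python
# Total runtime in seconds and the share (%) of each function, per run,
# copied from the profiler output above.
runs = {
    "small set, 100": (64.887, {
        "gp_clinical_to_stem_table": 25.0,
        "generate_code_mapping_dictionary": 45.4,
    }),
    "synthetic set, 5000, all functions": (370.0, {
        "gp_clinical_to_stem_table": 5.2,
        "generate_code_mapping_dictionary": 8.3,
    }),
    "synthetic set, 5000, only gp_clinical_to_stem": (28.4, {
        "gp_clinical_to_stem_table": 56.8,
        "generate_code_mapping_dictionary": 44.8,
    }),
}


def seconds(total, percent):
    """Convert a share of the total runtime into seconds, rounded to 0.1s."""
    return round(total * percent / 100, 1)


absolute = {
    run: {name: seconds(total, pct) for name, pct in shares.items()}
    for run, (total, shares) in runs.items()
}

for run, times in absolute.items():
    print(run)
    for name, secs in times.items():
        print(f"  {name}: {secs}s")
```

With all functions enabled, generate_code_mapping_dictionary comes out at roughly 30s on both the 100 set and the 5000 set, even though its share of the total drops from 45.4% to 8.3%.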

Answer 3 · 2021-03-10T09:55:53.000Z

The function gp_clinical_to_stem_table spends most of its running time in generate_code_mapping_dictionary. The running time of that function is independent of the number of rows in the source data tables, so with a larger data set the time is not expected to increase substantially.
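
If this fixed cost ever does need trimming, one option is to memoise the dictionary so that repeated calls from different transformations reuse one build. This is a sketch under assumptions, not the project's code: the vocabulary_id parameter, the "example_vocab" value and the function bodies are hypothetical, and it only helps if the real function is called more than once with the same arguments.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def generate_code_mapping_dictionary(vocabulary_id):
    # Hypothetical stand-in for the expensive build. The result is cached
    # per vocabulary_id, so callers must treat the returned dict as read-only.
    return {f"{vocabulary_id}:{i}": i for i in range(1000)}


def gp_clinical_to_stem_table(codes):
    # Hypothetical transformation that needs the mapping.
    mapping = generate_code_mapping_dictionary("example_vocab")
    return [mapping.get(code) for code in codes]


def baseline_to_stem(codes):
    # A second hypothetical caller reusing the same cached mapping.
    mapping = generate_code_mapping_dictionary("example_vocab")
    return [code in mapping for code in codes]


stem_rows = gp_clinical_to_stem_table(["example_vocab:1", "unknown"])
flags = baseline_to_stem(["example_vocab:2", "unknown"])
info = generate_code_mapping_dictionary.cache_info()
print(info)  # one miss for the first build, one hit for the reuse
```

The cache lives for the duration of the process, which fits a single ETL run; generate_code_mapping_dictionary.cache_clear() resets it if the underlying vocabulary changes.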
Close.