EHDEN/ETL-UK-Biobank

Optimize gp_clinical_to_stem

Closed · 3 comments

On the first run with the real data, gp_clinical_to_stem was relatively slow: it took 1.5h of a total ETL runtime of 3h.

We should investigate what part of the function is slowest and how we can optimize that.

Todo: run the IntelliJ profiler on this function to find which part of the script needs optimisation.
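If the IDE profiler is not at hand, the standard-library cProfile module gives a comparable breakdown. Below is a minimal sketch, assuming the ETL is plain Python; the two function bodies are hypothetical stand-ins that only borrow the names from this issue, and `profile_call` is a helper made up for the example.

```python
import cProfile
import io
import pstats


def generate_code_mapping_dictionary():
    # Hypothetical stand-in: builds a lookup table whose size does not
    # depend on the number of source rows.
    return {str(i): i for i in range(50_000)}


def gp_clinical_to_stem_table(rows):
    # Hypothetical stand-in for the real transformation.
    mapping = generate_code_mapping_dictionary()
    return [mapping.get(code) for code in rows]


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so callers that delegate work show up first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(gp_clinical_to_stem_table, ["1", "2", "missing"])
print(report)
```

The cumulative-time column of the report is the closest equivalent of the percentages in the call graphs.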

small set, 100

call_graph.pdf

main (64.887s) > run > transform > execute_transformation > ...

  • ... gp_clinical_to_stem_table [25.0%] > ...
  • ... generate_code_mapping_dictionary [45.4%] (higher, because called from other functions too)

synthetic set, 5000

all functions

call_graph_large.pdf

main (370.0s) > run > transform > execute_transformation > ...

  • ... gp_clinical_to_stem_table [5.2%, 19s] > ...
  • ... generate_code_mapping_dictionary [8.3%, 31s]

(other heavy transformations: ...> transform > execute_batch_transformation > baseline_to_stem [14.6%])

only gp_clinical_to_stem transformation

call_graph_large_2funcs.pdf

main (28.4s) > run > transform > execute_transformation > ...

  • ... gp_clinical_to_stem_table [56.8%, 16s] > ...
  • ... generate_code_mapping_dictionary [44.8%, 13s]

gp_clinical_to_stem_table spends most of its running time in generate_code_mapping_dictionary. The runtime of that function is independent of the number of rows in the source data tables, so with a larger data set the time is not expected to increase substantially.
Closing.
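
To illustrate the reasoning behind this conclusion, here is a minimal sketch under the assumption that the mapping dictionary is built once per transformation call and then only used for lookups. The function bodies and the `builds_for` helper are hypothetical; only the function names come from this issue.

```python
build_count = 0


def generate_code_mapping_dictionary():
    # Hypothetical stand-in: counts how often the mapping is built.
    global build_count
    build_count += 1
    return {"A": 1, "B": 2}


def gp_clinical_to_stem_table(rows):
    # Assumption: the mapping is built once, then each row is a cheap lookup.
    mapping = generate_code_mapping_dictionary()
    return [mapping.get(code) for code in rows]


def builds_for(n_rows):
    """Return how many times the mapping is built for n_rows input rows."""
    global build_count
    build_count = 0
    gp_clinical_to_stem_table(["A", "B"] * (n_rows // 2))
    return build_count


print(builds_for(100), builds_for(5000))
```

Under that assumption the dictionary is built the same number of times for 100 and for 5000 rows, so its cost is a fixed overhead that does not grow with the input.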