Accessing the PL field is slow when the number of samples gets large
astaric opened this issue
Using the master version of pysam and the following benchmark script (https://github.com/astaric/pysam/blob/speed-up-bcf-genotype-count/tests/VariantRecordPL_bench.py), I get the following runtimes for accessing the PL field of all samples in a single record:
benchmark: 5 tests

| Name (time in us) | Min | Max | Mean | StdDev | Median | IQR | Outliers | OPS | Rounds | Iterations |
|---|---:|---:|---:|---:|---:|---:|---|---:|---:|---:|
| test_access_pl_values_10_samples | 38.2920 (1.0) | 185.3330 (1.0) | 41.2211 (1.0) | 8.9669 (inf) | 39.9170 (1.0) | 2.3340 (inf) | 15;33 | 24,259.4229 (1.0) | 733 | 1 |
| test_access_pl_values_100_samples | 136.3750 (3.56) | 546.6249 (2.95) | 142.9587 (3.47) | 12.5355 (inf) | 139.7920 (3.50) | 3.6876 (inf) | 119;289 | 6,995.0286 (0.29) | 3296 | 1 |
| test_access_pl_values_1000_samples | 3,246.3749 (84.78) | 8,651.0000 (46.68) | 3,340.0859 (81.03) | 425.3084 (inf) | 3,273.8131 (82.02) | 87.7501 (inf) | 2;8 | 299.3935 (0.01) | 274 | 1 |
| test_access_pl_values_10000_samples | 257,407.6669 (>1000.0) | 270,355.5422 (>1000.0) | 261,211.9690 (>1000.0) | 6,149.1067 (inf) | 258,542.3335 (>1000.0) | 7,216.5631 (inf) | 1;0 | 3.8283 (0.00) | 4 | 1 |
| test_access_pl_values_100000_samples | 25,141,500.8749 (>1000.0) | 25,141,500.8749 (>1000.0) | 25,141,500.8749 (>1000.0) | 0.0000 (1.0) | 25,141,500.8749 (>1000.0) | 0.0000 (1.0) | 0;0 | 0.0398 (0.00) | 1 | 1 |
Once the number of samples gets large, having 10x the samples results in roughly 100x slower execution (e.g. the median goes from ~259 ms at 10,000 samples to ~25.1 s at 100,000 samples), which indicates the operation is quadratic in the number of samples.
I looked into the codebase and found that most of the time is spent in `bcf_get_genotypes`. This function is called while computing the number of elements in the PL field, before each access to the field. `bcf_get_genotypes` creates an array of GT information for all samples, which is then used to compute `max_ploidy`. The only part of the array that is actually read is the GT information for the "current" sample; the rest is discarded.
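
For context, here is a simplified C-level sketch of that pattern (illustrative only, not the exact pysam code; `hdr` and `rec` are assumed to be a valid htslib header and record):

```c
#include <stdlib.h>
#include <htslib/vcf.h>

/* Simplified sketch of the current pattern: before each PL access, the
 * element count is derived via bcf_get_genotypes(), which converts and
 * copies the GT values of *all* samples into gt_arr, even though only
 * one sample's slice is inspected afterwards. */
static int max_ploidy_current(const bcf_hdr_t *hdr, bcf1_t *rec)
{
    int32_t *gt_arr = NULL;
    int ngt_arr = 0;
    /* O(n_samples) copy and conversion on every call */
    int ngt = bcf_get_genotypes(hdr, rec, &gt_arr, &ngt_arr);
    int max_ploidy = ngt > 0 ? ngt / bcf_hdr_nsamples(hdr) : -1;
    free(gt_arr);
    return max_ploidy;
}
```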
I am not really sure what the idea behind this `max_ploidy` check is. If it could be removed, one could simply access the GT values for the current sample and count them to get the ploidy, which should be much faster than creating the array for all samples; see the sketch below.
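
A rough sketch of that idea at the htslib level (an assumption of how a fix could look, not a tested implementation; it only handles the common 8-bit GT encoding and sidesteps whatever the `max_ploidy` check was guarding against):

```c
#include <stddef.h>
#include <htslib/vcf.h>

/* Sketch of the proposed alternative: read only the current sample's
 * GT slice from the packed FORMAT block and count its values.
 * Assumes BCF_BT_INT8 GT encoding; a real implementation would also
 * need to handle the 16- and 32-bit encodings. */
static int sample_ploidy(const bcf_hdr_t *hdr, bcf1_t *rec, int sample)
{
    bcf_fmt_t *gt = bcf_get_fmt(hdr, rec, "GT"); /* unpacks FMT if needed */
    if (!gt || gt->type != BCF_BT_INT8)
        return -1;                    /* GT absent or unhandled encoding */

    /* gt->n is the per-sample value count (i.e. the maximum ploidy);
     * samples with lower ploidy are padded with bcf_int8_vector_end.
     * gt->size is the per-sample stride in bytes. */
    const int8_t *vals = (const int8_t *)(gt->p + (size_t)sample * gt->size);
    int ploidy = 0;
    while (ploidy < gt->n && vals[ploidy] != bcf_int8_vector_end)
        ploidy++;
    return ploidy;                    /* O(max_ploidy), not O(n_samples) */
}
```

This reads at most `gt->n` values for the requested sample, so the cost of each PL access would no longer grow with the number of samples.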