fgvieira/ngsLD

Difference between python and perl LD pruning scripts

dmacguigan opened this issue · 3 comments

Thank you to @zjnolen for the new LD pruning python script! It's a great runtime improvement. A single linkage group that took ~72 hours with the perl script now takes less than an hour!

However, I was hoping for some clarification about the differences between the perl and python LD pruning scripts. I ran both scripts on the same input file with the same max_dist, the same min_weight, and all other parameters at their defaults, yet the python script keeps 5X more SNPs. Is something different going on under the hood?

Perl script:

# 68000 nodes pruned; 13262 linked nodes remaining ...
# 69000 nodes pruned; 10964 linked nodes remaining ...
# 70000 nodes pruned; 8964 linked nodes remaining ...
# 71000 nodes pruned; 6964 linked nodes remaining ...
# 72000 nodes pruned; 4964 linked nodes remaining ...
# 73000 nodes pruned; 2964 linked nodes remaining ...
# 74000 nodes pruned; 964 linked nodes remaining ...
### Run finished (final graph with 10386 nodes)

Python script:

56000 positions remaining with 4447 edges between them...
55000 positions remaining with 3447 edges between them...
54000 positions remaining with 2447 edges between them...
53000 positions remaining with 1447 edges between them...
52000 positions remaining with 447 edges between them...
Pruning complete! Pruned graph contains 51553 positions.
Exporting kept positions to file...
Total runtime: 0:55:48.123912

Hey @dmacguigan!

Glad that it seems to have helped speed things up for you! These should be returning the same sites both in terms of number and identity, as this was one of the things we looked into before merging. I can think of a couple of things to check first, if you haven't already:

  1. Do the output files themselves have different numbers of lines? The number in the completion message can differ between the two scripts: the python script keeps sites that start out unlinked in the graph while it prunes, whereas the perl script removes them before pruning. Both scripts add these unlinked sites back to the final list, so the 'pruned graph' size reported at the end depends on how many sites were unlinked at the start, with the perl count lower than the python count. But if the output files themselves also differ in line count, then something is still off.

  2. For the max_dist option, the scale differs between the scripts: the perl script's --max_kb_dist takes kilobases, while the python script's --max_dist takes bases. If the value isn't multiplied by 1000 when moving to the python script, the threshold for 'linking' sites would be a much lower distance than intended, which would produce something like what you're seeing.
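To make the two checks concrete, here is a minimal sketch. The flag names (--max_kb_dist, --max_dist) come from the scripts themselves; the site names, values, and the accounting model are purely illustrative, not the actual ngsLD code:

```python
# Check 2: the distance threshold uses different units in the two scripts.
# --max_kb_dist (perl) is in kilobases; --max_dist (python) is in bases.
max_kb_dist = 50                  # e.g. perl: --max_kb_dist 50
max_dist = max_kb_dist * 1000     # python equivalent: --max_dist 50000

# Check 1: why the completion messages can report different node counts.
# Toy site list where one site has no LD edge passing the weight cutoff.
sites = ["chr1:100", "chr1:500", "chr1:900"]
edges = {("chr1:100", "chr1:500")}            # chr1:900 starts unlinked

linked = {site for edge in edges for site in edge}
unlinked = [site for site in sites if site not in linked]

# The perl script prunes only the linked subgraph (2 nodes here), while
# the python script keeps all sites in its graph (3 nodes). Both add the
# unlinked sites back to the final output, so the kept-site files should
# still match even though the completion messages differ.
perl_graph_size = len(linked)     # 2
python_graph_size = len(sites)    # 3
```

If the two completion messages differ by roughly the number of initially unlinked sites, that alone is not a bug; only a mismatch in the output files themselves is.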

If it isn't either of these, would you be open to sharing a subset of your input file that replicates the issue?

Best,
Zach

Aha, I didn't read carefully enough. As you suspected, I used kbp for the --max_dist parameter in the python script.

Upon rerunning the python pruning script with bp instead of kbp, I now get the same number of retained SNPs as the perl script.

Thank you again! This python script is a lifesaver for those of us with strict walltime limits.

Wonderful, glad to hear that fixed it for you!