fgvieira/ngsLD

Value Error in prune_ngsLD.py

TrishikaS opened this issue · 16 comments

Hi again:)

I am now trying to run the prune_ngsLD.py script, but I get an error:

Checking if file is gzipped...
Reading in data...
Filtering edges by distance...
Filtering edges by weight...
Beginning pruning by dropping heaviest position...
Starting with 77988 positions with 393573 edges between them...
Traceback (most recent call last):
  File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in <module>
    map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
  File "/home/ubuntu/anaconda3/envs/ngsLD_N/lib/python3.11/site-packages/graph_tool/__init__.py", line 1191, in map_property_values
    libcore.property_map_values(u._Graph__graph,
  File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in <lambda>
    map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
                                                               ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot convert float NaN to integer
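For context, the exception can be reproduced in isolation: Python's `int()` refuses NaN, and prune_ngsLD.py maps every edge weight through `lambda x: int(x * weight_precision)`, so a single nan weight in the input is enough to crash it. A minimal sketch (`weight_precision` here is just an illustrative stand-in for the script's actual setting):

```python
weight_precision = 10000   # illustrative value, not necessarily the script's
weight = float("nan")      # a weight parsed from a line like "... nan nan"

try:
    int(weight * weight_precision)  # nan * anything is still nan
except ValueError as err:
    print(err)  # -> cannot convert float NaN to integer
```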


Would you have any suggestions or ideas on how to fix this issue?

Thanks in advance

Do you have any missing values in your input file?

Yes, I generated the LD_input file from a beagle file, and I guess I do have some NaNs in there.
Should I remove them, and if so, how?

If this helps, these are my steps ...

#1. estimate genotype likelihoods with angsd

angsd -GL 2 -out $OUT/chr25_gl -minMapQ 30 -minQ 20 -nThreads 10 -doGlf 2 -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam $BAM

#2. make LD input file

ngsLD --geno chr25_gl.beagle.gz --probs --ignore_miss_data --rnd_sample 0.05 --seed 1 --n_ind 20 --n_sites 78000 --pos sites --out LD_input

#3. run py script for pruning

prune_ngsLD.py --input LD_input --max_dist 50000 --min_weight 0.1 --out testLD_unlinked.pos

Can you paste here some lines where you have missing data?

25:449 25:3558 3109 0.135633 -0.000000 nan nan
25:449 25:4145 3696 0.015543 -0.023482 0.999814 0.035279
25:449 25:4525 4076 0.006558 -0.017165 0.224540 0.007366
25:449 25:5174 4725 0.169403 0.169219 0.999968 0.999910
25:449 25:6389 5940 0.032531 -0.103020 0.999272 0.227801
25:449 25:7569 7120 0.153657 -0.080698 1.000000 0.157663
25:449 25:7800 7351 0.121338 -0.125412 0.999920 0.310249
25:449 25:14675 14226 0.135194 -0.145148 0.999973 0.408850
25:449 25:15575 15126 0.133985 -0.145086 0.999972 0.408497
25:449 25:30626 30177 0.024597 -0.000000 nan nan
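As a quick sanity check, something like the sketch below can count how many of the ngsLD output lines contain a nan field before pruning (the whitespace-separated layout is taken from the lines pasted above; the filename is whatever --out produced):

```python
def count_nan_lines(path):
    """Count ngsLD output lines in which any field is nan."""
    total = with_nan = 0
    with open(path) as fh:
        for line in fh:
            total += 1
            # a single nan field (e.g. the last two LD columns) marks the pair
            if any(field.lower() in ("nan", "-nan") for field in line.split()):
                with_nan += 1
    return with_nan, total
```

For example, run on the ten lines pasted above it would report 2 of 10 lines with nan.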

I suspect that the problem is that some of those sites are not variable. If so, you can try increasing the minimum allele frequency, e.g. with --min_maf 0.03.

If not, can you send me the information for sites 25:449 and 25:3558 on the maf and beagle files?

Thanks for your reply,
I just tried it with --min_maf 0.03 and it gave the same error. When I remove the lines with nan (around 10% of all sites), the script works just fine.
So here is the information for sites 25:449 and 25:3558

Mafs
25 449 T C 0.269521 4.790291e-08 7

Beagle
25_449 3 1 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.000044 0.333333 0.666622 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.665962 0.333333 0.000704 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.799979 0.200021 0.000000 0.000044 0.333333 0.666622 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.665962 0.333333 0.000704

Mafs
25 3558 C T 0.285733 4.107843e-08 7

Beagle
25_3558 1 3 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.000044 0.333333 0.666622 0.000044 0.333333 0.666622 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.665962 0.333333 0.000704 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333

What is the coverage of these samples?
Can you try running ngsLD with --min_maf 0.03 but without --ignore_miss_data?

The coverage is ~10x and we want to run analysis also with 4x and 2x coverage in the future.
I tried running ngsLD without --ignore_miss_data, and again the same error appeared :(

Checking if file is gzipped...
Reading in data...
Filtering edges by distance...
Filtering edges by weight...
Beginning pruning by dropping heaviest position...
Starting with 76695 positions with 395834 edges between them...
Traceback (most recent call last):
  File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in <module>
    map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
  File "/home/ubuntu/anaconda3/envs/ngsLD_N/lib/python3.11/site-packages/graph_tool/__init__.py", line 1191, in map_property_values
    libcore.property_map_values(u._Graph__graph,
  File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in <lambda>
    map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
ValueError: cannot convert float NaN to integer

But with 10x coverage, how do you have so much missing data (those two sites have 70% missing data)?! Is it whole genome or targeted sequencing?

Yeah, 70% seems like a lot. It's whole-genome sequencing; I used only one chromosome just to check if the script runs ok.

Could it be some repetitive regions? If so, it might be good to use -minInd when running angsd.

So, I tested with -minInd 10 (I have 20 individuals overall), and then I ran ngsLD with --min_maf 0.03 and without --ignore_miss_data, and I still get the same error... Quick explanation: those 20 samples are from 2 populations (10 individuals per population) with an Fst of 2%. The species I'm working with has low diversity anyway.

Setting this issue aside, should the py script ignore NaNs? Is there some way to work around it? When I deleted the lines with nan, the py script worked just fine.

It should be ok to remove NaNs, but I am just a bit surprised by the amount of missing data on samples that were sequenced at 10x.
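For anyone else hitting this, a small sketch of that workaround: drop any line containing nan before running prune_ngsLD.py (the file names in the usage comment are hypothetical):

```python
import sys

def drop_nan_lines(in_path, out_path):
    """Write a copy of an ngsLD output file without lines containing nan."""
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if any(field.lower() in ("nan", "-nan") for field in line.split()):
                dropped += 1          # skip pairs with undefined LD
            else:
                dst.write(line)
                kept += 1
    return kept, dropped

if __name__ == "__main__":
    # e.g.: python drop_nan.py LD_input LD_input.nonan
    kept, dropped = drop_nan_lines(sys.argv[1], sys.argv[2])
    print(f"kept {kept} lines, dropped {dropped} containing nan")
```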

Just noticed that it seems you ran angsd with just one BAM? But then, how did you get the beagle file?

Ah sorry, I guess it got lost in copy-pasting; I have a bamlist file with all 20 BAM files :)
Some samples have a coverage of ~8x, and some go up to 15x.
If I may ask, what amount of missing data would you expect for samples at 10x? (I'll soon do the analysis with more populations sequenced at 10x and 15x, so I'll check whether I get less missing data.)
I'll now run the analysis with the whole genome of these 2 populations, and remove the NaNs.

Thanks a lot for your quick replies!

If you used angsd with -minInd 10 and have 20 individuals, you should not have more than 50% missing data, no?
How do your samples have 70%?

Can you also paste here the extended output of some sites with missing data?

I also just made a faster alternative to the pruning scripts; do you think you could give it a try and see how it goes?