fgvieira/ngsLD

Value Error in prune_ngsLD.py

TrishikaS opened this issue · 16 comments

Hi again:)

I am now trying to run the prune_ngsLD.py script, but I get an error:

Checking if file is gzipped...
Reading in data...
Filtering edges by distance...
Filtering edges by weight...
Beginning pruning by dropping heaviest position...
Starting with 77988 positions with 393573 edges between them...
Traceback (most recent call last):
  File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in <module>
    map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
  File "/home/ubuntu/anaconda3/envs/ngsLD_N/lib/python3.11/site-packages/graph_tool/__init__.py", line 1191, in map_property_values
    libcore.property_map_values(u._Graph__graph,
  File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in <lambda>
    map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
                                                               ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot convert float NaN to integer
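For context, the exception can be reproduced in isolation: Python's `int()` refuses NaN, and prune_ngsLD.py maps every edge weight through `lambda x: int(x * weight_precision)`, so a single nan weight in the input is enough to crash it. A minimal sketch (`weight_precision` here is just an illustrative stand-in for the script's actual setting):

```python
weight_precision = 10000   # illustrative value, not necessarily the script's
weight = float("nan")      # a weight parsed from a line like "... nan nan"

try:
    int(weight * weight_precision)  # nan * anything is still nan
except ValueError as err:
    print(err)  # -> cannot convert float NaN to integer
```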


Would you have any suggestions or ideas on how to fix this issue?

Thanks in advance

Do you have any missing values in your input file?

Yes, I generated the LD_input file from a beagle file, and I guess I do have some NaNs in there.
Should I remove them, and if so, how?

If this helps, these are my steps ...

#1. estimate genotype likelihoods with angsd

angsd -GL 2 -out $OUT/chr25_gl -minMapQ 30 -minQ 20 -nThreads 10 -doGlf 2 -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam $BAM

#2. make LD input file

ngsLD --geno chr25_gl.beagle.gz --probs --ignore_miss_data --rnd_sample 0.05 --seed 1 --n_ind 20 --n_sites 78000 --pos sites --out LD_input

#3. run py script for pruning

prune_ngsLD.py --input LD_input --max_dist 50000 --min_weight 0.1 --out testLD_unlinked.pos

Can you paste here some lines where you have missing data?

25:449 25:3558 3109 0.135633 -0.000000 nan nan
25:449 25:4145 3696 0.015543 -0.023482 0.999814 0.035279
25:449 25:4525 4076 0.006558 -0.017165 0.224540 0.007366
25:449 25:5174 4725 0.169403 0.169219 0.999968 0.999910
25:449 25:6389 5940 0.032531 -0.103020 0.999272 0.227801
25:449 25:7569 7120 0.153657 -0.080698 1.000000 0.157663
25:449 25:7800 7351 0.121338 -0.125412 0.999920 0.310249
25:449 25:14675 14226 0.135194 -0.145148 0.999973 0.408850
25:449 25:15575 15126 0.133985 -0.145086 0.999972 0.408497
25:449 25:30626 30177 0.024597 -0.000000 nan nan
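As a quick sanity check, something like the sketch below can count how many of the ngsLD output lines contain a nan field before pruning (the whitespace-separated layout is taken from the lines pasted above; the filename is whatever --out produced):

```python
def count_nan_lines(path):
    """Count ngsLD output lines in which any field is nan."""
    total = with_nan = 0
    with open(path) as fh:
        for line in fh:
            total += 1
            # a single nan field (e.g. the last two LD columns) marks the pair
            if any(field.lower() in ("nan", "-nan") for field in line.split()):
                with_nan += 1
    return with_nan, total
```

For example, run on the ten lines pasted above it would report 2 of 10 lines with nan.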

I suspect that the problem is that some of those sites are not variable. If so, you can try increasing the minimum allele frequency, e.g. with --min_maf 0.03.

If not, can you send me the information for sites 25:449 and 25:3558 on the maf and beagle files?

Thanks for your reply,
I just tried it with --min_maf 0.03 and it gave the same error. When I remove the lines with nan (around 10% of all sites), the script works just fine.
So here is the information for sites 25:449 and 25:3558

Mafs
25 449 T C 0.269521 4.790291e-08 7

Beagle
25_449 3 1 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.000044 0.333333 0.666622 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.665962 0.333333 0.000704 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.799979 0.200021 0.000000 0.000044 0.333333 0.666622 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.665962 0.333333 0.000704

Mafs
25 3558 C T 0.285733 4.107843e-08 7

Beagle
25_3558 1 3 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.000044 0.333333 0.666622 0.000044 0.333333 0.666622 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.665962 0.333333 0.000704 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333

What is the coverage of these samples?
Can you try running ngsLD with --min_maf 0.03 but without --ignore_miss_data?

The coverage is ~10x and we want to run analysis also with 4x and 2x coverage in the future.
I tried running ngsLD without --ignore_miss_data, and again the same error appeared :(

Checking if file is gzipped...
Reading in data...
Filtering edges by distance...
Filtering edges by weight...
Beginning pruning by dropping heaviest position...
Starting with 76695 positions with 395834 edges between them...
Traceback (most recent call last):
  File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in <module>
    map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
  File "/home/ubuntu/anaconda3/envs/ngsLD_N/lib/python3.11/site-packages/graph_tool/__init__.py", line 1191, in map_property_values
    libcore.property_map_values(u._Graph__graph,
  File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in <lambda>
    map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
ValueError: cannot convert float NaN to integer

But with 10x coverage, how do you have so much missing data (those two sites have 70% missing data)?! Is it whole genome or targeted sequencing?

Yeah, 70% seems like a lot. It's whole-genome sequencing; I used only one chromosome just to check if the script runs ok.

Could it be some repetitive regions? If so, it might be good to use -minInd when running angsd.

So, I tested with -minInd 10 (I have 20 individuals overall), and then I ran ngsLD with --min_maf 0.03 and without --ignore_miss_data, and I still get the same error... Quick explanation: those 20 samples are from 2 populations (10 individuals per population) with an Fst of 2%. The species I'm working with has low diversity anyway.

Setting this issue aside, should the py script ignore NaNs? Is there some way to work around it? When I deleted the lines with nan, the py script worked just fine.

It should be ok to remove NaNs, but I am just a bit surprised by the amount of missing data on samples that were sequenced at 10x.
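For anyone else hitting this, a small sketch of that workaround: drop any line containing nan before running prune_ngsLD.py (the file names in the usage comment are hypothetical):

```python
import sys

def drop_nan_lines(in_path, out_path):
    """Write a copy of an ngsLD output file without lines containing nan."""
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if any(field.lower() in ("nan", "-nan") for field in line.split()):
                dropped += 1          # skip pairs with undefined LD
            else:
                dst.write(line)
                kept += 1
    return kept, dropped

if __name__ == "__main__":
    # e.g.: python drop_nan.py LD_input LD_input.nonan
    kept, dropped = drop_nan_lines(sys.argv[1], sys.argv[2])
    print(f"kept {kept} lines, dropped {dropped} containing nan")
```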

Just noticed that it seems you ran angsd with just one BAM? But then, how did you get the beagle file?

Ah sorry, I guess it got lost in copy-pasting; I have a bamlist file with all 20 BAM files :)
Some samples have a coverage of ~8x, and some go up to 15x.
If I may ask, what amount of missing data would you expect for samples at 10x? (I'll soon do the analysis with more populations sequenced at 10x and 15x, so I'll check whether I get less missing data.)
I'll now run the analysis with the whole genome of these 2 populations, and remove the NaNs.

Thanks a lot for your quick replies!

If you used angsd with -minInd 10 and have 20 individuals, you should not have more than 50% missing data, no?
How do your samples have 70%?

Can you also paste here the extended output of some sites with missing data?

I also just made a faster alternative to the pruning scripts; do you think you could give it a try and see how it goes?