ngsLD removed a large number of linked sites
biozzq opened this issue · 8 comments
Dear @fgvieira
I try ngsLD using one chromosome. The geno file contains 2781316 sites. After filtering for maf over 0.05 according to the coresponding "mafs.gz" file, 848930 sites were retained. However, i found the output file (9.out) only contains 24279 sites. Why does so much sites have not been consideration during analysis? Maybe I have ignored some parameters. Thank you.
ngsLD --geno 9.beagle.gz --n_ind 116 --n_sites 2781316 --out 9.out --min_maf 0.05 --ignore_miss_data --n_threads 6 --pos 9.pos
Using following command to count the number of sites in the 9.out perl -lane '$hash{$F[0]}=1; $hash{$F[1]}=1; END { foreach my $pos (keys %hash){print $pos}}' 9.out | wc -l
Kind regards,
Zhuqing
can you include the screen output?
The screen output is as follow:
==> Input Arguments:
geno: 9.beagle.gz
probs: false
log_scale: false
n_ind: 116
n_sites: 2781316
pos: 9.pos
max_kb_dist (kb): 100
max_snp_dist: 0
min_maf: 0.050000
ignore_miss_data: true
call_geno: false
N_thresh: 0.000000
call_thresh: 0.000000
rnd_sample: 1.000000
seed: 1575939355
extend_out: false
out: 9.out
n_threads: 6
verbose: 1
version: 1.1.0 (Dec 9 2019 @ 17:33:53)
==> GZIP input file (not BINARY)
> Reading data from file...
> Header found! Skipping line...
==> Calculating MAF for all sites...
==> Getting sites coordinates
==> Launching threads...
==> Waiting for all threads to finish...
==> Freeing memory...
Done!
Hi @biozzq,
try increasing the max_kb_dist
optiont that, by default, is set at 100 (kb).
Dear @fgvieira
I have tried to increase the option to 200; however the number of effective variants is the same. I do not know why so much variants have been filtered out during analysis.
kind regards,
Dear @fgvieira
After using the option --extend_out
, I found the MAFs calculated by ngsLD
and ANGSD
are quite different. I think this should be the main problem. For example,
#SNPs ANGSD ngsLD
9:7212 0.260520 0.051724
9:7401 0.350019 0.064655
9:7446 0.385674 0.060345
9:7454 0.372413 0.060345
9:7455 0.370337 0.060345
9:9169 0.004497 0.060345
9:9181 0.198779 0.073276
9:9199 0.485126 0.086207
9:9217 0.454970 0.056034
9:9233 0.004388 0.060345
Dear @biozzq
can you try adding the option --probs
to your command line?
beagle
files have genotype likelihoods, and so that options needs to be specified.
Dear @fgvieira
Yes, after adding the option, the results make sense. However, the MAFs are slightly different between ANGSD
and ngsLD
.
#SNP ANGSD ngsLD
9:6448 0.123104 0.149309
9:6449 0.354376 0.357990
9:6450 0.113894 0.142327
9:6454 0.253353 0.266687
9:6466 0.086997 0.116411
9:6470 0.099667 0.126569
9:6505 0.260607 0.263638
9:6531 0.176540 0.190154
9:6551 0.198538 0.211470
9:6554 0.458579 0.457871
9:6556 0.186685 0.202046
9:6558 0.194416 0.209137
9:6579 0.170711 0.186947
9:6585 0.048472 0.066536
9:6586 0.068588 0.096961
9:6594 0.073048 0.103607
9:6613 0.048612 0.083948
9:6615 0.037074 0.069246
9:6619 0.046807 0.077852
yes, there might some differences, probably due to one of two things:
- precision. genotype likelihoods are not exactly the same, since ANGSD calculates them from BAM files (as doubles) and writes them to a beagle file with less decimal places.
- number of EM iterations
However, the MAF filer in ngsLD
is just intended as a coarse filter step to get rid of low freq alelles and speed up LD calculation. If your dataset is already filtered, then you can omit that option (by default filters MAF < 0.001)