fgvieira/ngsLD

ngsLD removed a large number of linked sites

biozzq opened this issue · 8 comments

Dear @fgvieira

I try ngsLD using one chromosome. The geno file contains 2781316 sites. After filtering for maf over 0.05 according to the coresponding "mafs.gz" file, 848930 sites were retained. However, i found the output file (9.out) only contains 24279 sites. Why does so much sites have not been consideration during analysis? Maybe I have ignored some parameters. Thank you.
ngsLD --geno 9.beagle.gz --n_ind 116 --n_sites 2781316 --out 9.out --min_maf 0.05 --ignore_miss_data --n_threads 6 --pos 9.pos

Using following command to count the number of sites in the 9.out perl -lane '$hash{$F[0]}=1; $hash{$F[1]}=1; END { foreach my $pos (keys %hash){print $pos}}' 9.out | wc -l

Kind regards,
Zhuqing

can you include the screen output?

The screen output is as follow:

==> Input Arguments:
	geno: 9.beagle.gz
	probs: false
	log_scale: false
	n_ind: 116
	n_sites: 2781316
	pos: 9.pos
	max_kb_dist (kb): 100
	max_snp_dist: 0
	min_maf: 0.050000
	ignore_miss_data: true
	call_geno: false
	N_thresh: 0.000000
	call_thresh: 0.000000
	rnd_sample: 1.000000
	seed: 1575939355
	extend_out: false
	out: 9.out
	n_threads: 6
	verbose: 1
	version: 1.1.0 (Dec  9 2019 @ 17:33:53)

==> GZIP input file (not BINARY)
> Reading data from file...
> Header found! Skipping line...
==> Calculating MAF for all sites...
==> Getting sites coordinates
==> Launching threads...
==> Waiting for all threads to finish...
==> Freeing memory...
Done!

Hi @biozzq,

try increasing the max_kb_dist optiont that, by default, is set at 100 (kb).

Dear @fgvieira

I have tried to increase the option to 200; however the number of effective variants is the same. I do not know why so much variants have been filtered out during analysis.

kind regards,

Dear @fgvieira

After using the option --extend_out, I found the MAFs calculated by ngsLD and ANGSD are quite different. I think this should be the main problem. For example,

#SNPs      ANGSD     ngsLD
9:7212	0.260520	0.051724
9:7401	0.350019	0.064655
9:7446	0.385674	0.060345
9:7454	0.372413	0.060345
9:7455	0.370337	0.060345
9:9169	0.004497	0.060345
9:9181	0.198779	0.073276
9:9199	0.485126	0.086207
9:9217	0.454970	0.056034
9:9233	0.004388	0.060345

Dear @biozzq

can you try adding the option --probs to your command line?
beagle files have genotype likelihoods, and so that options needs to be specified.

Dear @fgvieira

Yes, after adding the option, the results make sense. However, the MAFs are slightly different between ANGSD and ngsLD.

#SNP      ANGSD       ngsLD
9:6448  0.123104        0.149309
9:6449  0.354376        0.357990
9:6450  0.113894        0.142327
9:6454  0.253353        0.266687
9:6466  0.086997        0.116411
9:6470  0.099667        0.126569
9:6505  0.260607        0.263638
9:6531  0.176540        0.190154
9:6551  0.198538        0.211470
9:6554  0.458579        0.457871
9:6556  0.186685        0.202046
9:6558  0.194416        0.209137
9:6579  0.170711        0.186947
9:6585  0.048472        0.066536
9:6586  0.068588        0.096961
9:6594  0.073048        0.103607
9:6613  0.048612        0.083948
9:6615  0.037074        0.069246
9:6619  0.046807        0.077852

yes, there might some differences, probably due to one of two things:

  • precision. genotype likelihoods are not exactly the same, since ANGSD calculates them from BAM files (as doubles) and writes them to a beagle file with less decimal places.
  • number of EM iterations

However, the MAF filer in ngsLD is just intended as a coarse filter step to get rid of low freq alelles and speed up LD calculation. If your dataset is already filtered, then you can omit that option (by default filters MAF < 0.001)