How to get p values for significant LD?
beausoleilmo opened this issue · 5 comments
I ran something like this:
ngsLD --geno $SUBBEAGLE/$SUBBEAGLE.gz \
--n_ind $nind \
--n_sites $nsites \
--pos $POSFILE \
--max_kb_dist 0 \
--max_snp_dist 0 \
--min_maf 0.001 \
--extend_out \
--n_threads $SLURM_CPUS_PER_TASK \
--outH ${SUBBEAGLE.}/$SUBBEAGLE.ld
==> Input Arguments:
geno: $SUBBEAGLE/$SUBBEAGLE.gz
probs: false
log_scale: false
n_ind: 482
n_sites: 15
pos: $SUBBEAGLE/$SUBBEAGLE.pos.gz (WITHOUT header)
max_kb_dist (kb): 0
max_snp_dist: 0
min_maf: 0.001000
ignore_miss_data: false
call_geno: false
N_thresh: 0.000000
call_thresh: 0.000000
rnd_sample: 1.000000
seed: 1691677158
extend_out: true
out: $SUBBEAGLE/$SUBBEAGLE.ld (WITH header)
n_threads: 1
verbose: 1
version: 1.1.1 (Sep 22 2022 @ 15:06:57)
==> GZIP input file (not BINARY)
> Reading data from file...
> Header found! Skipping line...
==> Calculating MAF for all sites...
==> Getting sites coordinates
==> Launching threads...
==> Waiting for all threads to finish...
==> Freeing memory...
And now have a .ld file with the chi2 column.
I loaded the file, in R, and got:
datalist = lapply(all.ld.files,
function(x) read.table(x,
header=F,
col.names = c("site1", "site2",
"dist", "r^2_ExpG",
"D", "D'", "r^2",
"sample_size", "maf1", "maf2",
"hap00", "hap01", "hap10", "hap11",
"hap_maf1", "hap_maf2", "chi2", "loglike", "nIter"
)))
# Calculate p values from chi square values
datalist$pvals = pchisq(datalist$chi2, df = 1, lower.tail = TRUE)
# Adjusting p-values
datalist$pvals.adj = p.adjust(p = datalist$pvals, method = "fdr")
Basically, all the positions that have significant LD have r2 = 0. Not sure if I'm doing something incorrect here.
Here's a glimpse at the output
site1 site2 dist r.2_ExpG D D. r.2 sample_size maf1 maf2 hap00 hap01 hap10 hap11 hap_maf1 hap_maf2 chi2 loglike nIter pvals pvals.adj sign r.2.rd dp.rd
1 chr1:96001606 chr2:46021459 Inf 0.037464 0.007195 0.392740 0.040863 482 0.070539 0.019710 0.918336 0.011125 0.061954 0.008585 0.070539 0.019710 0.040863 0 5 0.160197402 0.2306321 0.04 0.39
2 chr1:96001606 chr3:37364972 Inf 0.359250 0.044935 0.697829 0.387171 482 0.070539 0.087137 0.893405 0.036055 0.019458 0.051082 0.070539 0.087137 0.387171 0 5 0.466209795 0.4799218 0.39 0.70
3 chr1:96001606 chr4:17465385 Inf 0.020404 0.004569 0.338483 0.022247 482 0.070539 0.014523 0.920531 0.008929 0.064946 0.005593 0.070539 0.014523 0.022247 0 5 0.118568106 0.2215031 0.02 0.34
4 chr1:96001606 chr5:19425299 Inf 0.071550 0.016190 0.329240 0.079785 482 0.070539 0.052905 0.896477 0.032983 0.050618 0.019921 0.070539 0.052905 0.079785 0 6 0.222411016 0.2730302 0.08 0.33
5 chr1:96001606 chr5:19434057 Inf 0.000381 0.001102 0.026575 0.000434 482 0.070539 0.044606 0.889103 0.040358 0.066291 0.004248 0.070539 0.044606 0.000434 0 7 0.016620861 0.1437960 * 0.00* 0.03*
6 chr1:96001606 chr6:1161490 Inf 0.000607 0.000665 0.068970 0.000657 482 0.070539 0.010373 0.920484 0.008977 0.069143 0.001397 0.070539 0.010373 0.000657 0 5 0.020449147 0.1437960 * 0.00* 0.07*
7 chr1:96001606 chr7:16108831 Inf 0.004083 0.004049 0.077763 0.004728 482 0.070539 0.056017 0.881444 0.048016 0.062539 0.008000 0.070539 0.056017 0.004728 0 7 0.054819703 0.1801618 0.00 0.08
8 chr1:96001606 chr8:33199177 Inf 0.037418 0.008408 0.335390 0.041084 482 0.070539 0.026971 0.912800 0.016661 0.060229 0.010310 0.070539 0.026971 0.041084 0 5 0.160624133 0.2306321 0.04 0.34
[...]
22 chr10:46021459 chr3:33200027 Inf 0.000022 0.000095 0.004952 0.000020 482 0.019710 0.023859 0.956997 0.023293 0.019144 0.000566 0.019710 0.023859 0.000020 0 3 0.003568236 0.1088302 * 0.00* 0.00*
30 chr9:37364972 chr4:19434057 Inf 0.004494 -0.003790 0.975188 0.004238 482 0.087137 0.044606 0.868354 0.044509 0.087040 0.000096 0.087137 0.044606 0.004238 0 19 0.051905558 0.1801618 0.00 0.98
Also, why sometimes D' is =1 when r2 is =0 ? And at other times, D' is =0 and r2 is =0... see lines 6, 22, and 30 for example.
I'm wondering if there is not something odd going on here:
Line 305 in 7001165
Also, why is the loglike hard coded as '0.0'?
Line 348 in 7001165
Basically, all the positions that have significant LD have r2 = 0. Not sure if I'm doing something incorrect here.
It might be because of low allele frequencies; did you check sites with higher frequencies?
Also, why sometimes D' is =1 when r2 is =0 ? And at other times, D' is =0 and r2 is =0... see lines 6, 22, and 30 for example.
I'm wondering if there is not something odd going on here:
Line 305 in 7001165
I can see what you mean, but it is possible for Dprime to be 1
while r^2 is close to 0
(see e.g. last slide). I quickly doubled checked the formulas and they seem correct (see here), but let me know if you find an error.
Also, why is the doglike hard coded as '0.0'?
Line 348 in 7001165
Outputting the final log likelihood has not been fully implemented yet, so that is just a place holder (for now).
Just noted that pchisq(datalist$chi2, df = 1, lower.tail = TRUE)
, might be more appropriate with lower.tail = FALSE
. By 'low allele frequencies' do you mean that both maf1 and maf2 should be high for significance to be achieved or you mean freq_A and freq_B?
dfdf$freq_A = dfdf$hap00 + dfdf$hap01
dfdf$freq_B = dfdf$hap00 + dfdf$hap10
These frequencies are actually quite high in this data.
0.929461 0.980290
I'm not that familiar with C code.
But, actually, if I look here: https://github.com/fgvieira/ngsLD/blob/afb35a355cc28926c1588ab8a643c0e65941f579/ngsLD.cpp#L330C1-L332C1
Shouldn't the frequency be maf1 mfa2 and 1-maf1 and 1-maf2? This is an attempt to calculate the expected heterozygotes and from the R^2
## r^2 = D^2/(PA*PB*Pa*Pb)
# r2.stat = (D.stat / sqrt(datalist$maf1 * datalist$maf2 * (1-datalist$maf1) * (1-datalist$maf2)))^2
So it seems that maf1 and maf2 are PA, PB, Pa and Pb. But later in the frequency calculation, it uses the genotype frequencies (h00, h01, h10, h11). Maybe I'm misunderstanding something here, but it seems that there is a variable misplaced here. Is that correct?
Edit: No it's the same! so no problem here!
I was curious in looking at the raw beagle file and try to get a sense of the allele frequency in the file.
Basically, I wrote a naive (without model) script that finds the maximum probability for each genotype in each individual. When there is a 0.5 0.5 probability, I just take a random one (I only had a few). I then used the genotypes to get the number of each allele and divide by the total number of alleles (nb of individuals*2).
> all1/all.tot = 0.3568301
> all2/all.tot = 0.6431699
But when I look at the MAF from the ngsLD output file, I have "maf2 = 0.026971" for the same position... check line 8 for "chr8:33199177". How could that be?
Can you send a small example beagle file so I can reproduce your results?
I sent the data over email.