fgvieira/ngsLD

How to get p values for significant LD?

beausoleilmo opened this issue · 5 comments

I ran something like this:

ngsLD --geno $SUBBEAGLE/$SUBBEAGLE.gz \
      --n_ind $nind \
      --n_sites $nsites \
      --pos $POSFILE \
      --max_kb_dist 0 \
      --max_snp_dist 0 \
      --min_maf 0.001 \
      --extend_out \
      --n_threads $SLURM_CPUS_PER_TASK \
      --outH ${SUBBEAGLE.}/$SUBBEAGLE.ld
==> Input Arguments:
        geno: $SUBBEAGLE/$SUBBEAGLE.gz
        probs: false
        log_scale: false
        n_ind: 482
        n_sites: 15
        pos: $SUBBEAGLE/$SUBBEAGLE.pos.gz (WITHOUT header)
        max_kb_dist (kb): 0
        max_snp_dist: 0
        min_maf: 0.001000
        ignore_miss_data: false
        call_geno: false
        N_thresh: 0.000000
        call_thresh: 0.000000
        rnd_sample: 1.000000
        seed: 1691677158
        extend_out: true
        out: $SUBBEAGLE/$SUBBEAGLE.ld (WITH header)
        n_threads: 1
        verbose: 1
        version: 1.1.1 (Sep 22 2022 @ 15:06:57)

==> GZIP input file (not BINARY)
> Reading data from file...
> Header found! Skipping line...
==> Calculating MAF for all sites...
==> Getting sites coordinates
==> Launching threads...
==> Waiting for all threads to finish...
==> Freeing memory...

And now have a .ld file with the chi2 column.

I loaded the file, in R, and got:

datalist = lapply(all.ld.files,
                  function(x) read.table(x, 
                                         header=F, 
                                         col.names = c("site1", "site2",
                                                       "dist", "r^2_ExpG",
                                                       "D", "D'", "r^2",
                                                       "sample_size", "maf1", "maf2", 
                                                       "hap00", "hap01", "hap10", "hap11", 
                                                       "hap_maf1", "hap_maf2", "chi2", "loglike", "nIter"
                                                       ))) 

# Calculate p values from chi square values
datalist$pvals = pchisq(datalist$chi2, df = 1, lower.tail = TRUE)

# Adjusting p-values
datalist$pvals.adj = p.adjust(p = datalist$pvals, method = "fdr")

Basically, all the positions that have significant LD have r2 = 0. Not sure if I'm doing something incorrect here.

Here's a glimpse at the output

site1               site2 dist r.2_ExpG         D       D.      r.2 sample_size     maf1     maf2    hap00    hap01    hap10    hap11 hap_maf1 hap_maf2     chi2 loglike nIter       pvals pvals.adj sign r.2.rd dp.rd
1  chr1:96001606 chr2:46021459  Inf 0.037464  0.007195 0.392740 0.040863         482 0.070539 0.019710 0.918336 0.011125 0.061954 0.008585 0.070539 0.019710 0.040863       0     5 0.160197402 0.2306321        0.04  0.39
2  chr1:96001606 chr3:37364972  Inf 0.359250  0.044935 0.697829 0.387171         482 0.070539 0.087137 0.893405 0.036055 0.019458 0.051082 0.070539 0.087137 0.387171       0     5 0.466209795 0.4799218        0.39  0.70
3  chr1:96001606 chr4:17465385  Inf 0.020404  0.004569 0.338483 0.022247         482 0.070539 0.014523 0.920531 0.008929 0.064946 0.005593 0.070539 0.014523 0.022247       0     5 0.118568106 0.2215031        0.02  0.34
4  chr1:96001606 chr5:19425299  Inf 0.071550  0.016190 0.329240 0.079785         482 0.070539 0.052905 0.896477 0.032983 0.050618 0.019921 0.070539 0.052905 0.079785       0     6 0.222411016 0.2730302        0.08  0.33
5  chr1:96001606 chr5:19434057  Inf 0.000381  0.001102 0.026575 0.000434         482 0.070539 0.044606 0.889103 0.040358 0.066291 0.004248 0.070539 0.044606 0.000434       0     7 0.016620861 0.1437960    *  0.00* 0.03*
6  chr1:96001606 chr6:1161490   Inf 0.000607  0.000665 0.068970 0.000657         482 0.070539 0.010373 0.920484 0.008977 0.069143 0.001397 0.070539 0.010373 0.000657       0     5 0.020449147 0.1437960    *  0.00* 0.07*
7  chr1:96001606 chr7:16108831  Inf 0.004083  0.004049 0.077763 0.004728         482 0.070539 0.056017 0.881444 0.048016 0.062539 0.008000 0.070539 0.056017 0.004728       0     7 0.054819703 0.1801618        0.00  0.08
8  chr1:96001606 chr8:33199177  Inf 0.037418  0.008408 0.335390 0.041084         482 0.070539 0.026971 0.912800 0.016661 0.060229 0.010310 0.070539 0.026971 0.041084       0     5 0.160624133 0.2306321        0.04  0.34
[...]
22 chr10:46021459 chr3:33200027  Inf 0.000022  0.000095 0.004952 0.000020         482 0.019710 0.023859 0.956997 0.023293 0.019144 0.000566 0.019710 0.023859 0.000020       0     3 0.003568236 0.1088302    *  0.00*  0.00*
30 chr9:37364972 chr4:19434057  Inf 0.004494 -0.003790 0.975188 0.004238         482 0.087137 0.044606 0.868354 0.044509 0.087040 0.000096 0.087137 0.044606 0.004238       0    19 0.051905558 0.1801618          0.00 0.98

Also, why sometimes D' is =1 when r2 is =0 ? And at other times, D' is =0 and r2 is =0... see lines 6, 22, and 30 for example.

I'm wondering if there is not something odd going on here:

ngsLD/ngsLD.cpp

Line 305 in 7001165

Dp = D / ( D < 0 ? -min(maf[0]*maf[1], (1-maf[0])*(1-maf[1])) : min(maf[0]*(1-maf[1]), (1-maf[0])*maf[1]) );

Also, why is the loglike hard coded as '0.0'?

ngsLD/ngsLD.cpp

Line 348 in 7001165

0.0,

Basically, all the positions that have significant LD have r2 = 0. Not sure if I'm doing something incorrect here.

It might be because of low allele frequencies; did you check sites with higher frequencies?

Also, why sometimes D' is =1 when r2 is =0 ? And at other times, D' is =0 and r2 is =0... see lines 6, 22, and 30 for example.

I'm wondering if there is not something odd going on here:

ngsLD/ngsLD.cpp

Line 305 in 7001165

Dp = D / ( D < 0 ? -min(maf[0]*maf[1], (1-maf[0])*(1-maf[1])) : min(maf[0]*(1-maf[1]), (1-maf[0])*maf[1]) );

I can see what you mean, but it is possible for Dprime to be 1 while r^2 is close to 0 (see e.g. last slide). I quickly doubled checked the formulas and they seem correct (see here), but let me know if you find an error.

Also, why is the doglike hard coded as '0.0'?

ngsLD/ngsLD.cpp

Line 348 in 7001165

0.0,

Outputting the final log likelihood has not been fully implemented yet, so that is just a place holder (for now).

Just noted that pchisq(datalist$chi2, df = 1, lower.tail = TRUE), might be more appropriate with lower.tail = FALSE. By 'low allele frequencies' do you mean that both maf1 and maf2 should be high for significance to be achieved or you mean freq_A and freq_B?

dfdf$freq_A = dfdf$hap00 + dfdf$hap01
dfdf$freq_B = dfdf$hap00 + dfdf$hap10

These frequencies are actually quite high in this data.
0.929461 0.980290

I'm not that familiar with C code.

But, actually, if I look here: https://github.com/fgvieira/ngsLD/blob/afb35a355cc28926c1588ab8a643c0e65941f579/ngsLD.cpp#L330C1-L332C1

Shouldn't the frequency be maf1 mfa2 and 1-maf1 and 1-maf2? This is an attempt to calculate the expected heterozygotes and from the R^2

## r^2 = D^2/(PA*PB*Pa*Pb)
# r2.stat = (D.stat / sqrt(datalist$maf1 * datalist$maf2 * (1-datalist$maf1) * (1-datalist$maf2)))^2

So it seems that maf1 and maf2 are PA, PB, Pa and Pb. But later in the frequency calculation, it uses the genotype frequencies (h00, h01, h10, h11). Maybe I'm misunderstanding something here, but it seems that there is a variable misplaced here. Is that correct?

Edit: No it's the same! so no problem here!

I was curious in looking at the raw beagle file and try to get a sense of the allele frequency in the file.

Basically, I wrote a naive (without model) script that finds the maximum probability for each genotype in each individual. When there is a 0.5 0.5 probability, I just take a random one (I only had a few). I then used the genotypes to get the number of each allele and divide by the total number of alleles (nb of individuals*2).

> all1/all.tot = 0.3568301 
> all2/all.tot = 0.6431699 

But when I look at the MAF from the ngsLD output file, I have "maf2 = 0.026971" for the same position... check line 8 for "chr8:33199177". How could that be?

Can you send a small example beagle file so I can reproduce your results?

I sent the data over email.