natsuhiko/rasqual

Correlated peaks not sharing QTLs

kwcurrin opened this issue · 12 comments

Hi Natsuhiko,

We have a very strong chromatin QTL that looks great between variant 1:109817590 and peak11858 in our data. After looking in the browser, we found three nearby peaks that are strongly correlated with peak11858 (peaks 11859-61). However, RASQUAL doesn't identify any significant associations for these 3 additional peaks, including with variant 1:109817590. We are trying to figure out what could be causing this and would be grateful for your input. Things I have tried so far are below:

I began by making sure that these peaks are correlated after normalizing for library size and adjusting for the first 4 PCs (which is what I do for the QTL analysis). I then calculated the correlations:
Pearson correlation (r) of peaks adjusted for library size and first 4 PCs
peaks 11858 and 11859:
0.7953069

peaks 11858 and 11860:
0.7622177

peaks 11858 and 11861:
0.7843578

So they are very strongly correlated even after all the transformations.
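For reference, the adjustment was along these lines (a rough sketch only; the object names below are placeholders, not my actual script):

```r
# norm.counts: peaks-by-samples matrix of library-size-normalized counts
# PC1..PC4:    first four principal components across samples
adj <- function(peak) resid(lm(norm.counts[peak, ] ~ PC1 + PC2 + PC3 + PC4))
cor(adj("peak11858"), adj("peak11859"))   # Pearson r of the adjusted counts
```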

I then regressed each of these peak counts against the genotype of 1:109817590 using negative binomial regression (so no allelic imbalance), using the glm.nb function in R. I normalized the peak counts by library size and used the first 4 PCs as covariates in the regression. A sketch of the call is below, followed by the per-peak results:
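(The object names here are placeholders, not my exact script.)

```r
library(MASS)

# peak.norm: library-size-normalized, rounded counts for one peak across samples
# geno:      genotype dosage at 1:109817590
# PC1..PC4:  first four principal components
fit <- glm.nb(peak.norm ~ geno + PC1 + PC2 + PC3 + PC4)
summary(fit)$coefficients["geno", ]   # genotype beta and p-value
```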
Peak11858:
beta: 0.70
p-value: < 2e-16

peak 11859:
beta: 0.32
p-value: 9.65e-09

peak 11860:
beta: 0.26
p-value: 8.67e-08

peak 11861:
beta: 0.44
p-value: 8.21e-11

All are very significant, with 11858 (the peak that RASQUAL identified as having a significant cQTL with 1:109817590) being the strongest by p-value and beta. However, the RASQUAL p-values are weak for everything except 11858 (all of these are raw p-values, not adjusted for multiple testing):
peak11858: 8.529386e-11
peak11859: 0.03805979
peak11860: 0.1320015
peak11861: 0.0252746

I next ran RASQUAL with these 4 peaks using --population-only to see what impact allelic imbalance has on the results. Variant 1:109817590 is no longer the lead variant for any of the 4 peaks; the p-values are below:
peak11858: 0.00209815
peak11859: 0.1014721
peak11860: 0.3730409
peak11861: 0.02556151

Removing allelic imbalance greatly reduced the significance of the association between 1:109817590 and peak11858, although the --population-only p-value for peak11858 is still much stronger than those for the other 3 peaks.

The new lead variant for peak11858 under --population-only is
1:109814880
p-value: 0.0007529742

However, variant 1:109814880 is not strongly associated with the other three peaks. The p-values are below:
peak11859: 0.08443492
peak11860: 0.2146224
peak11861: 0.1164564

Do you have any ideas about why these 4 peaks don't share cQTL variants given that they are highly correlated? I'm also curious why RASQUAL run in --population-only mode differs so much from the glm.nb function in R.

Thanks for any help you can provide!

Kevin

Hi Kevin,

Rasqual assumes the raw feature count is linear with genotype dosage, such that

E[y] = a + b x

where y is the raw feature count and x is the genotype. glm.nb, on the other hand, assumes the genotype effect on the feature count is multiplicative, such that

log(E[y]) = a + b x.

The overdispersion assumptions also differ slightly between Rasqual and glm.nb. This is why you would get different results.

In addition, we introduce an offset k in the model, which accounts for the library size. Did you fit glm.nb with the same offset k?
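For example, with the log link the offset could be supplied roughly as below (an illustration only; y, x and k stand for your own counts, genotypes and size factors):

```r
library(MASS)
# y: raw feature counts, x: genotype dosage, k: per-sample size factors
fit <- glm.nb(y ~ x + offset(log(k)))
```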

Best regards,
Natsuhiko

Hi Natsuhiko,

Thanks for the explanation!

I use the same library size offsets for the glm.nb function as I do for RASQUAL (DESeq2 size factors). For glm.nb, I do this by just getting the normalized counts:
c.n <- counts(dds, normalized = TRUE)

and then rounding the result so that they are integers. For RASQUAL, I pass these size factors as the appropriate offsets.

I tried rerunning glm.nb with link="identity" to more closely match the RASQUAL model. With the exception of peak11860, the identity link gives less significant p-values than the log link, but the identity-link p-values are still very significant for all peaks. A sketch of the two calls is below, followed by the per-peak results:
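(Again with placeholder object names, not my exact script.)

```r
library(MASS)
# peak.norm: normalized, rounded counts for one peak; geno: genotype dosage;
# PC1..PC4:  covariates. The identity link can fail to converge if the linear
# predictor goes negative, so it may need starting values.
fit.id  <- glm.nb(peak.norm ~ geno + PC1 + PC2 + PC3 + PC4, link = "identity")
fit.log <- glm.nb(peak.norm ~ geno + PC1 + PC2 + PC3 + PC4, link = "log")
summary(fit.id)$coefficients["geno", ]    # identity-link beta and p-value
summary(fit.log)$coefficients["geno", ]   # log-link beta and p-value
```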
peak11858:
identity link:
beta: 441.0617
p-value: < 2e-16
log link:
beta: 0.7025221
p-value: < 2e-16

peak11859:
identity link:
beta: 64.83557
p-value: 3.07e-06
log link:
beta: 0.3232209
p-value: 9.65e-09

peak11860:
identity link:
beta: 170.4512
p-value:1.63e-08
log link:
beta: 0.2644081
p-value: 8.67e-08

peak11861:
identity link:
beta: 54.73061
p-value: 3.13e-07
log link:
beta: 0.4350428
p-value: 8.21e-11

Is the difference in dispersion handling between RASQUAL and glm.nb large enough to account for these big differences in p-values even when the identity link is used?

Thanks!

Kevin

Hi Kevin,

Can you work out the Pi value from the glm.nb with identity link? It should be

Pi = (a+2b)/(a+b)

where "a" and "b" are obtained from the glm.nb (E[y]=a+b x). The Pi value should be compatible with Rasqual Pi value (on column 12). If they are similar, then the problem is down to the overdispersion estimation. Glm.nb assumes constant overdispersion for different genotype groups, but Rasqual assumes the larger the mean of y the smaller the overdispersion is (which makes sense for many sequencing data).

Best regards,
Natsuhiko

Hi Natsuhiko,

I tried this for one peak:
peak11858 pi:
(307.8323+2*441.0617)/(307.8323+441.0617)
= 1.588951

Am I supposed to multiply the result by 0.5?
0.5*1.588951
= 0.7944754

The RASQUAL pi when --population-only is used is 0.685397.

Thanks,

Kevin

Hi Kevin,

Sorry, you are right. It should be

Pi = (a+2b)/(2(a+b)).
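For example, plugging in your identity-link estimates for peak11858 from above:

```r
a <- 307.8323                 # intercept from glm.nb with identity link (peak11858)
b <- 441.0617                 # genotype effect
(a + 2 * b) / (2 * (a + b))   # = 0.7944754, matching your value
```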

Best regards,
Natsuhiko

No problem at all!

So it looks like the glm.nb pi (0.7944754) is a lot higher than the RASQUAL pi (0.685397). Do you have any ideas about why this is? (I could have also run one or the other program incorrectly).

Hi Kevin,

What happens if you don't fit any principal components as covariates, for either Rasqual or glm.nb? Rasqual uses a two-step optimisation: first it fits only the covariates to update the offsets, and then it fits the genotype to map QTLs. Fitting the covariates first may cancel out part of the genotype effect.
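Roughly, the difference is like the following toy comparison in glm.nb terms (an analogy only, not Rasqual's actual code; y, x and the PCs stand for your own data):

```r
library(MASS)
# Joint fit (what glm.nb does): genotype and covariates estimated together
fit.joint <- glm.nb(y ~ x + PC1 + PC2 + PC3 + PC4)

# Two-step analogue: fit covariates first, fold the fitted means into the offset,
# then fit the genotype alone. If x correlates with the PCs, part of its effect
# is absorbed in step 1.
step1   <- glm.nb(y ~ PC1 + PC2 + PC3 + PC4)
fit.two <- glm.nb(y ~ x + offset(log(fitted(step1))))
```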

Best regards,
Natsuhiko

Hi Natsuhiko,

It looks like not including covariates only slightly changes pi calculations for glm.nb:
glm.nb pi calculations:
With PCs used as covariates:
peak11858: 0.7944754
peak11859: 0.6576503
peak11860: 0.6305391
peak11861: 0.7054491

Without covariates:
peak11858: 0.8043158
peak11859: 0.6715795
peak11860: 0.6256997
peak11861: 0.7066314

However, removing covariates from RASQUAL increases the pi values for all but one peak, and the change is quite large (0.04-0.05):
RASQUAL pi values:
With covariates:
peak11858 0.69
peak11859 0.59
peak11860 0.54
peak11861 0.60

Without covariates:
peak11858 0.73
peak11859 0.64
peak11860 0.54
peak11861 0.64

The pi values without covariates are much larger for glm.nb than for RASQUAL. Does this have to do with how RASQUAL adjusts for library size?

Thanks!

Kevin

Hi Kevin,

Glm.nb fits the covariates and genotype at the same time, whereas Rasqual fits the covariates first, which may reduce power if genotype and covariates are correlated. What sample size do you have? Genotype is more likely to correlate with covariates when the sample size is small.

The prior distribution on Pi in Rasqual could also be contributing. We use a relatively strong prior (centred around 0.5) on Pi to reduce false positive calls.

It's difficult to conclude which model would be telling you the truth, though...

Best regards,
Natsuhiko

Hi Natsuhiko,

My sample size is small (n=20). However, I wonder why RASQUAL reports lower pi values even when there are no covariates. For example, peak11861 has a pi of 0.64 from RASQUAL but a pi of 0.71 from glm.nb when just genotypes are used with no covariates. Does the prior distribution impact the actual value of pi, or just the resulting p-value?

Thanks,

Kevin

Hi Kevin,

When the Pi estimate moves away from 0.5, the prior pulls it back towards 0.5 unless there is strong support from the data, because according to our prior knowledge Pi should mostly be distributed around 0.5. With your modest sample size, strong data support is usually not expected, so the prior distribution dominates the parameter estimation. As a result, Pi ends up closer to 0.5 and the P-value goes up (statistical significance goes down).

You can play with the shape parameters (alpha and beta) of the beta distribution on Pi by setting the "-ABPI" option of the rasqual command. For example, "-ABPI 1.0" makes the prior distribution on Pi non-informative (uniform). However, if you choose a non-informative or nearly non-informative prior (-ABPI close to 1.0), rasqual's parameter inference becomes unstable and takes longer to converge, especially when the sample size is small.
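For illustration only (the shape values below are arbitrary, not Rasqual's defaults), you can see how the shape parameter concentrates the prior around 0.5:

```r
# Beta(s, s) prior on Pi: s = 1 is uniform (-ABPI 1.0); larger s pulls Pi towards 0.5
curve(dbeta(x, 20, 20), 0, 1, xlab = "Pi", ylab = "prior density")   # arbitrary strong prior
curve(dbeta(x, 1, 1), 0, 1, add = TRUE, lty = 2)                     # uniform prior
```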

Best regards,
Natsuhiko

Hi Natsuhiko,

Thanks for the explanation! That makes sense. I will keep the prior distribution at the default (centred around 0.5).

Thanks again for all of your help with this! I will close the issue.

Kevin