natsuhiko/rasqual

How can reference allele bias be calculated per SNP

kwcurrin opened this issue · 6 comments

Hi Natsuhiko,

Could you explain how reference allele bias can be calculated per SNP? I understand how reference allele bias can be calculated genome-wide per individual by taking the average reference allele ratio across all SNPs per individual. However, I don't understand how this can be done individually per SNP. We don't know if the reference allele bias is real or technical for a given SNP unless we remove technical bias with a program like WASP. Estimating the bias between individuals for a given SNP also seems problematic because all individuals should have allelic imbalance if it is real.

Thanks,

Kevin

Hi Kevin,

As you know RASQUAL estimates the QTL effect (Pi value) jointly from the total read counts (between individuals) and allele specific (AS) counts (within each individual). In an ideal situation (with no bias in AS counts), the Pi estimates from the two data sources should be identical. However, in reality, we know there are a lot of biases affecting the Pi estimate from AS counts. The reason why we share the Pi value between the two data sources is to stabilise Pi estimation.

The reference allele bias is the one that mostly confounds the Pi estimate, since we tend to map more reads with the reference allele than the alternative. If the Pi estimate is skewed by AS counts towards the reference allele, there will be a discrepancy between QTL signals estimated only from total read counts and only from AS counts. The Phi value in RASQUAL simply captures the balance between these signals. Intuitively, it is a residual (fake) QTL signal only from AS counts. Note that the Phi is shared across all SNPs, but not a single SNP (as far as you used multiple feature SNPs).

Best regards,
Natsuhiko

Hi Natsuhiko,

That helps me understand the model a lot better. That is very clever!

I had two followup questions to make sure I understand things correctly:

  1. When modeling only allele-specific counts with --as-only, is the between individual signal still used for estimating pi even though the between signal counts aren't used in the overall QTL call? If not, how is reference allele bias estimated with --as-only?
  2. The phi estimate is shared across SNPs for the same feature, but these SNPs can have different reference allele biases. Is this because the pi from each SNP's AS counts is compared separately to the same between-individual count pi for the feature?

Thanks,

Kevin

Sorry, there was a typo in question 1, I meant phi when I wrote pi:

  1. When modeling only allele-specific counts with --as-only, is the between individual signal still used for estimating phi even though the between signal
    counts aren't used in the overall QTL call? If not, how is reference allele bias estimated with --as-only?

Hi Kevin,

  1. If you set --as-only, it means that RASAUL only uses the AS counts, but not total counts (between-individual signal is completely ignored). So, if there is only one feature SNP which is also the regulatory SNP (or in perfect LD with the fSNP), it is technically impossible to estimate Phi. Although prior probabilities on Pi and Phi may distinguish the true QTL signal from reference bias. It is also not a problem if you use both total and AS counts even when rSNP=fSNP (as I mentioned).

In reality, there are multiple fSNPs within a feature region and they are not in perfect LD with the rSNP, we can still estimate both Pi and Phi only from AS counts, because Phi captures allelic imbalance specifically toward reference alleles, whereas Pi captures the allelic imbalance only seen at heterozygous individuals at the rSNP. Intuitively, all AS counts at heterozygous fSNPs are skewed toward the reference allele, it is likely to be the reference bias. Whereas, it is a true QTL signal, only AS counts at heterozygous fSNPs that links to the heterozygous rSNP is skewed toward either the reference or alternative allele. Of course, it is much harder to estimate Pi only with AS counts, I would say.

  1. Phi captures average reference bias across all fSNPs. As you mentioned, the degree and direction of reference bias are different between fSNPs. In an extreme case in which there are two fSNPs in perfect LD, but the reference allele of a SNP is the alternative allele for the other SNP (correlation=-1 between dosages of the two SNPs), the reference bias is canceled out and Phi~0.5. In reality, the residual of reference biases (cancelling each other) is down to Phi.

Best regards,
Natsuhiko

Hi Natsuhiko,

Thanks! That answers both of my questions.

However, I am nervous about reference allele bias estimates with --as-only for small feature regions. Our ATAC-seq peaks aren't very wide (median width ~500bp, upper quartile ~800bp), so I don't think there will be enough SNPs within many of the peaks that aren't in high LD to give good estimates. For example, of my 1,011 significant SNP-peak pairs at FDR<5%, 212 of the peaks have 3 or fewer fSNPs and 522 (over half) have 5 or fewer fSNPs. This isn't even considering that many of these SNPs may be highly correlated.

I realize that this wouldn't be as much of an issue for larger features, like genes or broad histone marks, but it makes me nervous for narrow open chromatin peaks.

Thanks,

Kevin

Hi Kevin,

I think you can check the distribution of Pi for QTLs with small number of fSNPs and rSNP in the target feature. If the distribution is skewed toward 0 (Pi<0.5 for potential reference bias), you might need to incorporate a stringent filtering for those QTLs.

Best regards,
Natsuhiko