natsuhiko/rasqual

AS counts and permutation test

gouthamatla opened this issue · 4 comments

Hi All,

I am using RASQUAL on RNA-Seq data. Its been doing an amazing job so far.

I have couple of questions:

I saw in a paper that they split the genes in to "Genes with fSNPs" i.e likely to have allelic counts and "Genes without fSNPs" and use different options. For "Genes without fSNPs" is it recommended to use "--population-only" ? Is it very important to do it that way ? Also, how does it effect results if we use default options on all genes ?

Regarding "--random-permutation", when I use this option, I often noticed that the results doesnt make much sense compared with the ones ran without random permutations. I would like to get some insights into this.

Final question is: How should we filter RASQUAL output ?
RASQUAL gives a column "Log_10 Benjamini-Hochberg Q-value". Can we use this column to filter the results ? I am using all the default values to run RASQUAL.

Hi,

You don't need to use '--population-only' option as far as you are happy with combining both between-individual and allele specific signatures to map QTLs. This option allows you to map QTLs only with total read counts which is the conventional strategy.

In terms of random permutation test, could you explain some more about what you have encountered? Did you observe more significant statistics with the permutation test?

In order to filter QTLs, what exactly do you want to obtain from RASQUAL output? Do you want to detect eQTL genes or to detect which SNP is likely to be causal for a gene? The column 10 (BH Q-value) gives multiple testing corrected P-values for SNPs tested in a cis-regulatory window for one gene. You need to detect eQTL genes, you need further apply a multiple testing correction to control statistical significance.

Thanks for the reply.

I am sharing two genes from my data here.

Did you observe more significant statistics with the permutation test?

Here is the output from rasqual with and without random permutations:

tabix TMED6_500kb.vcf.gz chr16:68877149-69885712 | rasqual -y data/your.Y.bin -k data/your.K.bin -n 86 --no-posterior-update -j 1 -m 3 -l 2221 -s 69377149,69381691,69383428,69385444 -e 69377543,69381839,69383554,69385712 -t -f TMED6 -r

Output:
TMED6 rs144242147 chr16 68937606 C T 0.197674 0.189072 0.999274 -1.3162706378 11.0799341176 0.397003 0.001835 0.632068 3.327736 5.959015 3 1614 4 3 68937606 -794.558567 0 0.947118 0.986792

tabix TMED6_500kb.vcf.gz chr16:68877149-69885712 | rasqual -y data/your.Y.bin -k data/your.K.bin -n 86 --no-posterior-update -j 1 -m 3 -l 2221 -s 69377149,69381691,69383428,69385444 -e 69377543,69381839,69383554,69385712 -t -f TMED6

Output
TMED6 rs153060 chr16 69377553 A G 0.325581 0.855769 1.000000 -24.7054131464 123.2114400158 0.213925 0.001140 0.597929 7.236464 5.824290 3 1614 6 4 69377553 -794.558567 0 0.947118 0.991744

With out permutations, I get the variant rs153060 which is highly significant. Its also known (published) that rs153060 is associated with the change in expression of TMED6. I am wondering why its not significance if I use random permutations option.

In order to filter QTLs, what exactly do you want to obtain from RASQUAL output? Do you want to detect eQTL genes or to detect which SNP is likely to be causal for a gene? The column 10 (BH Q-value) gives multiple testing corrected P-values for SNPs tested in a cis-regulatory window for one gene. You need to detect eQTL genes, you need further apply a multiple testing correction to control statistical significance.

Sorry, I am not from genetics background. I understood that, for example in above case, rasqual output is telling that rs153060 is associated with the expression of TMED6.

I did not get this part

You need to detect eQTL genes

I was thinking that TMED6 is the eQTL gene for rs153060. Do you mean, rs153060 could also be associated with other genes and I need to do further do multiple testing correction to get most likely eQTL gene for rs153060 ? In that case, how do I proceed ?

Hi,

The random permutation basically shuffles the sample labels so that the link between genotype and phenotype is broken. This is because you didn't observe a significant eQTL SNP. This function is designed for controlling statistical significance under the null hypothesis.

In terms of detecting eQTL genes among all genes you've tested, you need to adjust for multiple testing. However, you would test multiple SNPs in a cis regulatory window for each gene and the number of SNPs is usually different. Some gene is located in a SNP dense region, while other gene is located in a SNP sparse region. As a consequence, you will observe higher significance for genes in a SNP dense region (because the more you test, the higher the statical significance you observe). The Q-value on the 10th column of the output adjusts for the multiple testing of SNP density.

Now you have the Q-value of 10^-24 for TMED6. I don't doubt that the gene is an eQTL (Q-values is significantly below 0.05). However, I suppose you have tested other genes as well (for example 19,999 other genes). Then you need to correct for another multiple testing for multiple genes genome-wide. In theory, you will observe 1000 genes with Q-values below 0.05, even if all the 20,000 genes are NOT eQTLs. Naively you could multiply the number of genes tested by the Q-value (Bonferroni procedure), you would use, for example, Benjamini-Hochberg Procedure for more powerful multiple testing correction. You can use p.adjust implemented in R to do this for a vector of Q-values for all genes tested.

Thanks for taking time to explain.