natsuhiko/rasqual

QQ plot from rasqual P-values

Closed this issue · 17 comments

Dear Natsuhiko,

I have a question about the results from my ATAC caQTL analysis:
If I plot the p-values to make a qq plot, I see something like this
[attached image: QQ plot]
where the P-values lie under the diagonal.
I use the covariate and size factor files generated by rasqualTools, plus I add sex and the first 4 PCs from 1000 Genomes to the covariate file.
It seems that overall there is less deflation when I do not add the additional covariates.
Does this seem correct to you? Do you recommend any strategy to improve it?
Thank you very much,

Paola

Sorry, the qq-plot is this:
[attached image: QQ plot]

Hi Paola,

First of all, did you plot log10 P-values or log10 Q-values? I would recommend plotting the Q-values (column 10 of the output file). In addition, it is unlikely that those Q-values follow a uniform distribution. I would also recommend plotting the Q-values against the permuted Q-values, which can be obtained with the '-r' option.
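(For illustration, a minimal sketch of such a plot in R; the file names are placeholders, and it assumes the Q-value sits in column 10 of both the normal and the permuted ('-r') lead-SNP output, as described above.)

```r
# Q-values (column 10) from the normal lead-SNP run and from a permuted run (-r);
# file names are placeholders for your own output files.
obs  <- read.table("lead_snps.txt")[, 10]
perm <- read.table("lead_snps_permuted.txt")[, 10]

# QQ plot on the -log10 scale: sorted observed against sorted permuted Q-values
qqplot(-log10(perm), -log10(obs),
       xlab = "-log10 permuted Q-value",
       ylab = "-log10 observed Q-value")
abline(0, 1, col = "red")  # diagonal: observed matches the permuted null
```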

Best regards,
Natsuhiko

Hi Natsuhiko-
Thank you!
I plotted the -log10(p-values) from the lead variant results (option -t), calculated from column 11 as pchisq(result_lead[,11], 1, lower=F), against a uniform null distribution.
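(As a sketch, that uniform-null comparison could be set up like this in R; 'lead_snps.txt' is a placeholder for the lead-variant output, with the chi-square statistic in column 11 as above.)

```r
# -log10 P-values for the lead variants from the chi-square statistic (column 11)
result_lead <- read.table("lead_snps.txt")              # placeholder file name
pval <- pchisq(result_lead[, 11], df = 1, lower.tail = FALSE)

# QQ plot against the uniform distribution expected under the null
expected <- -log10(ppoints(length(pval)))
observed <- -log10(sort(pval))
plot(expected, observed, xlab = "Expected -log10 P", ylab = "Observed -log10 P")
abline(0, 1, col = "red")
```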

Now I plot column 10 against the permuted one as you suggest and get this:
[attached image: QQ plot of lead Q-values vs. permuted lead Q-values]
Is this how I should do it?

I wondered whether it is expected that they agree from the beginning?
I also plotted the q-values from all SNPs (not just the lead variants) and it produces similar results.
Thank you very much!
Paola

The above plot looks reasonable to me (the lead Q-values against the lead permuted Q-values). It suggests you can achieve much better FDR calibration with the lead permuted Q-values (because points near the origin lie exactly on the diagonal line). I recommend performing genome-wide FDR correction with the lead Q-values, but not with all Q-values (i.e., including non-lead variants).
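(One common way to turn the permuted lead Q-values into an empirical FDR estimate is sketched below in R; this is a generic permutation-based calibration, not necessarily the exact procedure used for RASQUAL in the paper, and the file names are placeholders.)

```r
# Empirical FDR calibration using the permuted lead Q-values as a null.
# Generic permutation-based sketch; file names are placeholders.
obs  <- read.table("lead_snps.txt")[, 10]            # observed lead Q-values
perm <- read.table("lead_snps_permuted.txt")[, 10]   # permuted lead Q-values

# For a given Q-value threshold, estimate FDR as the ratio of null discoveries
# to observed discoveries.
fdr_at <- function(threshold) {
  n_obs  <- sum(obs  <= threshold)
  n_perm <- sum(perm <= threshold)
  if (n_obs == 0) return(NA_real_)
  min(1, n_perm / n_obs)
}

# Example: largest Q-value cutoff with an empirical FDR of at most 10%
# (assumes at least one threshold passes)
efdr     <- sapply(sort(obs), fdr_at)
q_cutoff <- max(sort(obs)[efdr <= 0.10], na.rm = TRUE)
```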

Natsuhiko

Thank you for your answer!
I have one more question about how lead SNPs are chosen. I understand that they are chosen randomly among those with the same, lowest, q-value for a given feature. Is that correct, and why random rather than, for example, based on position or on the lowest p-value?
My only problem with having it picked randomly is that when I compare different datasets or cell types that share the same QTL, the reported lead variant is different...
Thank you again!!
Best,
Paola

If the lead variants have the same q-value, they also have the same p-value. If you want to get the mapping result for all variants, you can just remove the '-t' option and rerun RASQUAL (it may take time though).

Natsuhiko

Thank you,
yes, I've been doing that, but I get identical q-values for (slightly) different p-values or chi-squares, like in this example. I can reproduce the same q-values if I run p.adjust on the p-values, which I think is what you do to get the q-values?
[attached screenshot: example results with tied q-values for different p-values]

Sorry, I don't really understand your problem here. As you mentioned, I use the Benjamini-Hochberg method (equivalent to p.adjust with method="BH") to compute the q-values.
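(For what it's worth, a tiny illustration with made-up p-values of why Benjamini-Hochberg can assign identical q-values to different p-values: the adjustment takes a running minimum of p * n / rank from the largest p downwards, which creates ties.)

```r
# BH adjustment (p.adjust, method = "BH") on made-up p-values:
# 0.010 and 0.012 end up with the same q-value (0.02) despite different p-values,
# because the running minimum from the largest p downwards flattens them.
p <- c(0.001, 0.010, 0.012, 0.040, 0.800)
cbind(p = p, q = p.adjust(p, method = "BH"))
```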

Best regards,
Natsuhiko

My question was about picking the lead variant for a given feature when multiple variants have the same q-value (ties) but different p-values. Why is it random and not the variant with the lowest p-value (or largest chi-square), as in my example? But it's OK, I can always go back to the results for all variants and choose a different lead.
Thank you very much!
Best
Paola

The lead variant is defined by the maximum chi-square value (i.e., the minimum P-value), not by the q-value. With '-t' you should get the lead variant with the minimum P-value. If not, please let me know.

Best regards,
Natsuhiko

That's true, I did not realize that.
I thought they were chosen randomly when they have the same q-value, probably misled by the description of column 21 of the results: "Random location of ties (tie lead SNP; only useful with -t option)".
Thank you very much!

Sorry, just one more unrelated quick question: what --min-coverage-depth filter for fSNPs did you use in your paper?
In the figures you show fSNPs with >20 reads, but I am not sure whether you applied the --min-coverage-depth filter in the calculation.
Thanks again!!!
Best

The option is always applied to your data to filter out fSNPs with low coverage depth (the default is 5% of the mean coverage depth at coding regions). Those SNPs are still used as tested SNPs, but their AS counts are no longer used. This option is useful when there are too many fSNPs with shallow coverage depth due to poor RNA quality. Usually you don't need to tweak this option.
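(A rough back-of-envelope with toy numbers for what the default would mean for a dataset; this is not RASQUAL's internal computation, just an illustration of the 5%-of-mean-coverage rule described above.)

```r
# Toy illustration of the default fSNP filter: 5% of the mean coverage depth.
# These vectors are made-up numbers, not derived from any real dataset.
region_depth <- c(120, 85, 200, 60, 150)   # mean read depth per region (toy)
fsnp_depth   <- c(3, 12, 7, 40, 2)         # read depth at individual fSNPs (toy)

cutoff <- 0.05 * mean(region_depth)        # default threshold: ~6.2 reads here
fsnp_depth >= cutoff                       # fSNPs whose AS counts would still be used
```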

Best regards,
Natsuhiko

Thank you! And in the case of ATAC-seq, would that be 5% of the mean coverage at accessible regions (peaks)?

For ATAC-seq, it is the average coverage depth in the peak regions that you define with -s and -e.

Best regards,
Natsuhiko

ok thank you very much!!