how to find genes with signature of purifying selection from dNdSCV results
shaghayeghsoudi opened this issue · 3 comments
I am new in cancer genomics and just started working with your dNdSCV package. I found it very straightforward and user friendly but I have a question which I am a bit confused and I am bothering you by email to ask for your help.
I very much like to find genes with signature of purifying selection.
I just do not understand which column of “sel_cv” should I use to calculate that? Or which column of sel_cv infer to that?
Even when I look at the example of breast cancer coming with the package the majority of “wimps_cv” column are 0 NOT 1 if we assume the majority of the genome is under neutral selection.
For the that example dataset I did this:
dim(sel_cv[sel_cv$wmis_cv> 0.25 & sel_cv$wmis_cv<0.75,] = 244
And ended up with 244 genes. Does that mean 244 genes are under negative selection?
I really really appreciate if you could just help me to solve this issue for myself, I am struggling with this for a few days.
Hello,
Thank you for your interest in dndscv.
Detecting genes under purifying selection in cancer is difficult. Because the density of somatic mutations is low in cancer genomes, typically on the order of 1-10 mutations per million bases, you need very large datasets (typically thousands of samples) to have enough mutations per gene to be able to detect a depletion of non-synonymous mutations in specific genes (see this figure and the corresponding paper for a power analysis). In a small dataset (e.g. tens of cancer exomes), by chance you would expect most genes to have zero somatic mutations in them, even under neutrality. This explains your observation of genes with dN/dS=0 but with non-significant p-values (note that dN/dS ratios in the output table are only point estimates, to calculate confidence intervals you can use the geneci function in the package).
In addition, with large datasets it can be easy to find false signals of apparent negative selection when not carefully accounting for the variation of the mutation rates across genes, for germline SNP contamination (a common problem), for mutation calling biases and for mutational signatures (e.g. not using dN/dS models with trinucleotide substitution models). In general, careful studies of very large datasets have found signals of negative selection in cancer to be very weak (not zero, however), and I am yet to see convincing evidence of negative selection in individual genes in cancer (although I expect true cases to emerge -as well as more reports on false signals- as datasets grow larger). You can read more about this in the papers below:
https://www.sciencedirect.com/science/article/pii/S0092867417311364
https://www.nature.com/articles/ng.3987
https://www.nature.com/articles/s41588-019-0532-6
Now, having said that, let me answer your question about how to look for negative selection with dndscv. The p-values and q-values that you see in dndsout$sel_cv are two-sided neutrality tests. This means that they inform about deviations away from dN/dS=1, in either direction, and so they do not distinguish between negative and positive selection. To look for negative selection, you need to select those genes with wmis_cv<1 (i.e. dN/dS for missense mutations <1) and look at the corresponding p-value for missense mutations (pmis_cv). You should not trust the two-sided q-values calculated by dndscv, as they can be inflated by genes under positive selection. Instead, you need to calculate one-sided q-values for negative selection.
This is how you can do it using the example dataset provided in the package (qmis_negsel are the q-values for negative selection):
library("dndscv")
data("dataset_simbreast", package="dndscv")
dndsout = dndscv(mutations)
dndsout$sel_cv$pmis_negsel = dndsout$sel_cv$pmis_cv / 2 # Initialising the one-sided vector of p-values for missense mutations
dndsout$sel_cv$pmis_negsel[dndsout$sel_cv$wmis_cv>1] = 1 # Setting one sided p-values for genes with dN/dS>1 to 1
dndsout$sel_cv$qmis_negsel = p.adjust(dndsout$sel_cv$pmis_negsel, method="BH") # FDR adjustment (q-values)
head(dndsout$sel_cv[order(dndsout$sel_cv$qmis_negsel),]) # Looking at the most significant genes for negative selection
I hope this helps,
Inigo
Thanks so much Inigo, very helpful
I am just closing the issue as I got the answer of my question