Interpreting dndscv results
loukesio opened this issue · 7 comments
Dear Inigo,
My apologies for sending it again; I find it a bit hard to interpret the results of dndscv.
I am sequencing microbial populations over time, and I want to
- calculate dn/ds ratio for the genes
- find genes with significantly positive and negative dn/ds
- find the global dn/ds
- find the global and local mutation rates
I have an example short in size file in here: https://1drv.ms/u/s!ArOzUuixE-mggdZ2rRefV3zp_VmIlQ?e=pp8Tnt
and a screenshot of how my results look.
Here is the rda file that is needed to run the dndscv https://1drv.ms/u/s!ArOzUuixE-mggcxtAwZYZvLwSMNjBg?e=HFypvu
My questions are:
- To calculate dn/ds, I have to divide the n_mis/n_syn column. Is that right?
- To find significant genes with positive dn/ds I use this function signif_genes = sel_cv[sel_cv$qallsubs_cv<0.1, c("gene_name","qallsubs_cv")]. How could I find significant genes with negative dn/ds ?
- If i press print(dndsout$globaldnds) I do not understand what the global dn/ds value is. I might be stupid. I see a table but I can not interpret based on tutorials what it the global dn/ds value.
- If I press print(dndsout$nbreg$theta) I am taking really high values. I am doing something wrong, most probably. You have a comment about it in the tutorial.
Can I calculate the local mutation rate with your package? thank you for your time. I would appreciate any answer or guidance.
code:
_I am running the dndscv function _
dndsout = dndscv(sample, refdb="NC_012660.1.rda", cv=NULL) # run the function
sel_cv = dndsout$sel_cv %>% # I am trying to keep genes that have some mutations
filter_at(vars(n_syn,n_mis,n_non,n_spl), any_vars(.>0))
signif_genes = sel_cv[sel_cv$qallsubs_cv<0.1, c("gene_name","qallsubs_cv")] # here you can also call significant genes based on the global q values of all mutations
print(dndsout$globaldnds) # here you find global dN/Ds estimates
print(dndsout$nbreg$theta) # here you print the mutation rate
Dear Loukas,
Thanks for your questions. I should update the tutorials to explain these questions better.
To answer your questions:
- The gene-level dN/dS ratios are already calculated in the sel_cv table: wmis_cv, wnon_cv and wspl_cv are the point estimates of the dN/dS ratios for missense, nonsense and splice site mutations, respectively. (The ratios for nonsense and splice site mutations will be identical unless you use a different option in dndscv, see ?dndscv, although using a single estimate for both is preferred by default). Remember that in very sparse datasets, the point estimates have large uncertainty associated to them. You can calculate confidence intervals for gene-level dN/dS ratios using the geneci function in the package. But using the p-values associated to the dN/dS ratios may be preferable for most applications (enabling multiple testing correction).
- The p-values and q-values in the table are two-sided neutrality tests. That is, they test whether dN/dS is significantly different from 1. To calculate one-sided p-values, you can use the code I showed in this question. I did not include this feature in the package since negative selection is almost undetectable in cancer genomes, but I should consider implementing improved one sided tests. For now, I hope that the code above helps.
- print(dndsout$globaldnds) shows the table of maximum likelihood estimates for the exome-wide dN/dS ratios. The mle column (first column) contains the maximum likelihood estimates (i.e. the dN/dS ratios). The cilow and cihigh columns contain the 95% confidence intervals for these estimates. You will see that the table provides separate estimates for missense, nonsense and splice site mutations, and a global (wall) estimate (which is effectively a weighted average across mutation types and tends to be close to wmis, as missense mutations are much more numerous than nonsense and splice site mutations).
- Could you share the global dN/dS ratios that you get and your theta value? Very high theta values can be common in very sparse datasets, where the number of synonymous mutations per gene is so low that the distribution of mutations across genes could be essentially Poisson (notice that as theta tends to infinity, the negative binomial model for the variation of the mutation rate across genes approaches a Poisson model). This does not necessarily have to be a problem. Very low theta values (<<1) are more problematic as they tell you that the density of synonymous mutations varies greatly across genes, which is often a sign of problems with the dataset, at least in human somatic data.
- You can get more information on the mutation rates estimated in a dataset looking at these two outputs: dndsout$mle_submodel shows the rates estimated for the substitution model, that is the estimated rates of each substitution type (first column) and the confidence intervals around it (second and third columns). dndsout$genemuts provides a table with the expected estimated density of synonymous, missense, nonsense and splice site mutations in a given gene under neutrality (these are the basis for the dN/dS ratios per gene that you can find in the sel_cv table).
I hope this helps!
Inigo
Dear Inigo,
Thank you for your time! I really appreciate. Forgive me in advance for my stupid questions
1. What does it mean the point estimates of the dN/dS ratios? I mean, in the first row of my previous screenshot, I have three missence mutations and wmis_cv is 1734.9091. What does this 1734.9091 mean? I am a bit confused, sorry. To find the actual dn/ds ratio do I need to divide any column? (I am so sorry for asking again)
2. thank you :)
3&4. This is a screenshot of the global dn/ds.
As a global dn/ds i ll use the wall which 2.082524 and my theta values θ=927.537
5. thanks a lot :)
Dear Loukas,
Thanks. 1734.9091 is the estimated dN/dS ratio for the gene. Can you share the dndsout$genemuts table for PFLU4201? It may be easier to explain this with an example.
Inigo
Hi Loukas,
Essentially, dN/dS ratios are ratios of the non-synonymous to synonymous mutation rates of a gene, corrected for sequence composition and given a substitution model. If you only had information on this gene, seeing 3 missense mutations in the gene and 0 synonymous mutations could be consistent with a dN/dS ratio approaching infinite, but with enormously wide confidence intervals (spanning the value of 1).
However, dNdScv takes advantage of the mutations observed across all genes in the genome to fit both the substitution model and a negative binomial model for the variation of mutation rates across genes. When theta is very high, as it is in your case, the dNdScv model essentially approaches a Poisson model, where mutations are assumed to occur randomly across the genome but considering the different mutability of each trinucleotide (= the substitution model). Given its sparsity, what dNdScv is doing in your dataset is approximately the following:
- Calculate the mutation rate of each trinucleotide substitution (e.g. the number of AAA>ACA mutations observed divided by the number of AAA trinucleotides in the exome, repeating this for each of the 192 possible mutations in a trinucleotide context). This is approximately what the maximum likelihood fitting of the substitution model does (output in dndsout$mle_submodel).
- Given the sequence of a gene, you can count the number of times each trinucleotide is observed in a gene, multiply that by the mutation rate per bp of that trinucleotide and sum across all sites to obtain the expected number of mutations under neutrality. You can do this separately for synonymous, missense and nonsense sites in a gene. This calculation tell us that, given the sparsity of synonymous mutations in your dataset, in the gene PFLU4201 shown in your table above, we only expected 0.000809 synonymous mutations and 0.001731 missense mutations by chance under neutrality.
- In reality, you observe 3 missense mutations, so the maximum-likelihood estimate for the dN/dS ratio for missense mutations in this gene is 3/0.001731 ~ 1733. Of course, there is great uncertainty around this estimate. If we were using a simple Poisson model, observing 3 mutations when 0.001731 are expected by chance would yield confidence intervals for dN/dS ~ (357, 5065) with a p-value against the null hypothesis of dN/dS=1 lower than 1e-9 (to run this you can use
poisson.test(3, 0.001731)
). dNdScv does not use a simple Poisson test because it accounts for the variation of the mutation rate across genes fitting a negative binomial model, but since theta is very high in your case, using a Poisson model in the example is not far from what dNdScv is doing (and in fact the p-value estimated by dNdScv is also <1e-9).
I hope this makes sense. Essentially, the method is telling you that given the extremely low density of mutations in your dataset, seeing 3 missense mutations in the same gene is highly significant and suggestive of positive selection (you are seeing a rate or density of non-synonymous mutations 1700 times higher than expected by chance, with large confidence intervals around this dN/dS ratio as it is based on 3 mutations).
Of course, you need to make sure that this result makes sense and it is correct, and not due to a recurrent artefact in your mutation calls or to counting the same mutation three times, for example (e.g. inputting the mutations observed in a population in 3 different time points as 3 independent events, rather than as a single mutation).
Best wishes,
Inigo
Dear Inigo,
I think I got it thanks to your detailed answer. I really appreciate
with best wishes,
Loukas
Glad it helped. I am closing this issue now, but please reopen it if anything does not make sense.