Output of GetAllCallableSites.py for mutational burdens
YiqunCao opened this issue · 4 comments
Hi,
I would like to calculate the mutational burden for each cell type, but I am not sure which number in the output of GetAllCallableSites.py gives the callable sites. I attached a few lines from the output file sample_coverage_cell_count.report.tsv here (~2,700 cells):
What does each column mean? Also, could you please describe in a bit more detail how to calculate the mutational burden from the output numbers? Thank you!
Dear user,
Thanks for using SComatic and thanks for bringing up this topic.
Regarding the "callable sites" question, it is important to keep in mind that the report gives two types of values for each coverage level (Cov column):
- one based on the number of cells with at least one read at a given position (NC column),
- another based on the depth or number of reads at each site (DP column).
To clarify this concept and the file format, it is easiest to walk through an example (using the values shown in your screenshot):
For the B_memory cell type shown in your attached figure, when Cov == 5:
- DP shows the number of sites with 5 reads = 472874
- NC shows the number of sites with 5 cells = 2242427
There is one such site count for each coverage value (up to 150 by default). From these values, you can get the number of callable sites for whatever minimum coverage (Cov) you want. For instance, to get the callable sites covered by at least 10 cells, sum the NC column over all rows of that cell type with Cov >= 10.
In our manuscript, we computed the mutation load at cell-type resolution using a minimum coverage of Cov >= 5 and the following formula:
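As a rough illustration of that summation, here is a minimal Python sketch. It assumes the report is a tab-separated file with a header row and columns named `Cell_type`, `Cov`, `DP` and `NC`; the `Cell_type` name and the `callable_sites` helper are hypothetical, so please check them against the header of your own report file. The counts in the toy table are made up and only show the shape of the calculation.

```python
import csv
import io
from collections import defaultdict


def callable_sites(report_lines, min_cov, count_column="NC"):
    """Sum `count_column` over all rows with Cov >= min_cov, per cell type.

    `report_lines` is any iterable of text lines (an open file works).
    Column names are assumptions; adjust them to the real report header.
    """
    totals = defaultdict(int)
    reader = csv.DictReader(report_lines, delimiter="\t")
    for row in reader:
        if int(row["Cov"]) >= min_cov:
            totals[row["Cell_type"]] += int(row[count_column])
    return dict(totals)


# Toy report with made-up counts (not taken from any real sample).
toy_report = io.StringIO(
    "Cell_type\tCov\tDP\tNC\n"
    "B_memory\t4\t100\t200\n"
    "B_memory\t5\t50\t80\n"
    "B_memory\t6\t30\t40\n"
    "T_cell\t5\t10\t20\n"
)

# Cell-based callable sites with a minimum coverage of 5.
sites = callable_sites(toy_report, min_cov=5)
print(sites)
```

For a real report you would pass an open file handle instead of the toy table, and set `count_column="DP"` if you prefer the read-depth-based count.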
(# somatic mutations in the cell type Z) / (# callable sites in the cell type Z)
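The ratio itself is a single division. The small sketch below assumes you already have the two counts for a cell type; the `mutation_burden` helper and the example numbers are hypothetical and not taken from any real sample. Scaling to mutations per megabase is an optional convention, not something the formula requires.

```python
def mutation_burden(n_mutations, n_callable_sites):
    """Return somatic mutations per callable site for one cell type."""
    if n_callable_sites <= 0:
        raise ValueError("the number of callable sites must be positive")
    return n_mutations / n_callable_sites


# Made-up example: 12 PASS mutations over 2,000,000 callable sites.
burden = mutation_burden(12, 2_000_000)
print(f"{burden:.2e} mutations per callable site")
print(f"{burden * 1e6:.1f} mutations per Mb of callable sequence")
```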
I hope this helps,
Fran
Hi Fran,
Thank you for the detailed and very helpful explanation! Regarding the "# somatic mutations in the cell type Z", may I confirm that I can just count the number of rows in the file sample.calling.step2.pass.tsv for each cell type without any more filtering?
thanks,
Elaine
Yes, as long as you only count the PASS mutations.
Hi Fran,
Just wanted to clarify what this means:
count the number of rows in the file sample.calling.step2.pass.tsv for each cell type without any more filtering
Does it mean to count the number of non-NA rows per celltype column in the sample.calling.step2.pass.tsv file?
Thanks!
Alex