Code for columns of the single cell level genotype result
lijiang825 opened this issue · 5 comments
Hi,
I was able to install and run SComatic and this is a great tool! I am wondering if you would share a table outlining what each column in the output of the final single cell genotyping step represents? Currently, there are 12 columns, but a subset of them look confusing to me, including "ALT_expected", "Cell_type_expected", "Num_cells_expected", "CB", "Cell_type_observed", "Base_observed", "Num_reads". Thank you very much!
Best,
Li
Hi, I actually have the same issue. Ran the tool, all good, lovely! Really great documentation too, I was able to run it super fast. But now I'm struggling with this output table. What I want to know is for each barcode, which mutations are there. Would be super helpful to get a bit more guidance.
Dear users,
Sorry for the lack of documentation for this output. Please find here a more detailed description of each one of the columns:
Column | Description |
---|---|
#CHROM | Chromosome carrying the mutation |
Start | Start genomic coordinate |
End | End genomic coordinate |
REF | Reference allele |
ALT_expected | Alternative allele as described in the input file (--infile ) |
Cell_type_expected | Cell types harbouring the mutation as described in the input file (--infile ) |
Num_cells_expected | Number of expected cells carrying the mutation as described in the input file (--infile ) |
CB | Unique cell barcode analysed |
Cell_type_observed | Cell type attributed to the analysed CB according to the input metadata file (--meta ) |
Base_observed | Allele observed in this CB |
Num_reads | Number of reads carrying the Base_observed |
Let's understand this table with an example. Looking at our SComatic example data, we will focus on the variant site chr10-29559501 and the SComatic/example_data/results/SingleCellAlleles/Epithelial_cells.single_cell_genotype.tsv file generated.
#CHROM | Start | End | REF | ALT_expected | Cell_type_expected | Num_cells_expected | CB | Cell_type_observed | Base_observed | Num_reads |
---|---|---|---|---|---|---|---|---|---|---|
chr10 | 29559501 | 29559501 | A | T | Epithelial_cells | 2 | AGTCTTTGTGCATCTA | Epithelial_cells | A | 5 |
chr10 | 29559501 | 29559501 | A | T | Epithelial_cells | 2 | CCCTCCTAGGCTAGGT | Epithelial_cells | A | 1 |
chr10 | 29559501 | 29559501 | A | T | Epithelial_cells | 2 | GGGTCTGTCTTGAGGT | Epithelial_cells | T | 2 |
chr10 | 29559501 | 29559501 | A | T | Epithelial_cells | 2 | GTCCTCAAGGCTCATT | Epithelial_cells | T | 2 |
chr10 | 29559501 | 29559501 | A | T | Epithelial_cells | 2 | GAGTCCGAGGGTGTTG | Epithelial_cells | A | 2 |
The columns `ALT_expected`, `Cell_type_expected` and `Num_cells_expected` correspond to the values observed in the `--infile` `Example.calling.step2.pass.tsv`, so they represent the calls at cell type resolution.
In contrast, the columns `CB`, `Cell_type_observed`, `Base_observed` and `Num_reads` correspond to the allele observations at unique cell resolution when interrogating the bam files.
Each CB can appear in the output file in as many rows as there are different alleles found in that cell, although in most cases we observed only one allele per cell (so one row per unique CB). To find the cells harbouring the called mutation, we have to look for those rows (unique CBs) where `ALT_expected == Base_observed` and `Cell_type_expected == Cell_type_observed`. In general terms, CBs that do not meet these conditions can be understood as noise or non-mutated cells.
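As a minimal sketch of that filter, the snippet below applies the two conditions to the five example rows shown above, using only the Python standard library. The helper name `mutated_rows` is hypothetical and not part of SComatic, and the plain equality test on the cell type assumes that `Cell_type_expected` holds a single cell type, as it does in this example.

```python
import csv
import io

# The example rows for chr10-29559501 quoted above, rebuilt as tab-separated text.
HEADER = ("#CHROM Start End REF ALT_expected Cell_type_expected "
          "Num_cells_expected CB Cell_type_observed Base_observed Num_reads")
ROWS = [
    "chr10 29559501 29559501 A T Epithelial_cells 2 AGTCTTTGTGCATCTA Epithelial_cells A 5",
    "chr10 29559501 29559501 A T Epithelial_cells 2 CCCTCCTAGGCTAGGT Epithelial_cells A 1",
    "chr10 29559501 29559501 A T Epithelial_cells 2 GGGTCTGTCTTGAGGT Epithelial_cells T 2",
    "chr10 29559501 29559501 A T Epithelial_cells 2 GTCCTCAAGGCTCATT Epithelial_cells T 2",
    "chr10 29559501 29559501 A T Epithelial_cells 2 GAGTCCGAGGGTGTTG Epithelial_cells A 2",
]
example_tsv = "\n".join("\t".join(line.split()) for line in [HEADER, *ROWS])


def mutated_rows(handle):
    """Yield rows whose observed allele and cell type match the expected call.

    Hypothetical helper: assumes a single cell type in Cell_type_expected.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        same_allele = row["ALT_expected"] == row["Base_observed"]
        same_cell_type = row["Cell_type_expected"] == row["Cell_type_observed"]
        if same_allele and same_cell_type:
            yield row


# For a real run, pass an open file handle instead of the in-memory example.
mutated_barcodes = [row["CB"] for row in mutated_rows(io.StringIO(example_tsv))]
print(mutated_barcodes)
```

For this site the filter keeps the two CBs where T was observed, which agrees with the `Num_cells_expected` value of 2.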
Thanks,
Fran
Hi Fran,
thank you for the detailed explanation. That makes a lot of sense now. What would be your advice on how to use this info for plotting how many mutations are found in each individual cell?
You could do this by using R or Python.
The basic strategy would be:
- Count how many rows per cell meet the conditions `ALT_expected == Base_observed` and `Cell_type_expected == Cell_type_observed`. This is the number of mutations per cell.
- It is essential to consider the number of callable sites per cell, as it will affect the number of mutations detected. To perform this correction, you will need to compute the number of callable sites per cell using this functionality. You can use these callable sites to compute, for example, the mutation load per cell and MB (# Mutations per cell / # Callable sites per cell).
- Plot the resulting values using your preferred software. I would ignore those cells with a very low number of callable sites.
- Generally, you will see an enrichment of cells at 0. This is due to the low number of callable sites per cell in this type of approach (scRNA-seq) and a low mutation load (depending on the cancer or cell type).
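
The first two steps of this strategy could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names, the toy barcodes, the callable-site numbers and the `min_callable` cut-off are all hypothetical, and the callable sites per cell are assumed to have already been loaded into a dictionary keyed by CB. It does not read the actual SComatic callable-sites output.

```python
from collections import Counter


def count_mutations_per_cell(rows):
    """Count rows per CB where the observed allele and cell type match the call."""
    counts = Counter()
    for row in rows:
        if (row["ALT_expected"] == row["Base_observed"]
                and row["Cell_type_expected"] == row["Cell_type_observed"]):
            counts[row["CB"]] += 1
    return counts


def mutation_load_per_mb(counts, callable_sites, min_callable=1):
    """Return mutations per megabase of callable sites for each CB.

    Cells with fewer than `min_callable` callable sites are skipped;
    cells without any matching row get a load of 0.
    """
    load = {}
    for cb, n_callable in callable_sites.items():
        if n_callable < min_callable:
            continue
        load[cb] = counts.get(cb, 0) * 1_000_000 / n_callable
    return load


# Toy rows with made-up barcodes, for illustration only.
toy_rows = [
    {"CB": "CELL_A", "ALT_expected": "T", "Base_observed": "T",
     "Cell_type_expected": "Epithelial_cells", "Cell_type_observed": "Epithelial_cells"},
    {"CB": "CELL_A", "ALT_expected": "G", "Base_observed": "G",
     "Cell_type_expected": "Epithelial_cells", "Cell_type_observed": "Epithelial_cells"},
    {"CB": "CELL_B", "ALT_expected": "T", "Base_observed": "A",
     "Cell_type_expected": "Epithelial_cells", "Cell_type_observed": "Epithelial_cells"},
    {"CB": "CELL_C", "ALT_expected": "T", "Base_observed": "T",
     "Cell_type_expected": "Epithelial_cells", "Cell_type_observed": "Epithelial_cells"},
]
# Made-up callable-site counts per CB (hypothetical values).
toy_callable_sites = {"CELL_A": 20000, "CELL_B": 10000, "CELL_C": 50}

counts = count_mutations_per_cell(toy_rows)
load = mutation_load_per_mb(counts, toy_callable_sites, min_callable=1000)
print(counts)
print(load)
```

In this toy input, `CELL_C` is dropped because it falls below the chosen cut-off, and `CELL_B` is kept with a load of 0. The resulting dictionary can then be passed to whichever plotting library you prefer.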
Cheers,
Fran
Thank you so much! I will attempt to do this today :)