cortes-ciriano-lab/SComatic

Code for columns of the single cell level genotype result

lijiang825 opened this issue · 5 comments

Hi,

I was able to install and run SComatic and this is a great tool! I am wondering if you would share a table outlining what each column in the output of the final single cell genotyping step represents? Currently, there are 12 columns, but a subset of them look confusing to me, including "ALT_expected", "Cell_type_expected", "Num_cells_expected", "CB", "Cell_type_observed", "Base_observed", "Num_reads". Thank you very much!

Best,
Li

Hi, I actually have the same issue. Ran the tool, all good, lovely! Really great documentation too, I was able to run it super fast. But now I'm struggling with this output table. What I want to know is for each barcode, which mutations are there. Would be super helpful to get a bit more guidance.

Dear users,
Sorry for the lack of documentation for this output. Please find here a more detailed description of each one of the columns:

| Column | Description |
| --- | --- |
| #CHROM | Chromosome carrying the mutation |
| Start | Start genomic coordinate |
| End | End genomic coordinate |
| REF | Reference allele |
| ALT_expected | Alternative allele as described in the input file (--infile) |
| Cell_type_expected | Cell types harbouring the mutation as described in the input file (--infile) |
| Num_cells_expected | Number of expected cells carrying the mutation as described in the input file (--infile) |
| CB | Unique cell barcode analysed |
| Cell_type_observed | Cell type attributed to the analysed CB according to the input metadata file (--meta) |
| Base_observed | Allele observed in this CB |
| Num_reads | Number of reads carrying the Base_observed |

Let's walk through this table with an example. Using the SComatic example data, we will focus on the variant site chr10-29559501 in the generated file SComatic/example_data/results/SingleCellAlleles/Epithelial_cells.single_cell_genotype.tsv.

#CHROM	Start	End	REF	ALT_expected	Cell_type_expected	Num_cells_expected	CB	Cell_type_observed	Base_observed	Num_reads
chr10	29559501	29559501	A	T	Epithelial_cells	2	AGTCTTTGTGCATCTA	Epithelial_cells	A	5
chr10	29559501	29559501	A	T	Epithelial_cells	2	CCCTCCTAGGCTAGGT	Epithelial_cells	A	1
chr10	29559501	29559501	A	T	Epithelial_cells	2	GGGTCTGTCTTGAGGT	Epithelial_cells	T	2
chr10	29559501	29559501	A	T	Epithelial_cells	2	GTCCTCAAGGCTCATT	Epithelial_cells	T	2
chr10	29559501	29559501	A	T	Epithelial_cells	2	GAGTCCGAGGGTGTTG	Epithelial_cells	A	2

The columns ALT_expected, Cell_type_expected and Num_cells_expected are taken from the --infile (here Example.calling.step2.pass.tsv), so they represent the calls at cell-type resolution.

In contrast, the columns CB, Cell_type_observed, Base_observed and Num_reads report the alleles observed at single-cell resolution when interrogating the bam files.

Each CB can appear in the output file in as many rows as different alleles are found in that cell, although in most cases we only observe one allele per cell (so one row per unique CB). To find the cells harbouring the called mutation, we have to look for those rows (unique CBs) where ALT_expected == Base_observed and Cell_type_expected == Cell_type_observed. In the example above, these are the two CBs with Base_observed T, which matches Num_cells_expected = 2. In general terms, CBs not meeting these conditions can be understood as noise or non-mutated cells.
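As a rough illustration of that filtering rule, here is a minimal Python sketch using only the standard library. It is not part of SComatic: the helper names (`is_mutated`, `read_mutated_rows`) are made up for this example, and the rows are simply the five example rows shown above. It also assumes Cell_type_expected holds a single cell type, so a plain equality check is enough.

```python
import csv
import io

# The five example rows shown above, kept as tab-separated text.
EXAMPLE_TSV = "\n".join([
    "#CHROM\tStart\tEnd\tREF\tALT_expected\tCell_type_expected\tNum_cells_expected\tCB\tCell_type_observed\tBase_observed\tNum_reads",
    "chr10\t29559501\t29559501\tA\tT\tEpithelial_cells\t2\tAGTCTTTGTGCATCTA\tEpithelial_cells\tA\t5",
    "chr10\t29559501\t29559501\tA\tT\tEpithelial_cells\t2\tCCCTCCTAGGCTAGGT\tEpithelial_cells\tA\t1",
    "chr10\t29559501\t29559501\tA\tT\tEpithelial_cells\t2\tGGGTCTGTCTTGAGGT\tEpithelial_cells\tT\t2",
    "chr10\t29559501\t29559501\tA\tT\tEpithelial_cells\t2\tGTCCTCAAGGCTCATT\tEpithelial_cells\tT\t2",
    "chr10\t29559501\t29559501\tA\tT\tEpithelial_cells\t2\tGAGTCCGAGGGTGTTG\tEpithelial_cells\tA\t2",
])


def is_mutated(row):
    """True when the cell shows the expected ALT allele in the expected cell type."""
    return (
        row["Base_observed"] == row["ALT_expected"]
        and row["Cell_type_observed"] == row["Cell_type_expected"]
    )


def read_mutated_rows(handle):
    """Read a single-cell genotype table and keep only the mutated rows."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if is_mutated(row)]


# For a real file, pass open(path, newline="") instead of the StringIO object.
mutated_rows = read_mutated_rows(io.StringIO(EXAMPLE_TSV))
mutated_cbs = sorted({row["CB"] for row in mutated_rows})
print(mutated_cbs)
```

On the example rows this keeps the two barcodes with Base_observed T, in line with Num_cells_expected = 2.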

Thanks,
Fran

Hi Fran,
thank you for the detailed explanation. That makes a lot of sense now. What would be your advice on how to use this info for plotting how many mutations are found in each individual cell?

You could do this by using R or Python.

The basic strategy would be:

  1. Count how many rows per cell meet the conditions ALT_expected == Base_observed and Cell_type_expected == Cell_type_observed. This is the number of mutations per cell.
  2. It is essential to consider the number of callable sites per cell, as it affects the number of mutations that can be detected. To perform this correction, you will need to compute the number of callable sites per cell using this functionality. You can then use these callable sites to compute, for example, the mutation load per cell (# Mutations per cell / # Callable sites per cell; multiply by 10^6 to express it per Mb).
  3. Plot the resulting values using your preferred software. I would ignore those cells with a very low number of callable sites.
  4. Generally, you will see an enrichment of cells at 0. This is due to the low number of callable sites per cell in this type of approach (scRNA-seq) and a low mutation load (depending on the cancer or cell type).
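
Steps 1 and 2 could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the function names are made up, the rows are assumed to be already parsed into dictionaries keyed by the column names above, the second site on chr1 and all callable-site counts are invented values, and the `min_sites` cut-off of 10,000 is an arbitrary placeholder rather than a recommended threshold.

```python
from collections import Counter


def count_mutations_per_cell(rows):
    """Step 1: number of distinct mutated sites per cell barcode."""
    mutated_sites = {
        (row["CB"], row["#CHROM"], row["Start"], row["ALT_expected"])
        for row in rows
        if row["Base_observed"] == row["ALT_expected"]
        and row["Cell_type_observed"] == row["Cell_type_expected"]
    }
    return Counter(site[0] for site in mutated_sites)


def mutation_load_per_mb(mutation_counts, callable_sites, min_sites=10_000):
    """Step 2: mutations per Mb of callable sequence, skipping low-coverage cells."""
    loads = {}
    for cb, n_sites in callable_sites.items():
        if n_sites < min_sites:
            continue  # too few callable sites to give a meaningful estimate
        # Cells without any mutated row get a load of 0.
        loads[cb] = mutation_counts.get(cb, 0) * 1e6 / n_sites
    return loads


def make_row(cb, chrom, start, alt, base, cell_type="Epithelial_cells"):
    """Build a row with only the columns used here (illustration only)."""
    return {
        "#CHROM": chrom, "Start": start, "ALT_expected": alt,
        "Cell_type_expected": cell_type, "CB": cb,
        "Cell_type_observed": cell_type, "Base_observed": base,
    }


# Rows at chr10 follow the example table; the chr1 site is hypothetical.
rows = [
    make_row("AGTCTTTGTGCATCTA", "chr10", "29559501", "T", "A"),
    make_row("GGGTCTGTCTTGAGGT", "chr10", "29559501", "T", "T"),
    make_row("GTCCTCAAGGCTCATT", "chr10", "29559501", "T", "T"),
    make_row("GGGTCTGTCTTGAGGT", "chr1", "1000", "G", "G"),
]

# Hypothetical callable sites per cell; real values come from SComatic.
callable_sites = {
    "AGTCTTTGTGCATCTA": 100_000,
    "GGGTCTGTCTTGAGGT": 200_000,
    "GTCCTCAAGGCTCATT": 50_000,
    "CCCTCCTAGGCTAGGT": 500,
}

counts = count_mutations_per_cell(rows)
loads = mutation_load_per_mb(counts, callable_sites)
for cb, load in sorted(loads.items()):
    print(cb, counts[cb], load)
```

The values in `loads` are what you would then plot (step 3), for example as a histogram. Iterating over the callable-sites table rather than over the mutated rows keeps the cells with zero mutations in the result, which is where the enrichment at 0 from step 4 comes from.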

Cheers,
Fran

Thank you so much! I will attempt to do this today :)