sequencing saturation
Closed this issue · 4 comments
Hi,
I can get sequencing saturation curve using cellranger from 10x genomics single cell data. How to get it using simpleaf ?
This is the definition of sequencing saturation (https://kb.10xgenomics.com/hc/en-us/articles/115003646912-How-is-sequencing-saturation-calculated):
How is sequencing saturation calculated?
Question: How is "Sequencing Saturation" calculated?
Answer: The web_summary.html output from cellranger count includes a metric called "Sequencing Saturation". This metric quantifies the fraction of reads originating from an already-observed UMI. More specifically, this is the fraction of confidently mapped, valid cell-barcode, valid UMI reads that are non-unique (match an existing cell-barcode, UMI, gene combination).
The formula for calculating this metric is as follows:
Sequencing Saturation = 1 - (n_deduped_reads / n_reads)
where
n_deduped_reads = Number of unique (valid cell-barcode, valid UMI, gene) combinations among confidently mapped reads.
n_reads = Total number of confidently mapped, valid cell-barcode, valid UMI reads.
Note that the numerator of the fraction is n_deduped_reads, not the non-unique reads that are mentioned in the definition. n_deduped_reads is a degree of uniqueness, not a degree of duplication/saturation. Therefore we take the complement of (n_deduped_reads / n_reads) to measure saturation.
And this is the way to generate the plot:
This plot shows the Sequencing Saturation metric as a function of downsampled sequencing depth in mean reads per cell, up to the observed sequencing depth. Sequencing Saturation is a measure of the observed library complexity, and approaches 1.0 (100%) when all converted mRNA transcripts have been sequenced. The slope of the curve near the endpoint can be interpreted as an upper bound to the benefit to be gained from increasing the sequencing depth beyond this point. The dotted line is drawn at a value reasonably approximating the saturation point.
Could we subsample the RAD file outputed by simpleaf ?
Hi @wangjiawen2013,
The necessary values to compute these plots are in the set of output features of alevin-fry
. I'm cc'ing @ygao61 and @csoneson who have created such plots from the alevin-fry
output before and who can instruct you as to the best way to generate such a plot.
--Rob
Hi @wangjiawen2013,
Below are the steps you can follow to generate the sequence saturation plot:
- 1.) Required Data : featureDump.txt file. You can find it under the
alevin-fry
quant output folder.
Each row in this file contains summarized information for each cell, which will be used to generate the plot. - 2.) Filter Cells with Insufficient Detected Genes:
Filter out the cells(rows) with fewer than 20 detected genes, based on the CorrectedReads column. - 3.) Sort the Data.
Sort the cells in ascending order by the CorrectedReads column to create the correctReadsArray. This array contains the "CorrectedReads" values of the cells, arranged in ascending order. - 4.) Calculate Sequence Saturation for each cell.
Sequence Saturation is calculated as "1 - DedupRate" and stored as an array, like seqSaturationArray. - 5.) Downsample and calculate the accumulated average array.
We performed down-sampling by iteratively slicing subsets of the correctReadsArray and seqSaturationArray with increasing sizes (multiples of 10). For example, the first subset includes the first 10 elements (0,10), the second subset includes the first 20 elements (0,20), and so on. Calculate and store the average values of these accumulated subsets. - 6.) Generate a scatterplot.
The accumulated average of the correctReadsArray will be the x-axis for "Mean Reads per Cell" while the accumulated average of the seqSaturationArray will be the y-axis to represent "Sequencing Saturation". You can then use the two accumulated average arrays to generate the scatterplot.
--Yuan
Got it. Thanks!