Table of contents:
Here is an R script saturation.R
for estimating sequencing saturation from a
GEX, VDJ, or ADT dataset from the 10x Genomics platform.
The script uses the binomial distribution to downsample the reads and estimate a saturation curve. This can be helpful to determine if a sequencing experiment has enough reads.
10xgenomics.com gives us this formula for sequencing saturation:
Sequencing Saturation = 1 - (n_deduped_reads / n_reads)
Here is my illustration of the relationship between sequencing saturation and reads per unique molecular identifier (UMI):
We can compute reads per UMI from the saturation, and vice versa:
d <- data.frame(sat = seq(0, 1, length.out = 1001))
d$rpu <- 1 / (1 - d$sat)
Learn more about sequencing saturation from the 10xgenomics.com documentation:
- What is sequencing saturation?
- How is sequencing saturation calculated?
- How much sequencing saturation should I aim for?
File formats:
- What is the molecule_info.h5 file? (GEX)
- What is the all_contig_annotations.csv file? (VDJ)
- What is the sample.stat.csv.gz file? (ADT)
Install the dependencies:
install.packages(
c("data.table", "ggplot2", "ggtext", "glue", "optparse", "pbapply", "scales", "stringr", "BiocManager")
)
BiocManager::install("rhdf5")
See output for an example of the output files.
Rscript saturation.R --out output --file molecule_info.h5
Reading molecule_info.h5
Estimating GEX saturation for 519882 barcodes
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Writing output/saturation-gex.tsv
Writing output/total_reads-vs-saturation-gex.pdf
Rscript saturation.R --out output/tcr --file all_contig_annotations.csv
Reading all_contig_annotations.csv
INFO: Removing '-1' from the end of each barcode
Estimating VDJ saturation for 20233 barcodes
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Writing output/tcr/saturation-vdj.tsv
Writing output/tcr/total_reads-vs-saturation-vdj.pdf
Rscript saturation.R --out output --file Batch_1A_ADT.stat.csv.gz
Reading Batch_1A_ADT.stat.csv.gz
Estimating ADT saturation curve for 729372 barcodes
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Writing output/saturation-adt.tsv
Writing output/total_reads-vs-saturation-adt.pdf
Writing output/saturation-adt-feature.tsv
Writing output/histogram-saturation-adt-feature.pdf
Install rush by Wei Shen:
go install github.com/shenwei356/rush
Then make a list of input files and pass it to rush:
ls /project/cellranger_output/*/{molecule_info.h5,all_contig_annotations.csv,*.stat.csv.gz} > files.txt
# Run 16 jobs in parallel, capture outputs from each job in one file
rush -i files.txt -o rush-saturation.txt -j16 'Rscript saturation.R --out out/{/%} --file {}'