/saturation

:sponge: Estimate sequencing saturation for GEX, VDJ, and ADT data from the 10x Genomics platform.

Primary LanguageRMIT LicenseMIT

saturation.R

DOI

Table of contents:

Introduction

Here is an R script saturation.R for estimating sequencing saturation from a GEX, VDJ, or ADT dataset from the 10x Genomics platform.

The script uses the binomial distribution to downsample the reads and estimate a saturation curve. This can be helpful to determine if a sequencing experiment has enough reads.

10xgenomics.com gives us this formula for sequencing saturation:

Sequencing Saturation = 1 - (n_deduped_reads / n_reads)

Here is my illustration of the relationship between sequencing saturation and reads per unique molecular identifier (UMI):

We can compute reads per UMI from the saturation, and vice versa:

d <- data.frame(sat = seq(0, 1, length.out = 1001))
d$rpu <- 1 / (1 - d$sat)

Learn more about sequencing saturation from the 10xgenomics.com documentation:

File formats:

Getting started

Install the dependencies:

install.packages(
  c("data.table", "ggplot2", "ggtext", "glue", "optparse", "pbapply", "scales", "stringr", "BiocManager")
)
BiocManager::install("rhdf5")

See output for an example of the output files.

Usage examples

GEX

Rscript saturation.R --out output --file molecule_info.h5
Reading molecule_info.h5
Estimating GEX saturation for 519882 barcodes
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Writing output/saturation-gex.tsv
Writing output/total_reads-vs-saturation-gex.pdf

VDJ

Rscript saturation.R --out output/tcr --file all_contig_annotations.csv
Reading all_contig_annotations.csv
INFO: Removing '-1' from the end of each barcode
Estimating VDJ saturation for 20233 barcodes
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Writing output/tcr/saturation-vdj.tsv
Writing output/tcr/total_reads-vs-saturation-vdj.pdf

ADT

Rscript saturation.R --out output --file Batch_1A_ADT.stat.csv.gz
Reading Batch_1A_ADT.stat.csv.gz
Estimating ADT saturation curve for 729372 barcodes
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Writing output/saturation-adt.tsv
Writing output/total_reads-vs-saturation-adt.pdf
Writing output/saturation-adt-feature.tsv
Writing output/histogram-saturation-adt-feature.pdf

Running parallel jobs with rush

Install rush by Wei Shen:

go install github.com/shenwei356/rush

Then make a list of input files and pass it to rush:

ls /project/cellranger_output/*/{molecule_info.h5,all_contig_annotations.csv,*.stat.csv.gz} > files.txt

# Run 16 jobs in parallel, capture outputs from each job in one file
rush -i files.txt -o rush-saturation.txt -j16 'Rscript saturation.R --out out/{/%} --file {}'