These are the scripts used for CellGenIT for uniform processing of scRNA-seq - both 10X and quite a few other types (see below for supported platforms). All listed methods use STAR aligner to align reads to the reference genome.
STAR
of version 2.7.9a or above is recommended (2.7.10a is the latest and greatest, as of August'22). The newest update includes the ability to correctly process multi-mapping reads, and adds many important options and bug fixes.
In order to use settings that closely mimic those of Cell Ranger
v4 or above (see explanations below, particularly --clipAdapterType CellRanger4
option), STAR
needs to be re-compiled from source with make STAR CXXFLAGS_SIMD="-msse4.2"
(see this issue for more info). If you get the Illegal instruction error, that's what you need to do.
We also have a Dockerfile which builds with specific versions of all the tools used. Once the container is built you can do cat /versions.txt
and it will show the versions of the tools in the container.
The issues of reference genome and annotation used for mouse and human scRNA-seq experiments are described here in some detail.
Briefly,
- there are several standard annotation versions used by
Cell Ranger
; They are referred to as versions1.2.0
,2.1.0
,3.0.0
, and2020-A
; - the references are filtered to remove pseudogenes and small RNAs; exact filtering scripts are available here;
- number of genes in
Cell Ranger
filtered human reference is about 35k; in full human annotation there are about 60k.
Once you have downloaded the needed fasta and GTF files, and applied the necessary filtering (see the 10x genomics link above), run the following command using 16 CPUs/64 Gbs of RAM:
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir STAR --genomeFastaFiles $FA --sjdbGTFfile $GTF
STARsolo
reference is not different from a normal STAR
reference - STARsolo
workflow is invoked using the options listed below. Thus, it can be run on both Cell Ranger
filtered and un-filtered index. We use the filtered reference to match the Cell Ranger
output as closely as possible.
All CellGenIT pre-made STAR
references are located in /nfs/cellgeni/STAR/
. Barcode whitelist files are located in /nfs/cellgeni/STAR/whitelists
.
By default, all processing is done using 16 CPUs and 128 Gb of RAM. Using the latest settings, you should be able to process most 10x experiments with 64 Gb of RAM even when you need a sorted BAM output. Farm typically has ~8 Gb of RAM per core, so 8 CPUs/64 Gb RAM, or 16 CPUs/128 Gb RAM is probably optimal.
Full scripts with the latest settings are available in /scripts
(there are several scripts according to 10x chemistry version; e.g. starsolo_3p_v3.sh
should be used for v3 of 3' 10x, while starsolo_5p_v2.sh
should be used for v2 of 5'. The scripts contain many options that frequently change; some of which will be explained below. In general, commands are tuned in such way that the results will be very close to those of Cell Ranger
v4 and above.
Before running, barcode whitelists need to be downloaded from here.
Below are the explanations for some of the options (note that 5' experiments always use 737K-august-2016.txt
barcode file):
--runDirPerm All_RWX
allows a directory readable by all users, which becomes an issue when sharing results on Farm;--soloCBwhitelist $BC --soloBarcodeReadLength 0 --soloUMIlen $UMILEN --soloStrand $STR
are settings that change with the used 10x chemistry:
10X VERSION | BC | UMILEN | STR |
---|---|---|---|
3' v1 | 737K-april-2014_rc.txt | 10 | Forward |
3' v2 | 737K-august-2016.txt | 10 | Forward |
3' v3, v3.1 | 3M-february-2018.txt | 12 | Forward |
5' v1.1, v2 | 737K-august-2016.txt | 10 | Reverse |
5' v3 | 737K-august-2016.txt | 12 | Reverse |
multiome | 737K-arc-v1.txt | 12 | Forward |
--soloUMIdedup 1MM_CR --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30
are options that define UMI collapsing, barcode collapsing, and read clipping algorithms that are closest to ones used byCell Ranger
;--soloCellFilter EmptyDrops_CR
specifies the cell filtering algorithm used in EmptyDrops, which is the default algorithm in later versions ofCell Ranger
;--soloFeatures Gene GeneFull Velocyto
output conventional (exon-only) UMI counts, as well as exon+intron UMI counts (analog ofCell Ranger
premrna option), as well as matrices preprocessed forVelocyto
;--soloMultiMappers EM
is to count multimappers (on by default in v3.0+ of these scripts; does not influence the main output, but creates an additional matrix in/raw
subdir ofGene
andGeneFull
);--readFilesCommand zcat
is used if your input fastq files are gzipped;--outReadsUnmapped Fastx
to output the unmapped reads that can be used for metagenomic analysis;- options grouped as
$SORTEDBAM
should be used if you need a genomic bam file; otherwise, use$NOBAM
.
Actual STAR
command being run (note the order of read files passed to --readFilesIn
):
STAR --runThreadN $CPUS --genomeDir $REF --readFilesIn $R2 $R1 --runDirPerm All_RWX $GZIP $BAM \
--soloType CB_UMI_Simple --soloCBwhitelist $BC --soloBarcodeReadLength 0 \
--soloCBlen $CBLEN --soloUMIstart $((CBLEN+1)) --soloUMIlen $UMILEN --soloStrand $STRAND \
--soloUMIdedup 1MM_CR --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR \
--soloCellFilter EmptyDrops_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30 \
--soloFeatures Gene GeneFull Velocyto --soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx \
--soloMultiMappers EM --outReadsUnmapped Fastx
Although some 3' 10X datasets are sequenced with a "long" R1, usually they are aligned as single-end, since it would take a very long R1 to accommodate for barcode, UMI, TSO + adapter, and a poly-A tail. On the other hand, 5' 10X datasets, especially ones sequenced with 2x150 bp paired-end reads, could be trimmed and aligned as paired-end (this is what Cell Ranger
does, too). Effective STARsolo
command for paired-end processing is as follows (note the 5' clipping flag, read input order, and flipped soloStrand
because of the changed read order):
STAR --runThreadN $CPUS --genomeDir $REF --readFilesIn $R1 $R2 --runDirPerm All_RWX $GZIP $BAM \
--soloBarcodeMate 1 --clip5pNbases 39 0 \
--soloType CB_UMI_Simple --soloCBwhitelist $BC --soloCBstart 1 --soloCBlen $CBLEN \
--soloUMIstart $((CBLEN+1)) --soloUMIlen $UMILEN --soloStrand Forward \
--soloUMIdedup 1MM_CR --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR \
--soloCellFilter EmptyDrops_CR --outFilterScoreMin 30 \
--soloFeatures Gene GeneFull Velocyto --soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx \
--soloMultiMappers EM --outReadsUnmapped Fastx
For plate-based methods that don't use UMIs (such as SMART-Seq and SMART-Seq2), STARsolo
can be used as well. Fastq files for these methods usually come as separate, paired-end files; all of these should be listed in a manifest file - plain text, tab-separated file containing three columns per line: 1) full path to R1; 2) full path to R2; 3) cell name or ID.
Example of a script used to process Smart-seq2 data can be found in /scripts/starsolo_ss2.sh
. Key parameters that could be adjusted are --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3
; the higher they are, the less permissive is the alignment (we use default settings usually). Lower values can help you "rescue" a larger proportion of reads with high adapter content (see below for adapter trimming). Actual STAR
command being run:
STAR --runThreadN $CPUS --genomeDir $REF --runDirPerm All_RWX --readFilesCommand zcat $SORTEDBAM \
--soloType SmartSeq --readFilesManifest ../$TAG.manifest.tsv \
--soloUMIdedup Exact --soloStrand Unstranded \
--soloFeatures Gene GeneFull --soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx
Often, SMART-seq2 reads can benefit from trimming adapters, which can be turned on using --clip3pAdapterSeq <3' adapter sequence>
option. Alternatively, /scripts/bbduk.sh
can be used to trim adapters from reads prior to the alignment and quantification (this is the preferred option).
Default approach used by Cell Ranger
(and STARsolo
scripts above) is to discard all reads that map to multiple genomic locations with equal mapping quality. This approach creates a bias in gene expression estimation. Pseudocount-based methods correctly quantify multimapping reads, but generate false counts due to pseudo-alignment errors. These issues are described in good detail here.
If you would like to process multimappers, add the following options: --soloMultiMappers Uniform EM
(or simply --soloMultiMappers EM
; this is on by default in v3.0+ of these scripts). This will generate an extra matrix in the /raw
output folders. There will be non-integer numbers in the matrix because of split reads. If the downstream processing requires integers, you can round with a tool of your liking (e.g. awk
).
As of STAR
v2.7.10a, multimapper counting still does not work for SMART-seq2 or bulk RNA-seq processing.
STARsolo is very flexible and can be used with almost any scRNA-seq method, provided you know the library structure - i.e. where cell barcodes, UMIs, and biological parts of the read are located in the sequencing fragment or reads. A great source of information about scRNA-seq library structures is this page.
Currently, our scripts directory provides dedicated scripts for
- Drop-seq;
- inDrops;
- Microwell-seq;
- STRT-seq.
Please contact CellGenIT
if you need to process an unusual dataset.
If you've used these scripts to process multiple 10X samples, you can get a quick look at the results by copying solo_QC.sh
script from this repo to the directory with STARsolo
output folders, and running
./solo_QC.sh | column -t
The script is designed for 10X or other droplet-based methods; the output will make a lot less sense for SMART-seq2.