Workflow for the TM3'seq manuscript.
The workflow is written using Snakemake. Dependencies are installed using Bioconda.
This workflow was designed to streamline the analysis of TM3'seq data. However, it can be used to process FASTQ files derived from any other RNA-seq protocol once the samples have been demultiplexed. The details of demultiplexing may vary depending on the protocol used for library preparation.

Starting with FASTQ files, the workflow 1) trims raw reads, 2) aligns them, and 3) counts the number of reads mapping to each gene. The output is a gene counts file that can be imported into standard software for the analysis of RNA-seq data.

Inputs:
- FASTQ files for each sample
- Reference sequence in FASTA format
- Gene model in GTF format
- Configuration file(s) in YAML format
Outputs:

- `combined_gene_counts.tsv` - Tab-delimited file of gene counts, one row per gene and one column per sample (format sketched below)
- `multiqc.html` - Quality control summary showing number of reads, trimming, mapping, and counting statistics
- `logs/` - Directory of log files for each job; check here first if you run into errors
- `working/` - Directory containing intermediate files for each job (e.g. BAM files and count files for each sample)
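For orientation, the counts file has a header row of sample names followed by one row per gene. The gene IDs, sample names, and values below are purely illustrative:

```
gene_id     sample_01   sample_02   sample_03
gene0001    112         98          130
gene0002    0           3           1
```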
The workflow performs the following steps (an illustrative command sketch follows the list):

- FASTQ summary and QC metrics - Use FastQC to determine some basic QC metrics from the raw FASTQ files
- Trim reads - Use Trimmomatic to trim adapter and low-quality sequence from the ends of reads
- Align reads - Use STAR to align reads to the genome, accounting for known splice junctions
- Deduplicate (optional) - Remove duplicates using NuDup, which utilizes unique molecular identifiers (UMIs)
- Count - Use featureCounts (part of the Subread package) to quantify the number of reads uniquely mapped to each gene
- Summarize - Combine the count files and run MultiQC to generate a summary report
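As a rough sketch of what the rules run for a single sample (file names and parameters below are hypothetical; the actual values come from the workflow rules and configuration):

```
# QC metrics from the raw reads
fastqc sample.fastq.gz -o working/fastqc/

# Trim adapters and low-quality ends (single-end mode)
trimmomatic SE -phred33 sample.fastq.gz sample.trimmed.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:20

# Splice-aware alignment against a prebuilt STAR index
STAR --genomeDir star_index --readFilesIn sample.trimmed.fastq.gz \
    --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate

# Count reads uniquely assigned to each gene
featureCounts -a genes.gtf -o sample.counts.txt Aligned.sortedByCoord.out.bam
```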
Setup:

- Install conda

- Enable the Bioconda channel

  ```
  conda config --add channels defaults
  conda config --add channels bioconda
  conda config --add channels conda-forge
  ```

- Clone the workflow into a working directory

  ```
  git clone <repo> <dir>
  cd <dir>
  ```
- Place demultiplexed `fastq.gz` files in a `data` directory

- Edit the configuration files as needed (a sketch of a possible `myconfig.yml` appears after this list)

  ```
  cp config.defaults.yml myconfig.yml
  nano myconfig.yml

  # Only if running on a cluster
  cp cluster_config.yml mycluster_config.yml
  nano mycluster_config.yml
  ```
- Install dependencies into an isolated environment

  ```
  conda env create -n <project> --file environment.yml
  ```

- Activate the environment

  ```
  source activate <project>
  ```

- Execute the workflow

  ```
  snakemake --configfile "myconfig.yml" --use-conda
  ```
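As referenced above, `myconfig.yml` points the workflow at your inputs. A minimal sketch, assuming hypothetical key names (consult `config.defaults.yml` for the actual options):

```
# Hypothetical sketch only -- the real keys are defined in config.defaults.yml
reference_fasta: genome/genome.fa   # reference sequence (FASTA)
gene_model_gtf: genome/genes.gtf    # gene model (GTF)
data_dir: data                      # demultiplexed fastq.gz files
```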
Commonly used options:

- `--configfile "myconfig.yml"` - Override defaults using the configuration found in `myconfig.yml`.

- `--use-conda` - Use conda to create an environment for each rule, installing and using the exact version of the software required (recommended).

- `--cores [N]` - Use at most N cores in parallel (default: 1). If N is omitted, the limit is set to the number of available cores.

- `--cluster CMD` - Execute snakemake rules with the given submit command, e.g. `qsub`. Snakemake compiles jobs into scripts that are submitted to the cluster with the given command, once all input files for a particular job are present. The submit command can be decorated to make it aware of certain job properties (input, output, params, wildcards, log, threads and dependencies; see `--cluster-config` below), e.g.: `snakemake --cluster 'qsub -pe threaded {threads}'`.

- `--drmaa [ARGS]` - Execute snakemake on a cluster accessed via DRMAA. Snakemake compiles jobs into scripts that are submitted to the cluster, once all input files for a particular job are present. `ARGS` can be used to specify options of the underlying cluster system, thereby using the job properties input, output, params, wildcards, log, threads and dependencies, e.g.: `--drmaa ' -pe threaded {threads}'`. Note that `ARGS` must be given in quotes and with a leading whitespace.

- `--cluster-config FILE` - A JSON or YAML file that defines the wildcards used in `cluster` for specific rules, instead of having them specified in the Snakefile. For example, for rule `job` you may define `{'job': {'time': '24:00:00'}}` to specify the time for rule `job`. You can specify more than one file; the configuration files are merged, with later values overriding earlier ones.

- `--dryrun` - Do not execute anything; just display what would be done. If you have a very large workflow, use `--dryrun --quiet` to print only a summary of the DAG of jobs.
See the Snakemake documentation for a list of all options.
Test the configuration by performing a dry run:

```
snakemake --configfile "myconfig.yml" --dryrun
```

Example output for a run with five samples:

```
Job counts:
        count   jobs
        1       all
        1       combined_counts
        5       count
        5       fastqc
        1       multiqc
        5       star_align
        1       star_genome_index
        5       trimmomatic
        24
```
Execute the workflow locally using up to 4 cores:

```
snakemake --configfile "myconfig.yml" --use-conda --cores 4
```
Execute the workflow on a SLURM cluster via `sbatch`:

```
snakemake \
    --configfile "myconfig.yml" \
    --cluster-config "mycluster_config.yml" \
    --cluster "sbatch --cpus-per-task={cluster.n} --mem={cluster.memory} --time={cluster.time}" \
    --use-conda \
    --cores 100
```
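The `{cluster.n}`, `{cluster.memory}`, and `{cluster.time}` placeholders are filled in from `mycluster_config.yml`. A minimal sketch, with illustrative values (the rule name `star_align` is taken from the job counts above):

```
# __default__ applies to every rule; per-rule sections override it
__default__:
  n: 1
  memory: 4G
  time: "01:00:00"
star_align:        # e.g. give the alignment rule more resources
  n: 4
  memory: 32G
```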
Running the workflow on the Princeton LSI cluster using DRMAA:

Note: When using DRMAA you may need to export `DRMAA_LIBRARY_PATH`. On the LSI cluster this can be done by running `module load slurm`.
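On other systems, this usually means pointing the variable at the SLURM DRMAA library; the path below is hypothetical:

```
# Adjust to wherever libdrmaa.so is installed on your system
export DRMAA_LIBRARY_PATH=/usr/lib/slurm-drmaa/lib/libdrmaa.so.1
```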
```
module load slurm
snakemake \
    --configfile "myconfig.yml" \
    --cluster-config "cetus_cluster_config.yml" \
    --drmaa " --cpus-per-task={cluster.n} --mem={cluster.memory} --qos={cluster.qos} --time={cluster.time}" \
    --use-conda \
    --cores 1000 \
    --output-wait 60
```