Snakemake workflow: TM3'seq Pipeline

Workflow for the TM3'seq manuscript.

The workflow is written using Snakemake. Dependencies are installed using Bioconda.

Overview

This workflow was designed to streamline the analysis of TM3'seq data. However, it can be used to process FASTQ files derived from any other RNA-seq protocol once the samples have been demultiplexed. The details of demultiplexing may vary depending on the protocol used for library preparation.

Starting with FASTQ files, the workflow 1) trims raw reads, 2) aligns them, and 3) counts the number of reads mapping to each gene. The output is a gene counts file that can be imported into standard software for the analysis of RNA-seq data.

Inputs

  • FASTQ files for each sample
  • Reference sequence in FASTA format
  • Gene model in GTF format
  • Configuration file(s) in YAML format
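
For reference, the inputs might be laid out as in the (purely illustrative) listing below; the actual paths are set in the configuration file:

    # Illustrative layout only; point the configuration file at your actual paths
    ls data/
    # sample1.fastq.gz  sample2.fastq.gz  sample3.fastq.gz
    ls reference/
    # genome.fa  genes.gtf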

Outputs

  • combined_gene_counts.tsv - Tab delimited file of gene counts, one row per gene and one column per sample
  • multiqc.html - Quality control summary showing number of reads, trimming, mapping, and counting statistics
  • logs/ - Directory of log files for each job; check here first if you run into errors
  • working/ - Directory containing intermediate files for each job (e.g. BAM files and count files for each sample)
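
As a rough sketch (gene and sample names are placeholders), combined_gene_counts.tsv has one gene identifier column followed by one column of counts per sample:

    # Peek at the first lines of the combined counts table
    head combined_gene_counts.tsv
    # gene_id    sample1    sample2    sample3
    # geneA      <count>    <count>    <count>
    # geneB      <count>    <count>    <count>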

Workflow

  1. FASTQ summary and QC metrics - Use FastQC to determine basic QC metrics from the raw FASTQ files
  2. Trim reads - Use Trimmomatic to trim adapter and low-quality sequence from the ends of reads
  3. Align reads - Use STAR to align reads to the genome, accounting for known splice junctions
  4. Deduplicate (optional) - Remove duplicates using nudup, which uses unique molecular identifiers (UMIs)
  5. Count - Use featureCounts (part of the Subread package) to quantify the number of reads uniquely mapped to each gene
  6. Summarize - Combine the count files and run MultiQC to generate a summary report
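
The exact commands and parameters for each rule are defined in the Snakefile and the configuration file; the single-end sketch below is only meant to illustrate the kind of calls the main steps make (file names and options are illustrative):

    # Illustrative only; the real parameters come from the Snakefile and config
    fastqc sample1.fastq.gz                                                # QC metrics
    trimmomatic SE sample1.fastq.gz sample1.trimmed.fastq.gz \
        ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:20      # trim reads
    STAR --genomeDir star_index --readFilesIn sample1.trimmed.fastq.gz \
        --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate        # align reads
    featureCounts -a genes.gtf -o sample1.counts.txt \
        Aligned.sortedByCoord.out.bam                                      # count reads per gene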

Install prerequisites

  1. Install conda

    • If you have Anaconda installed, you already have conda.
    • Otherwise, install the Miniconda package.
  2. Enable the Bioconda channel

    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge
    
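    You can confirm the resulting channel order with:

    conda config --show channels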

Setup environment and run workflow

  1. Clone workflow into working directory

    git clone <repo> <dir>
    cd <dir>
  2. Input data

    Place demultiplexed fastq.gz files in a data directory

  3. Edit configuration files as needed

    cp config.defaults.yml myconfig.yml
    nano myconfig.yml
    
    # Only if running on a cluster
    cp cluster_config.yml mycluster_config.yml
    nano mycluster_config.yml
  4. Install dependencies into an isolated environment

    conda env create -n <project> --file environment.yml
  5. Activate the environment

    source activate <project>
  6. Execute the workflow

    snakemake --configfile "myconfig.yml" --use-conda 

Common options

  • --configfile "myconfig.yml" - Override defaults using the configuration found in myconfig.yml
  • --use-conda - Use conda to create an environment for each rule, installing and using the exact version of the software required (recommended)
  • --cores - Use at most N cores in parallel (default: 1). If N is omitted, the limit is set to the number of available cores.
  • --cluster - Execute snakemake rules with the given submit command, e.g. qsub. Snakemake compiles jobs into scripts that are submitted to the cluster with the given command, once all input files for a particular job are present. The submit command can be decorated to make it aware of certain job properties (input, output, params, wildcards, log, threads and dependencies), e.g.: snakemake --cluster 'qsub -pe threaded {threads}'.
  • --drmaa - Execute snakemake on a cluster accessed via DRMAA. Snakemake compiles jobs into scripts that are submitted to the cluster with the given command, once all input files for a particular job are present. ARGS can be used to specify options of the underlying cluster system, thereby using the job properties input, output, params, wildcards, log, threads and dependencies, e.g.: --drmaa ' -pe threaded {threads}'. Note that ARGS must be given in quotes and with a leading whitespace.
  • --cluster-config - A JSON or YAML file that defines the wildcards used in --cluster for specific rules, instead of having them specified in the Snakefile. For example, for rule job you may define: { 'job' : { 'time' : '24:00:00' } } to specify the time for rule job. You can specify more than one file. The configuration files are merged with later values overriding earlier ones.
  • --dryrun - Do not execute anything, and display what would be done. If you have a very large workflow, use --dryrun --quiet to just print a summary of the DAG of jobs.

See the Snakemake documentation for a list of all options.

Examples

Dry run of the workflow with five samples

snakemake --configfile "myconfig.yml" --dryrun
Job counts:
        count   jobs
        1       all
        1       combined_counts
        5       count
        5       fastqc
        1       multiqc
        5       star_align
        1       star_genome_index
        5       trimmomatic
        24

Running workflow on a single computer with 4 threads

snakemake --configfile "myconfig.yml" --use-conda --cores 4

Running workflow using SLURM

snakemake \
    --configfile "myconfig.yml" \
    --cluster-config "mycluster_config.yml" \
    --cluster "sbatch --cpus-per-task={cluster.n} --mem={cluster.memory} --time={cluster.time}" \
    --use-conda \
    --cores 100

Running workflow on the Princeton LSI cluster using DRMAA

Note: When using DRMAA you may need to export the DRMAA_LIBRARY_PATH. On the LSI cluster this can be done by running module load slurm.

module load slurm
snakemake \
    --configfile "myconfig.yml" \
    --cluster-config "cetus_cluster_config.yml" \
    --drmaa " --cpus-per-task={cluster.n} --mem={cluster.memory} --qos={cluster.qos} --time={cluster.time}" \
    --use-conda \
    --cores 1000 \
    --output-wait 60
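
If your environment does not set DRMAA_LIBRARY_PATH for you, it can be exported manually before invoking snakemake (the path below is only an example; use the location of the DRMAA library on your system):

    export DRMAA_LIBRARY_PATH=/usr/lib/slurm-drmaa/lib/libdrmaa.so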