RNA-seq analysis pipeline

Description
Installation and usage (local machine)
Installation and usage (HPC cluster)
- Installation
- Usage
Directed Acyclic Graph of jobs
References 📗
Citation

Description

A Snakemake pipeline for the analysis of messenger RNA-seq data. It processes mRNA-seq fastq files and delivers both raw and normalised/scaled count tables. This pipeline also outputs a QC report per fastq file and a .bam mapping file to use with a genome browser for instance.
This pipeline can process single or paired-end data and is mostly suited for Illumina sequencing data.

Description

This pipeline analyses the raw RNA-seq data and produces two files containing the raw and normalized counts.

The raw fastq files will be trimmed for adaptors and quality checked with fastp.
The genome sequence FASTA file will be used for the mapping step of the trimmed reads using STAR.
A GTF annotation file will be used to obtain the raw counts using subread featureCounts.
The raw counts will be scaled by a custom R function that implements the DESeq2 median of ratios method to generate the scaled ("normalized") counts.
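
To make the scaling step concrete, here is a minimal Python sketch of the median of ratios idea. It is not the pipeline's R function: the names `size_factors` and `scale_counts` and the toy counts are assumptions made for illustration only. Each sample gets a size factor equal to the median, over genes, of the ratio between its count and the gene's geometric mean across samples. Raw counts are then divided by that factor.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts (one per sample, same order).
    Hypothetical helper for illustration; not the pipeline's own R code.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        # A gene with a zero count has no finite log geometric mean: skip it.
        if any(c <= 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Raises statistics.StatisticsError if every gene contains a zero.
    return [math.exp(median(r)) for r in log_ratios]

def scale_counts(counts):
    """Divide every raw count by the size factor of its sample."""
    factors = size_factors(counts)
    return {g: [c / f for c, f in zip(row, factors)] for g, row in counts.items()}

# Toy data: sample 2 was sequenced twice as deeply as sample 1.
raw = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [5, 10],
    "geneD": [0, 3],  # ignored when computing size factors
}
factors = size_factors(raw)
scaled = scale_counts(raw)
```

In this toy example, the scaled counts of geneA, geneB and geneC become identical in both samples, which is the intended effect of removing the sequencing depth difference.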

Input files

RNA-seq fastq files as listed in the config/samples.tsv file. Specify a sample name (e.g. "Sample_A") in the sample column and the paths to the forward read (fq1) and to the reverse read (fq2). If you have single-end reads, leave the fq2 column empty.
A genomic reference in FASTA format. For instance, a fasta file containing the 12 chromosomes of tomato (Solanum lycopersicum).
A genome annotation file in the GTF format. You can convert a GFF annotation file format into GTF with the gffread program from Cufflinks: gffread my.gff3 -T -o my.gtf. ⚠️ for featureCounts to work, the feature in the GTF file should be exon while the meta-feature has to be transcript_id.
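
As an illustration of the config/samples.tsv layout described above, the following Python sketch reads a sample sheet and flags each sample as single- or paired-end. It assumes a tab-separated file with a header row naming the sample, fq1 and fq2 columns. The function name `read_samples` and the file paths are hypothetical, and this is not how the Snakefile itself reads the file.

```python
import csv
import io

def read_samples(handle):
    """Return {sample: {"fq1": str, "fq2": str or None, "paired": bool}}.

    Assumes a tab-separated sheet with a 'sample', 'fq1', 'fq2' header row.
    """
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # An empty or missing fq2 cell means single-end reads.
        fq2 = (row.get("fq2") or "").strip()
        samples[row["sample"]] = {
            "fq1": row["fq1"].strip(),
            "fq2": fq2 or None,
            "paired": bool(fq2),
        }
    return samples

# Hypothetical sample sheet: one paired-end and one single-end sample.
example = (
    "sample\tfq1\tfq2\n"
    "Sample_A\tfastq/Sample_A_R1.fastq.gz\tfastq/Sample_A_R2.fastq.gz\n"
    "Sample_B\tfastq/Sample_B.fastq.gz\t\n"
)
samples = read_samples(io.StringIO(example))
```

With a real file, `read_samples(open("config/samples.tsv"))` would be the equivalent call.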

Below is an example of a GTF file format. ⚠️ a real GTF file does not have column names (seqname, source, etc.). Remove all non-data rows.

seqname	source	feature	start	end	score	strand	frame	attributes
SL4.0ch01	maker_ITAG	CDS	279	743	.	+	0	transcript_id "Solyc01g004000.1.1"; gene_id "gene:Solyc01g004000.1"; gene_name "Solyc01g004000.1";
SL4.0ch01	maker_ITAG	exon	1173	1616	.	+	.	transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1";
SL4.0ch01	maker_ITAG	exon	3793	3971	.	+	.	transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1";
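
A quick way to check that a GTF file meets the featureCounts requirement (exon features carrying a transcript_id attribute) is sketched below in Python. The helper names `parse_gtf_line` and `exons_per_transcript` are assumptions for illustration and are not part of the pipeline. The three data rows are the ones shown above, without the column-name row.

```python
import re

# Matches attribute pairs such as: transcript_id "Solyc01g004002.1.1";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')

def parse_gtf_line(line):
    """Split one GTF data row into its main fields and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    return {
        "seqname": fields[0],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attributes": dict(ATTRIBUTE.findall(fields[8])),
    }

def exons_per_transcript(lines):
    """Count exon rows per transcript_id; fail if an exon lacks the attribute."""
    counts = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        record = parse_gtf_line(line)
        if record["feature"] != "exon":
            continue
        transcript = record["attributes"].get("transcript_id")
        if transcript is None:
            raise ValueError("exon record without a transcript_id attribute")
        counts[transcript] = counts.get(transcript, 0) + 1
    return counts

gtf_lines = [
    'SL4.0ch01\tmaker_ITAG\tCDS\t279\t743\t.\t+\t0\ttranscript_id "Solyc01g004000.1.1"; gene_id "gene:Solyc01g004000.1"; gene_name "Solyc01g004000.1";',
    'SL4.0ch01\tmaker_ITAG\texon\t1173\t1616\t.\t+\t.\ttranscript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1";',
    'SL4.0ch01\tmaker_ITAG\texon\t3793\t3971\t.\t+\t.\ttranscript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1";',
]
exon_counts = exons_per_transcript(gtf_lines)
```

Here only the two exon rows are counted; the CDS row is skipped because its feature is not exon.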

Output files

A table of raw counts called raw_counts.txt: this table can be used to perform a differential gene expression analysis with DESeq2.
A table of DESeq2-normalised counts called scaled_counts.tsv: this table can be used to perform an Exploratory Data Analysis with a PCA, heatmaps, sample clustering, etc.
fastp QC reports: one per fastq file.
bam files: one per fastq file (or pair of fastq files).

Prerequisites: what you should know before using this pipeline

Some command of the Unix Shell to connect to a remote server where you will execute the pipeline. You can find a good tutorial from the Software Carpentry Foundation here and another one from Berlin Bioinformatics here.
Some command of the Unix Shell to transfer datasets to and from a remote server (to transfer sequencing files and retrieve the results/ folder). The Berlin Bioinformatics Unix beginner guide (available here) should be sufficient for that (check the wget and scp commands).
An understanding of the steps of a canonical RNA-Seq analysis (trimming, alignment, etc.). You can find some info here.

Content of this GitHub repository

Snakefile: a master file that contains the desired outputs and the rules to generate them from the input files.
config/samples.tsv: a file containing the sample names and the paths to the forward reads and, for paired-end data, the reverse reads. This file has to be adapted to your sample names before running the pipeline.
config/config.yaml: the configuration file that makes the Snakefile adaptable to any input files, genome and rule parameters.
config/refs/: a folder containing
- a genomic reference in fasta format. The S_lycopersicum_chromosomes.4.00.chrom1.fa is placed for testing purposes.
- a GTF annotation file. The ITAG4.0_gene_models.sub.gtf for testing purposes.
.fastq/: a (hidden) folder containing subsetted paired-end fastq files used to test the pipeline locally. They were generated using Seqtk: seqtk sample -s100 <inputfile> 250000 > <output file>. For a real analysis, this folder should contain the fastq files of the paired-end RNA-seq data you want to analyse.
envs/: a folder containing the environments needed for the pipeline:
- The environment.yaml is used by the conda package manager to create a working environment (see below).
- The Dockerfile is used to build the Docker image by referring to the environment.yaml (see below).

Installation and usage (local machine)

Installation

You will need a local copy of the GitHub snakemake_rnaseq repository on your machine. You can either:

use git in the shell: git clone git@github.com:BleekerLab/snakemake_rnaseq.git.
click on "Clone or download" and select download.
Then navigate inside the snakemake_rnaseq folder using Shell commands.

Usage

Configuration ✏️

You'll need to change a few things to adapt this pipeline to your needs. Make sure you have changed the parameters in the config/config.yaml file, which specifies where to find the sample data file, which genomic and transcriptomic reference fasta files to use, and the parameters for certain rules.
Thanks to this file, the Snakefile does not need to be edited when locations or parameters change.

📍 Option 1: conda (easiest)

Using the conda package manager, you need to create an environment in which core software such as Snakemake will be installed.

Install the Miniconda3 distribution (>= Python 3.7 version) for your OS (Windows, Linux or Mac OS X).
Inside a Shell window (command line interface), create a virtual environment named rnaseq using the envs/environment.yaml file with the following command: conda env create --name rnaseq --file envs/environment.yaml
Then, before you run the Snakemake pipeline, activate this virtual environment with conda activate rnaseq (or source activate rnaseq with older conda versions).

While a conda environment will in most cases work just fine, Docker is the recommended solution as it increases pipeline execution reproducibility.

🐳 Option 2: Docker (recommended)

Install Docker desktop for your operating system.
Open a Shell window and type: docker pull bleekerlab/snakemake_rnaseq:4.7.12 to retrieve a Docker image that includes the software required by the pipeline (Snakemake, conda and many others).
Run the pipeline on your system with: docker run --rm -v $PWD:/home/snakemake/ bleekerlab/snakemake_rnaseq:4.7.12 and append any Snakemake options (e.g. -n, --cores 10). The image was built using a Dockerfile based on the 4.7.12 Miniconda3 official Docker image.

🐳 Option 3: Singularity

Install Singularity.
Open a Shell window and type: singularity pull docker://bleekerlab/snakemake_rnaseq:4.7.12 to retrieve the Docker image that includes the software required by the pipeline (Snakemake, conda and many others) and convert it to a .sif file.
Run the pipeline on your system with singularity run snakemake_rnaseq_4.7.12.sif and append any Snakemake options (e.g. -n, --cores 10). The directory where the .sif file is stored will automatically be mapped to /home/snakemake. Results will be written to a folder named $PWD/results/ (you can change results to another name with the result_dir parameter of the config.yaml).

Dry run

With conda: use snakemake -np to perform a dry run that prints out the rules and commands without executing them.
With Docker: use the docker run

Real run

With conda: snakemake --cores 10

Installation and usage (HPC cluster)

Installation

You will need a copy of the GitHub snakemake_rnaseq repository on the cluster. On an HPC system, you will have to clone it using the Shell command-line: git clone git@github.com:BleekerLab/snakemake_rnaseq.git.

click on "Clone or download" and select download.
Then navigate inside the snakemake_rnaseq folder using Shell commands.

Usage

See the detailed protocol here.

Directed Acyclic Graph of jobs

References 📗

Authors

Marc Galland, m.galland@uva.nl
Tijs Bliek, m.bliek@uva.nl
Frans van der Kloet f.m.vanderkloet@uva.nl

Pipeline dependencies

Acknowledgments 👏

Johannes Köster, creator of Snakemake.

Citation

If you use this software, please use the following citation:

Bliek T., Chouaref J., van der Kloet F., Galland M. (2021). RNA-seq analysis pipeline (version 0.3.7). DOI: https://doi.org/10.5281/zenodo.4707140