- Description
- Installation and usage (local machine)
- Installation and usage (HPC cluster)
- Directed Acyclic Graph of jobs
- References
- Citation
A Snakemake pipeline for the analysis of messenger RNA-seq data. It processes mRNA-seq fastq files and delivers both raw and normalised/scaled count tables. This pipeline also outputs a QC report per fastq file and a .bam
mapping file to use with a genome browser for instance.
This pipeline can process single or paired-end data and is mostly suited for Illumina sequencing data.
This pipeline analyses the raw RNA-seq data and produces two files containing the raw and normalized counts.
- The raw fastq files will be trimmed for adaptors and quality checked with `fastp`.
- The genome sequence FASTA file will be used for the mapping step of the trimmed reads using `STAR`.
- A GTF annotation file will be used to obtain the raw counts using `subread featureCounts`.
- The raw counts will be scaled by a custom R function that implements the `DESeq2` median of ratios method to generate the scaled ("normalized") counts.
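
To make the scaling step concrete, here is a minimal Python sketch of the median of ratios idea. It is not the pipeline's R function: the function name `median_of_ratios`, the dictionary-based input and the toy counts are all assumptions made for illustration. Each sample's size factor is the median, over genes, of the ratio between that sample's count and the gene's geometric mean across samples. Raw counts are then divided by the size factor.

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Scale a gene-by-sample count table with the median of ratios method.

    counts: dict mapping a gene name to a list of raw counts (one per sample).
    Returns (size_factors, scaled_counts). Hypothetical helper, for illustration.
    """
    n_samples = len(next(iter(counts.values())))
    # Genes with a zero count in any sample have no finite log geometric mean,
    # so they are left out when estimating the size factors.
    usable = {g: c for g, c in counts.items() if all(x > 0 for x in c)}
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    # Log of the per-gene geometric mean: the pseudo-reference sample.
    log_ref = {g: sum(math.log(x) for x in c) / n_samples for g, c in usable.items()}
    size_factors = []
    for j in range(n_samples):
        log_ratios = [math.log(usable[g][j]) - log_ref[g] for g in usable]
        size_factors.append(math.exp(median(log_ratios)))
    # All genes (including those skipped above) are scaled by the size factors.
    scaled = {g: [x / s for x, s in zip(c, size_factors)] for g, c in counts.items()}
    return size_factors, scaled


# Toy data (made up): sample 2 was sequenced twice as deeply as sample 1.
raw = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [5, 10],
    "geneD": [0, 3],
}
size_factors, scaled = median_of_ratios(raw)
```

In this toy table the second sample has twice the depth of the first, so after scaling the counts of `geneA`, `geneB` and `geneC` are equal across the two samples.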
- RNA-seq fastq files as listed in the `config/samples.tsv` file. Specify a sample name (e.g. "Sample_A") in the `sample` column and the paths to the forward read (`fq1`) and to the reverse read (`fq2`). If you have single-end reads, leave the `fq2` column empty.
- A genomic reference in FASTA format. For instance, a fasta file containing the 12 chromosomes of tomato (Solanum lycopersicum).
- A genome annotation file in the GTF format. You can convert a GFF annotation file format into GTF with the gffread program from Cufflinks: `gffread my.gff3 -T -o my.gtf`. ⚠️ For featureCounts to work, the feature in the GTF file should be `exon` while the meta-feature has to be `transcript_id`.
Below is an example of a GTF file format.
seqname | source | feature | start | end | score | strand | frame | attributes |
---|---|---|---|---|---|---|---|---|
SL4.0ch01 | maker_ITAG | CDS | 279 | 743 | . | + | 0 | transcript_id "Solyc01g004000.1.1"; gene_id "gene:Solyc01g004000.1"; gene_name "Solyc01g004000.1"; |
SL4.0ch01 | maker_ITAG | exon | 1173 | 1616 | . | + | . | transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1"; |
SL4.0ch01 | maker_ITAG | exon | 3793 | 3971 | . | + | . | transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1"; |
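
Before launching the pipeline, it can help to check that the annotation meets the featureCounts requirement stated above. The sketch below is one possible way to do that in Python. The helper name `check_gtf_for_featurecounts` is hypothetical, and it assumes a standard GTF file with 9 tab-separated columns (the table above shows the same records with `|` separators for display only).

```python
import re


def check_gtf_for_featurecounts(lines):
    """Return (number of exon records, number of exon records without transcript_id)."""
    n_exon = 0
    n_missing = 0
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
        feature, attributes = fields[2], fields[8]
        if feature != "exon":
            continue
        n_exon += 1
        if not re.search(r'transcript_id "[^"]+"', attributes):
            n_missing += 1
    return n_exon, n_missing


# The three records from the example table, written as tab-separated GTF lines.
example = [
    'SL4.0ch01\tmaker_ITAG\tCDS\t279\t743\t.\t+\t0\ttranscript_id "Solyc01g004000.1.1"; gene_id "gene:Solyc01g004000.1"; gene_name "Solyc01g004000.1";',
    'SL4.0ch01\tmaker_ITAG\texon\t1173\t1616\t.\t+\t.\ttranscript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1";',
    'SL4.0ch01\tmaker_ITAG\texon\t3793\t3971\t.\t+\t.\ttranscript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1";',
]
n_exon, n_missing = check_gtf_for_featurecounts(example)
```

For a real file, the same function can be called on an open file handle, e.g. `check_gtf_for_featurecounts(open("my.gtf"))`. A result with zero exon records, or with exon records lacking `transcript_id`, suggests the annotation needs fixing before counting.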
- A table of raw counts called `raw_counts.txt`: this table can be used to perform a differential gene expression analysis with `DESeq2`.
- A table of DESeq2-normalised counts called `scaled_counts.tsv`: this table can be used to perform an Exploratory Data Analysis with a PCA, heatmaps, sample clustering, etc.
- fastp QC reports: one per fastq file.
- bam files: one per fastq file (or pair of fastq files).
- Some knowledge of the Unix Shell to connect to a remote server where you will execute the pipeline. Good tutorials are available from the Software Carpentry Foundation and from Berlin Bioinformatics.
- Some knowledge of the Unix Shell to transfer datasets to and from a remote server (to transfer sequencing files and retrieve the `results/` folder). The Berlin Bioinformatics Unix beginner guide should be sufficient for that (check the `wget` and `scp` commands).
- An understanding of the steps of a canonical RNA-Seq analysis (trimming, alignment, etc.).
- `Snakefile`: a master file that contains the desired outputs and the rules to generate them from the input files.
- `config/samples.tsv`: a file containing sample names and the paths to the forward and, for paired-end data, reverse reads. This file has to be adapted to your sample names before running the pipeline.
- `config/config.yaml`: the configuration file that makes the Snakefile adaptable to any input files, genome and rule parameters.
- `config/refs/`: a folder containing
  - a genomic reference in fasta format. The `S_lycopersicum_chromosomes.4.00.chrom1.fa` file is provided for testing purposes.
  - a GTF annotation file. The `ITAG4.0_gene_models.sub.gtf` file is provided for testing purposes.
- `.fastq/`: a (hidden) folder containing subsetted paired-end fastq files used to test the pipeline locally. Generated using Seqtk: `seqtk sample -s100 <inputfile> 250000 > <output file>`. This folder should contain the `fastq` files of the paired-end RNA-seq data you want to run.
- `envs/`: a folder containing the environments needed for the pipeline:
  - The `environment.yaml` is used by the conda package manager to create a working environment (see below).
  - The `Dockerfile` is a Docker file used to build the docker image by referring to the `environment.yaml` (see below).
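
As an illustration of the `config/samples.tsv` layout described above (`sample`, `fq1`, `fq2` columns, with `fq2` left empty for single-end data), here is a small Python sketch that reads such a table and reports which samples are paired-end. The function name `read_samples` and the file paths in the example are hypothetical, and the pipeline itself may read this file differently.

```python
import csv
import io


def read_samples(handle):
    """Return {sample: (fq1, fq2 or None)} from a tab-separated samples table."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row["sample"].strip()
        fq1 = row["fq1"].strip()
        # An empty or missing fq2 value means single-end reads.
        fq2 = (row.get("fq2") or "").strip()
        if not name or not fq1:
            raise ValueError(f"sample name and fq1 are required: {row}")
        if name in samples:
            raise ValueError(f"duplicated sample name: {name}")
        samples[name] = (fq1, fq2 or None)
    return samples


# Made-up table: one paired-end sample and one single-end sample.
example = io.StringIO(
    "sample\tfq1\tfq2\n"
    "Sample_A\treads/A_R1.fastq.gz\treads/A_R2.fastq.gz\n"
    "Sample_B\treads/B_R1.fastq.gz\t\n"
)
samples = read_samples(example)
paired = {name: fq2 is not None for name, (fq1, fq2) in samples.items()}
```

Running a check like this on your own table before starting the pipeline can catch duplicated sample names or missing forward reads early.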
You will need a local copy of the GitHub `snakemake_rnaseq` repository on your machine. You can either:
- use git in the shell: `git clone git@github.com:BleekerLab/snakemake_rnaseq.git`.
- click on "Clone or download" and select `download`.

Then navigate inside the `snakemake_rnaseq` folder using Shell commands.
You'll need to change a few things to adapt this pipeline to your needs. Make sure you have changed the parameters in the `config/config.yaml` file, which specifies where to find the sample data file, the genomic and transcriptomic reference fasta files to use, and the parameters for certain rules.
This file is used so the `Snakefile` does not need to be changed when locations or parameters need to be changed.
Using the conda package manager, you need to create an environment where core software such as `Snakemake` will be installed.
- Install the Miniconda3 distribution (>= Python 3.7 version) for your OS (Windows, Linux or Mac OS X).
- Inside a Shell window (command line interface), create a virtual environment named `rnaseq` using the `envs/environment.yaml` file with the following command: `conda env create --name rnaseq --file envs/environment.yaml`
- Then, before you run the Snakemake pipeline, activate this virtual environment with `source activate rnaseq`.
While a `conda` environment will in most cases work just fine, Docker is the recommended solution as it increases pipeline execution reproducibility.
Option 2: using a Docker container
- Install Docker desktop for your operating system.
- Open a Shell window and type: `docker pull bleekerlab/snakemake_rnaseq:4.7.12` to retrieve a Docker image that includes the software required by the pipeline (Snakemake, conda and many others).
- Run the pipeline on your system with: `docker run --rm -v $PWD:/home/snakemake/ bleekerlab/snakemake_rnaseq:4.7.12` and add any options for snakemake (`-n`, `--cores 10`, etc.). The image was built using a Dockerfile based on the 4.7.12 Miniconda3 official Docker image.
- Install Singularity.
- Open a Shell window and type: `singularity run docker://bleekerlab/snakemake_rnaseq:4.7.12` to retrieve a Docker image that includes the software required by the pipeline (Snakemake, conda and many others).
- Run the pipeline on your system with `singularity run snakemake_rnaseq_4.7.12.sif` and add any options for snakemake (`-n`, `--cores 10`, etc.). The directory where the sif file is stored will automatically be mapped to `/home/snakemake`. Results will be written to a folder named `$PWD/results/` (you can change `results` to something you like in the `result_dir` parameter of the `config.yaml`).
- With conda: use `snakemake -np` to perform a dry run that prints out the rules and commands.
- With Docker: use the `docker run`

With conda: `snakemake --cores 10`
You will need a local copy of the GitHub `snakemake_rnaseq` repository on your machine. On an HPC system, you will have to clone it using the Shell command line: `git clone git@github.com:BleekerLab/snakemake_rnaseq.git`.
- click on "Clone or download" and select `download`.
- Then navigate inside the `snakemake_rnaseq` folder using Shell commands.
See the detailed protocol here.
- Marc Galland, m.galland@uva.nl
- Tijs Bliek, m.bliek@uva.nl
- Frans van der Kloet, f.m.vanderkloet@uva.nl
Johannes Köster, creator of Snakemake.
If you use this software, please use the following citation:
Bliek T., Chouaref J., van der Kloet F., Galland M. (2021). RNA-seq analysis pipeline (version 0.3.7). DOI: https://doi.org/10.5281/zenodo.4707140