RNA-Seq based pipeline

This repository contains the backbone RNA-Seq pipeline following the Snakemake gold standard implementation. Currently, the following steps are implemented:

  • QC with fastqc, multiQC, and Picard tool
  • alignment against the genome with STAR
  • expression quantification using Htseq2 and RSEM at both gene and transcript levels
  • construction of splicing graphs per input sample using SplAdder
  • compute the hypoxia score using R package sspath
  • differential expression analysis using DeSeq2 and LIMMA-Voom
  • generate TCGA plots for HIF genes using R-scripts

Conda environment

The tools used in the pipeline are defined in a conda environment file in the envs directory. This environment can be built once before the pipeline is run on many samples and then re-used for any further run. However, if the definition of the environment (envs/wf_rna_seq.yml) changes, the environment will be recreated.

Preliminary steps

Conda

To generate the conda environment upfront, one can run the following on a server with internet access:

snakemake --cores 1 -p --use-conda --conda-prefix $CONDA_DIR --create-envs-only

Here, $CONDA_DIR should contain (or be replaced with) a parent directory for the respective conda environment to be created.

Samples

To specify any samples to be run on or specific parameters to be changed samples.tsv and config.yaml need to be adapted.

Executing the pipeline

To use the workflow-rna-seq-star pipeline on a single compute node, you will need to follow the command:

bash
snakemake --cores 10 -p --use-conda --conda-prefix $CONDA_DIR

Again, $CONDA_DIR contains the (previously used) path to the conda environments.

Examples of outputs from each of the rules

Successful execution of the snakemake workflow will generate multiple files in different formats for each input sample. In addition to that, each rule generates a set of output files in different directories. The following table briefly defines the outputs files from each step.

Rules Description Directory Tool
QC_FASTQC generates quality control reports of the raw fastq files and returns htmland zip files qc/fastqc fastqc =0.11.8
QC_PICARD runs Picard CollectRnaSeqMetrics on the BAm files qc/picard picard =2.18.4
QC_QORTS runs QoRTs and generates geneBodyCoverage plot and QoRTs multiplot for each sample qc/qorts qorts =1.3.0
QC_RSEQC_BAMSTAT runs RSeQC tool to generate BAMSTAT on BAMs qc/rseqc rseqc =3.0.0
QC_RSEQC_READGC runs RSeQC tool to generate GC content qc/rseqc rseqc =3.0.0
QC_MULTIQC aggregates results from tools like FASTQC, RSEM, Picard, HTSeq to a html file qc/multiqc multiqc =1.7
RSEQC_PLOT generates multisample RSeQC GC content plot qc/rseqc rseqc =3.0.0
QORTS_PLOT generates multisample QoRTs geneBodyCoverage plot qc/qorts qorts =1.3.0 & R QoRTs library
ALIGN_GENOME aligns the raw fastq files, returns aligned files in *.BAM format alignments star =2.7.0
ALIGN_TRANSCRIPT aligns the raw fastq files returns aligned files at transcript level in .bam format alignments_transcript star =2.7.0
ALIGN_RSEM EXPR_RSEM returns the aligned .bam file for each input genes, and quantification results of those gene on isoform and gene level as .genes.results and .isoforms.results files expression/rsem rsem =1.3.1
EXPR_HTSEQ returns gene-level quantified files in *.txt format expression/htseq htseq =0.9.1
EXPR_SIMPLE EXPR_SIMPLE_NA returns expression quantified files in tsv format for each input sample. NA stands for expression quantification for non-alternative, reutrns non_alt.tsv file, and a combined hdf5 file for all samples expression/simple_counting count_hdf2tsv
DE_GENES_DESEQ DE_GENES_LIMMA returns *.tsv files each from two tools with differential expression values for each genes with estimated log2 fold changes , p-values, and adjusted p-values (FDR) diffexp/deseq diffexp/limma_voom deseq2 =1.22.1 limma =3.28.2
HYPO_SCORE returns hypoxia score for selected HIF genes (n=30) as a tsv file hypoxia hypoxia
HYPO_PLOT_ALL HYPO_PLOT_SKIN generates hypoxia boxplots for a sample with all TCGA cancer types, GTEX samples and boxplot with TCGA Melanoma and GTEX skin hypoxia r-sspaths =0.1.1 & rscripts
TCGA_BOXPLOT generates boxplots for patient samples in the context of TCGA for the genes in this file. To plot more genes, add them to the above file in the format HUGO_symbol,Ensembl_ID tcga_boxplot/png_images r-scripts
SPLICE_EVENTS_S SPLICE_COUNTS_S SPLICE_COUNTS_M SPLICE_BURDEN returns graphs of BAM files in pickle format, quantified graphs in .count.hdf5 formats, merged graphs and splicing burden in same formats splicing/spladder spladder==2.2.0

Customization of the workflow for different projects

The current workflow can be composed and customized according to your requirements by modifying the configuration settings and the snakefile. For the sake of the documentation, let us take the example of the in-house Fly-Base (FB) project. The output files for FB project are hypoxia score, splicing burden and TCGA plots. To generate only those outputs you will need to modify and comment some of the global variables and rules within some files. By doing so, you will see the execution of other rules will be skipped, and it results in only the required outputs. The following are the files you will need to modify.

  • sample: Add the new TP samples IDs, lane information and fastq file names to this tsv file in the corresponding columns.

  • snakefile: Comment the following target rules, in the snakefile.

    • DE_GENES_DESEQ: compute differential expression analysis using deseq2
    • DE_GENES_LIMMA: compute differential expression analysis using LIMMA -voom
    • EXPR_HTSEQ : quantify the expression of genes using HTSeq2
    • include: "rules/diffexp.smk" : Rules for differential expression analysis using deseq2 and LIMMA-voom
  • config: Within the configuration file file please change the paths for the following entries

    • fastqdir: Replace with the absolute path where your fastq files are located.
    • outdir: Replace with the path where you want your results
    • logdir: Replace with the path where you want your log files
    • workdir: Replace with the path where you want your workdir
    • samples_list: The list of samples you plan to run

Leave all other entries in above-mentioned files as default. By this, you are now ready to execute the workflow for the TP project. To run the workflow in the cluster please follow the following instructions.

Executing the pipeline on the cluster

To run the pipeline using the LSF scheduler, the following command can be run:

snakemake --use-conda --conda-prefix $CONDA_DIR --configfile config.yaml --cluster 'bsub -M {params.res_mem} -W {params.res_time} -n {threads} -R "rusage[mem={params.res_mem}]" -o {params.lsf_log}' -j 20 -k -p --latency-wait 120

Specifically, the following items are:

  • $CONDA_DIR contains the path to the conda environments
  • config.yaml contains the specific run-configurations for the project
  • {params.*} and {threads} contain the tool-specific resources defined in the rules / config
  • -j 20 allows a max of 20 jobs to be run at the same time
  • -k will continue the workflow even if a single jkob crashed but independent jobs can continue
  • -latency-wait 120 increases the wait-time for result files to 2 minutes (useful in a distributed system)

Snakemake Reports

Snakemake reports report.html can be generated once the pipeline run is over.

The current report have the hypoxia plots, qorts geneBody Coverage plot, RSeQC GC content plot and MultiQC html file.

To generate the snakemake report

snakemake --report report.html

To manage the order of tools in MultiQC html file, please edit this file

ToDo

  • scripts/tcga_boxplot/*, move to central gromics repo
  • Make the rule to generate rsem_reference