/grape-nf

An automated RNA-seq pipeline using Nextflow

Primary LanguageShellGNU General Public License v3.0GPL-3.0

Grape

Build Status

Grape provides an extensive pipeline for RNA-Seq analyses. It allows the creation of an automated and integrated workflow to manage and analyse RNA-Seq data.

It uses Nextflow as the execution backend. Please check Nextflow documentation for more information.

Quickstart

Installing Nextflow

Nextflow is distributed as a self-contained executable package, this means that it does not require any special installation procedure.

It only needs two easy steps:

  • Download the executable package by copying and pasting the following command in your terminal window: wget -qO- get.nextflow.io | bash. It will create the nextflow main executable file in the current directory.

  • Optionally, move the nextflow file in a directory accessible by your $PATH variable (this is only required to avoid to remember and type the Nextflow full path each time you need to run it).

Tip
In the case you don’t have wget installed you can use the curl utility instead by entering the following command: curl -fsSL get.nextflow.io | bash
Note
The pipeline requires Nextflow version 0.23.1 or higher. Make sure you have the right version by running nextflow info. In case you have an older version you can upgrade as follows:
$ nextflow -self-update

Setting up the pipeline

Install the pipeline

Using Nextflow github sharing feature the pipeline can be easily installed:

$ nextflow pull guigolab/grape-nf
Checking guigolab/grape-nf ...
 downloaded from https://github.com/guigolab/grape-nf.git

Run a test

The pipeline has a quick test profile already configured.

Warning
Docker is required in order to run the test

First create a working directory:

$ mkdir test && cd test

Then run the test:

$ nextflow run grape-nf -profile test
N E X T F L O W  ~  version 0.17.3
Launching 'guigolab/grape-nf' - revision: a6147f7add [master]

G R A P E ~ RNA-seq Pipeline

General parameters
------------------
Index file                      : /Users/emilio/.nextflow/assets/guigolab/grape-nf/test-index.txt
Genome                          : /Users/emilio/.nextflow/assets/guigolab/grape-nf/data/genome.fa
Annotation                      : /Users/emilio/.nextflow/assets/guigolab/grape-nf/data/annotation.gtf
Pipeline steps                  : mapping bigwig contig quantification

Mapping parameters
------------------
Tool                            : GEM 1.7.1
Max mismatches                  : 4
Max multimaps                   : 10

Bigwig parameters
-----------------
Tool                            : RGCRG 0.1
References prefix               : chr

Quantification parameters
-------------------------
Tool                            : FLUX 1.6.1
Mode                            : Genome

Execution information
---------------------
Engine                          : local
Use Docker                      : true
Error strategy                  : ignore

Dataset information
-------------------
Number of sequenced samples     : 1
Number of sequencing runs       : 1
Merging                         : none

===============
Output files db -> /Users/emilio/workspace/grape-nf/pipeline.db
===============

[warm up] executor > local
[09/a54cf9] Submitted process > fastaIndex (genome-SAMtools-0.1.19)
[65/89ba24] Submitted process > index (genome-GEM-1.7.1)
[da/39803f] Submitted process > mapping (test1-GEM-1.7.1)
[81/bcd1cf] Submitted process > inferExp (test1-RSeQC-2.3.9)
[a9/686572] Submitted process > quantification (test1-FLUX-1.6.1)
[67/2783a7] Submitted process > contig (test1-RGCRG-0.1)
[87/c103c9] Submitted process > bigwig (test1-RGCRG-0.1)

-----------------------
Pipeline run completed.
-----------------------

Get the pipeline help

The usage message can be seen with the following command:

$ nextflow run grape-nf --help
N E X T F L O W  ~  version 0.17.3
Launching 'guigolab/grape-nf' - revision: a6147f7add [master]

G R A P E ~ RNA-seq Pipeline
----------------------------
Run the GRAPE RNA-seq pipeline on a set of data.

Usage:
    grape-pipeline.nf --index INDEX_FILE --genome GENOME_FILE --annotation ANNOTATION_FILE [OPTION]...

Options:
    --help                              Show this message and exit.
    --index INDEX_FILE                  Index file.
    --genome GENOME_FILE                Reference genome file(s).
    --annotation ANNOTAION_FILE         Reference gene annotation file(s).
    --steps STEP[,STEP]...              The steps to be executed within the pipeline run. Possible values: "mapping", "bigwig", "contig", "quantification". Default: all
    --max-mismatches THRESHOLD          Set maps with more than THRESHOLD error events to unmapped. Default "4".
    --max-multimaps THRESHOLD           Set multi-maps with more than THRESHOLD mappings to unmapped. Default "10".
    --bam-sort METHOD                   Specify the method used for sorting the genome BAM file.
    --add-xs                            Add the XS field required by Cufflinks/Stringtie to the genome BAM file.

SAM read group options:
    --rg-platform PLATFORM              Platform/technology used to produce the reads for the BAM @RG tag.
    --rg-library LIBRARY                Sequencing library name for the BAM @RG tag.
    --rg-center-name CENTER_NAME        Name of sequencing center that produced the reads for the BAM @RG tag.
    --rg-desc DESCRIPTION               Description for the BAM @RG tag.

Configuring the pipeline

Executors

Nextflow provides different Executors to run processes on the local machine, on a computational grid or the cloud without any change to the actual code. By default a local executor is used, but it can be changed by using Nextflow executors.

For example, to run the pipeline in a computational cluster using Sun Grid Engine you can set up a nextflow.config file in your current working directory with something like:

process {
    executor = 'sge'
    queue    = 'my-queue'
    penv     = 'smp'
}

Input data

The pipeline needs as an input a tab separated file containing containing information about the FASTQ files to be processed. The needed columns in order are:

1

 sample

the sample identifier, used to merge bam files in case multiple runs for the same sample are present

2

 id

the run identifier (e.g. test1)

3

 path

the path to the fastq file

4

 type

the type (e.g. fastq)

5

 view

an attribute that specifies the content of the file (e.g. FqRd for single-end data or FqRd1/FqRd2 for paired-end data)

Note
Fastq files from paired-end data will be grouped together by id.

Here is an example from the test run:

sample1  test1   data/test1_1.fastq.gz   fastq   FqRd1
sample1  test1   data/test1_2.fastq.gz   fastq   FqRd2

Sample and id can be the same in case you don’t have/know sample identifiers:

sample1  test1   data/test1_1.fastq.gz   fastq   FqRd1
sample1  test1   data/test1_2.fastq.gz   fastq   FqRd2

Software

The default Grape configuration uses Docker to provision the programs needed for the execution. Pre-built Grape containers are publicly available at the Grape page in Docker Hub.

Nextflow also supports Environment Modules. Creating a working configuration for Grape using modules is not yet straightforward. If you need to use modules please contact us directly.

Pipeline profiles

The Grape pipeline can be run using different configuration profiles. The profiles essentially allow the user to run the analyses using different tools and configurations. To specify a profile you can use the -profiles Nextflow option.

The following profiles are available at present:

profile description
 gemflux

uses GEMtools for mapping pipeline and Flux Capacitor for isoform expression quantification

 starrsem

uses STAR for mapping and bigwig and RSEM for isoform expression quantification

 starflux

uses STAR for mapping and Flux Capacitor for isoform expression quantification

The default profile uses STAR and RSEM and set the --bam-sort option to samtools.

Run the pipeline

To run the pipeline first create a working directory and move there:

$ mkdir grape-pipeline && cd grape-pipeline

Here is a simple example of how you can run the pipeline once you set up it properly:

nextflow -bg run grape-nf --index input-files.tsv --genome refs/hg38.AXYM.fa --annotation refs/gencode.v21.annotation.AXYM.gtf --rg-platform ILLUMINA --rg-center-name CRG -resume > pipeline.log

By default the pipeline execution will stop as far as one of the processes fails. To change this behaviour you can use the [errorStrategy directive](http://www.nextflow.io/docs/latest/process.html#errorstrategy) of Nextflow processes. You can also specify it on command line. For example, to ignore errors and keep processing you can use -process.errorStrategy=ignore.

It is also possible to run a subset of pipeline steps using the option --steps. For example, the following command will only run the mapping and quantification steps:

nextflow -bg run grape-nf --steps mapping,quantification --index input-files.tsv --genome refs/hg38.AXYM.fa --annotation refs/gencode.v21.annotation.AXYM.gtf --rg-platform ILLUMINA --rg-center-name CRG > pipeline.log

Pipeline results

The pipeline compiles a list of output files into the file pipeline.db inside the current working folder. This format of this file is similar to the input with a few more columns:

1

 sample

the sample identifier, used to merge bam files in case multiple runs for the same sample are present

2

 id

the run identifier (e.g. test1)

3

 path

the path to the fastq file

4

 type

the type (e.g. bam)

5

 view

an attribute that specifies the content of the file (e.g. GenomeAlignments)

6

 readType

the input data type (either Single-End or Paired-End)

7

 readStrand

the inferred exepriment strandednes if any (it can be NONE for unstranded data, SENSE or ANTISENSE for single-end data, MATE1_SENSE or MATE2_SENSE for paired-end data.)

Here is an example from the test run:

sample1   test1   /path/to/results/sample1.contigs.bed    bed      Contigs                Paired-End   MATE2_SENSE
sample1   test1   /path/to/results/sample1.isoforms.gtf   gtf      TranscriptAnnotation   Paired-End   MATE2_SENSE
sample1   test1   /path/to/results/sample1.plusRaw.bw     bigWig   PlusRawSignal          Paired-End   MATE2_SENSE
sample1   test1   /path/to/results/sample1.genes.gff      gtf      GeneAnnotation         Paired-End   MATE2_SENSE
sample1   test1   /path/to/results/test1_m4_n10.bam       bam      GenomeAlignments       Paired-End   MATE2_SENSE
sample1   test1   /path/to/results/sample1.minusRaw.bw    bigWig   MinusRawSignal         Paired-End   MATE2_SENSE

Software versions

The pipeline can be run without a container engine by installing the required software on the local system.

The required tool version for the standard profile that have been tested so far are the following: