A basic pipeline, implemented with Nextflow, for quantifying genomic features from ENCODE project short-read data.
This example can run locally; instructions are also given for testing it on the AWS Batch service.
- Unix-like operating system (Linux, macOS, etc)
- Java 8
- If you don't already have it, install Docker on your computer. Read more here.
- Install Nextflow (version 0.26.x or higher):
export NXF_VER=0.26.0-SNAPSHOT
curl -s https://get.nextflow.io | bash
- Launch the pipeline execution:
./nextflow run fstrozzi/rnaseq-encode-nf -with-docker
- When the execution completes, open the report generated at the following path in your browser:
results/multiqc_report.html
You can see an example report at the following link.
Note: the very first time you execute it, it will take a few minutes to download the pipeline from this GitHub repository and the associated Docker images needed to execute the pipeline.
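Before launching, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch, not part of the pipeline: `check_tool` is a hypothetical helper that only tests whether a command is on the `PATH`, and it assumes Nextflow was installed as `./nextflow` in the current directory.

```shell
# Hypothetical helper: report whether a command is available on PATH.
# Prints "found: <name>" or "missing: <name>"; returns non-zero when missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check the prerequisites named above; keep going even if one is missing.
for tool in java docker; do
  check_tool "$tool" || true
done

# The installer step places the launcher in the current directory.
if [ -x ./nextflow ]; then
  echo "found: ./nextflow"
else
  echo "missing: ./nextflow"
fi
```

This check does not verify the Java or Nextflow version; it only confirms that the commands can be found.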
To run this pipeline on AWS using the Batch service, you need to:
- Create an AWS Batch compute environment and queue. You can skip the job definition, since Nextflow will create it for you. Follow the instructions on the AWS website.
- Follow the instructions in the Nextflow docs to prepare an AMI with enough disk space to run the workflow.
- Install the AWS CLI and configure your AWS keys on the machine where you will execute Nextflow:
pip install awscli
aws configure
- Create a local nextflow.config file to specify the AWS Batch executor and its parameters, plus the path of the AWS CLI on the AMI:
executor {
    name = 'awsbatch'
    awscli = '/scratch/miniconda/bin/aws'
}
process {
    queue = 'my-aws-batch-queue'
}
- Run the pipeline:
./nextflow run fstrozzi/rnaseq-encode-nf -w s3://bucket/prefix
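The configuration step above can also be scripted, which may be convenient when the queue name or the AWS CLI path differs between environments. This is only a sketch: `write_batch_config` is a hypothetical helper, the values are the placeholder ones from the steps above, and the file is written to a scratch directory rather than to your launch directory.

```shell
# Hypothetical helper: write a nextflow.config for the AWS Batch executor.
# Arguments: <queue name> <path of the AWS CLI on the AMI> [target directory]
write_batch_config() {
  cat > "${3:-.}/nextflow.config" <<EOF
executor {
    name = 'awsbatch'
    awscli = '$2'
}

process {
    queue = '$1'
}
EOF
}

# Example values copied from the steps above, written to a scratch directory.
workdir="$(mktemp -d)"
write_batch_config 'my-aws-batch-queue' '/scratch/miniconda/bin/aws' "$workdir"
cat "$workdir/nextflow.config"
```

To use the generated file, copy it into the directory from which you launch Nextflow, then run the pipeline with an S3 work directory as described above.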
RNASeq-NF execution relies on the Nextflow framework, which provides an abstraction between the pipeline's functional logic and the underlying processing system.
This allows the pipeline to run on a single computer or on an HPC cluster without modification.
Currently the following resource manager platforms are supported:
- Univa Grid Engine (UGE)
- Platform LSF
- SLURM
- PBS/Torque
By default the pipeline is parallelized by spawning multiple threads on the machine where the script is launched.
To submit the execution to a UGE cluster, create a file named nextflow.config in the directory where the pipeline is going to be executed, with the following content:
process {
    executor='uge'
    queue='<queue name>'
}
To learn more about the available settings and the configuration file, read the Nextflow documentation.
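The same file can be produced for the other resource managers listed above by changing the executor name. The sketch below makes that explicit; `write_cluster_config` is a hypothetical helper, and the identifiers `lsf`, `slurm` and `pbs` are assumed to be the Nextflow executor names for those platforms (only `uge` appears in this document), so check them against the Nextflow documentation for your version.

```shell
# Hypothetical helper: write a nextflow.config for a cluster resource manager.
# Arguments: <executor> <queue name> [target directory]
write_cluster_config() {
  case "$1" in
    uge|lsf|slurm|pbs) ;;   # assumed executor names, see the note above
    *) echo "unsupported executor: $1" >&2; return 1 ;;
  esac
  cat > "${3:-.}/nextflow.config" <<EOF
process {
    executor = '$1'
    queue = '$2'
}
EOF
}

# Example: a SLURM configuration with a placeholder queue name,
# written to a scratch directory.
cluster_dir="$(mktemp -d)"
write_cluster_config slurm 'my-queue' "$cluster_dir"
cat "$cluster_dir/nextflow.config"
```

Rejecting unknown names early avoids launching the pipeline with a configuration that the scheduler cannot honor.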
RNASeq-NF uses the following software components and tools: