"Dataflow variables are spectacularly expressive in concurrent programming"
Henri E. Bal, Jennifer G. Steiner, Andrew S. Tanenbaum

Quick overview

Nextflow is a bioinformatics workflow manager that enables the development of portable and reproducible workflows. It supports deploying workflows on a variety of execution platforms, including local, HPC schedulers, AWS Batch, Google Genomics Pipelines, and Kubernetes. Additionally, it helps you manage your workflow dependencies through built-in support for Conda, Docker, Singularity, and Modules.

Rationale
Quick start
Documentation
HPC Schedulers
Cloud support
Tool management
Community
Build from source
Contributing
License
Citations
Credits

Rationale

With the rise of big data, techniques to analyse and run experiments on large datasets are increasingly necessary.

Parallelization and distributed computing are the best ways to tackle this problem. However, the tools commonly available to the bioinformatics community often lack good support for these techniques, or provide a model that fits badly with the specific requirements of the bioinformatics domain. Most of the time they also require knowledge of complex tools or low-level APIs.

The Nextflow framework is based on the dataflow programming model, which greatly simplifies writing parallel and distributed pipelines without adding unnecessary complexity, letting you concentrate on the flow of data, i.e. the functional logic of the application or algorithm.

It doesn't aim to be yet another pipeline scripting language. Instead, it is built around the idea that the Linux platform is the lingua franca of data science, since it provides many simple command-line and scripting tools that are powerful by themselves and, when chained together, facilitate complex data manipulations.

In practice, this means that a Nextflow script is defined by composing many different processes. Each process can execute a given bioinformatics tool or script, and Nextflow coordinates and synchronizes the execution of the processes based simply on the inputs and outputs they declare.
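To illustrate the kind of tool chaining this design builds on, here is a minimal sketch in plain shell (not Nextflow code). It counts how many sequences each sample contributes to a FASTA file using only grep, cut, sort and uniq. The function name count_by_sample and the header convention (">sample_readN") are assumptions made for this example.

```shell
# Sketch: count FASTA records per sample by chaining standard tools.
# Assumes header lines look like ">sampleA_read1" (hypothetical naming scheme).
count_by_sample() {
  grep '^>' "$1" |   # keep only the header lines
    cut -c2- |       # drop the leading ">"
    cut -d_ -f1 |    # keep the text before the first underscore
    sort |           # bring identical sample names together
    uniq -c          # count each group
}
```

Calling count_by_sample reads.fa (reads.fa being a placeholder file name) prints one line per sample, with the count first. Each tool does one small job, and the pipe connects one tool's output to the next tool's input.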

Quick start

Download the package

Nextflow does not require any installation procedure. Just download the distribution package by copying and pasting this command into your terminal:

curl -fsSL get.nextflow.io | bash

It creates the nextflow executable file in the current directory. You may want to move it to a folder accessible from your $PATH.
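One possible way to do that is sketched below. The helper name install_launcher and the choice of a per-user bin directory are assumptions for illustration; any directory already listed in your $PATH works just as well.

```shell
# Sketch: move a downloaded launcher into a directory of your choice.
# install_launcher is a hypothetical helper, not something Nextflow ships.
install_launcher() {
  launcher="$1"   # path to the downloaded file, e.g. ./nextflow
  bin_dir="$2"    # destination directory, e.g. "$HOME/bin"
  mkdir -p "$bin_dir" &&
    chmod +x "$launcher" &&
    mv "$launcher" "$bin_dir/"
}
```

For example, install_launcher ./nextflow "$HOME/bin" moves the file into $HOME/bin. If that directory is not yet in your $PATH, add it with export PATH="$HOME/bin:$PATH" in your shell profile.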

Download from Conda

Nextflow can also be installed from Bioconda:

conda install -c bioconda nextflow

Documentation

Nextflow documentation is available at http://docs.nextflow.io

HPC Schedulers

Nextflow supports common HPC schedulers, abstracting the submission of jobs from the user.

The schedulers currently supported are listed in the Nextflow documentation.

For example, to submit the execution to an SGE cluster, create a file named nextflow.config in the directory where the pipeline is going to be launched, with the following content:

process {
  executor='sge'
  queue='<your execution queue>'
}

With this configuration, Nextflow executes processes as SGE jobs using the qsub command. Your pipeline will behave like any other SGE job script, with the benefit that Nextflow automatically and transparently manages process synchronisation, file staging and un-staging, etc.
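If you launch pipelines from several directories, a small shell helper can write this file for you. The sketch below is only a convenience under stated assumptions: write_executor_config is a hypothetical function name, and the queue name used in the usage note is a placeholder for one that exists on your cluster.

```shell
# Sketch: generate a nextflow.config with the same shape as the example above.
# write_executor_config is a hypothetical helper, not a Nextflow command.
write_executor_config() {
  executor="$1"     # executor name, e.g. sge
  queue="$2"        # name of your execution queue
  dir="${3:-.}"     # launch directory (defaults to the current one)
  cat > "$dir/nextflow.config" <<EOF
process {
  executor='$executor'
  queue='$queue'
}
EOF
}
```

Running write_executor_config sge long in the launch directory produces the configuration shown above with the queue set to long. Note that it overwrites any nextflow.config already present in that directory.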

Cloud support

Nextflow also supports running workflows across various clouds and cloud technologies. Nextflow can create AWS EC2 or Google GCE clusters and deploy your workflow. Managed solutions from both Amazon and Google are also supported through AWS Batch and Google Genomics Pipelines. Additionally, Nextflow can run workflows on either on-prem or managed cloud Kubernetes clusters.

Currently supported cloud platforms: AWS EC2, AWS Batch, Google GCE, Google Genomics Pipelines, and Kubernetes.

Tool management

Containers

Nextflow has first-class support for containerization. It supports both the Docker and Singularity container engines. Additionally, Nextflow can easily switch between container engines, which makes workflows portable across environments.

process samtools {
  container 'biocontainers/samtools:1.3.1'

  """
  samtools --version 
  """

}
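As a rough sketch of what switching engines can look like from the command line, the helper below launches the same script with either engine by passing the -with-docker or -with-singularity option to nextflow run. The function name run_with_engine and the script name main.nf are illustrative assumptions, and the helper assumes the container image is declared in the process, as in the example above.

```shell
# Sketch: run one script under Docker or Singularity without editing it.
# run_with_engine and main.nf are hypothetical names used for illustration.
run_with_engine() {
  engine="$1"   # "docker" or "singularity"
  script="$2"   # path to the Nextflow script
  case "$engine" in
    docker|singularity)
      # enables the chosen engine for this run only
      nextflow run "$script" "-with-$engine"
      ;;
    *)
      echo "unsupported container engine: $engine" >&2
      return 1
      ;;
  esac
}
```

For instance, run_with_engine docker main.nf on a workstation and run_with_engine singularity main.nf on a cluster execute the same process definition with different engines.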

Conda environments

Conda environments provide another option for managing software packages in your workflow.

Environment Modules

Environment modules commonly found in HPC environments can also be used to manage the tools used in a Nextflow workflow.

Community

You can post questions or report problems using the Nextflow discussion forum or the Nextflow channel on Gitter.

Nextflow also hosts a yearly workshop showcasing researchers' workflows and advancements in the language. Talks from past workshops are available on the Nextflow YouTube channel.

The nf-core project is a community effort that collects high-quality Nextflow workflows for use by the community.

Build from source

Required dependencies

Compiler: Java 8
Runtime: Java 8 or later

Build from source

Nextflow is written in Groovy (a scripting language for the JVM). A pre-compiled, ready-to-run package is available on the GitHub releases page, so it is not necessary to compile it in order to use it.

If you are interested in modifying the source code or contributing to the project, it is worth knowing that the build process is based on the Gradle build automation system.

You can compile Nextflow by typing the following command in the project home directory on your computer:

make compile

The very first time you run it, it will automatically download all the libraries required by the build process. It may take some minutes to complete.

When complete, execute the program by using the launch.sh script in the project directory.

Self-contained, runnable Nextflow packages can be created with the following command:

make pack

In order to install the compiled packages use the following command:

make install

Then you will be able to run nextflow using the nextflow launcher script in the project root folder.

Known compilation problems

Nextflow requires JDK 8 to be compiled. The Java compiler used by the build process can be chosen by setting the JAVA_HOME environment variable accordingly.
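A minimal sketch of selecting the JDK before building is shown below. The helper name use_jdk is hypothetical, and the check for bin/javac is an extra safety step added for this example rather than something the build requires.

```shell
# Sketch: point JAVA_HOME at a JDK directory, refusing paths without a compiler.
# use_jdk is a hypothetical helper name used for illustration.
use_jdk() {
  jdk_dir="$1"   # root directory of the JDK 8 installation
  if [ ! -x "$jdk_dir/bin/javac" ]; then
    echo "no javac found under $jdk_dir" >&2
    return 1
  fi
  export JAVA_HOME="$jdk_dir"
}
```

After calling use_jdk with the path of your JDK 8 installation (the location varies between systems), run make compile in the same shell session so that the build picks up the exported variable.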

If the compilation stops with the error java.lang.VerifyError: Bad <init> method call from inside of a branch, this is due to a bug affecting the following JDK versions:

1.8.0 update 11
1.8.0 update 20

Upgrade to a newer JDK to avoid this issue. Alternatively, a possible workaround is to define the following variable in your environment:

_JAVA_OPTIONS='-Xverify:none'

IntelliJ IDEA

Nextflow development with IntelliJ IDEA requires a recent version of the IDE (2018.3 or higher).

If you have it installed on your computer, follow the steps below to use it with Nextflow:

Clone the Nextflow repository to a directory on your computer.
Open IntelliJ IDEA and choose "Import project" in the "File" menu.
Select the Nextflow project root directory on your computer and click "OK".
Then, choose the "Gradle" item in the "external module" list and click the "Next" button.
Confirm the default import options and click "Finish" to finalize the project configuration.
When the import process completes, select the "Project structure" command in the "File" menu.
In the dialog that appears, click the "Project" item in the list on the left, and make sure that the "Project SDK" choice on the right contains Java 8.

Contributing

Project contributions are more than welcome. See the CONTRIBUTING file for details.

Information on setting up your development environment to work on Nextflow can be found in the IntelliJ IDEA section above.

License

The Nextflow framework is released under the Apache 2.0 license.

Citations

If you use Nextflow in your research, please cite:

P. Di Tommaso, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017) doi:10.1038/nbt.3820

Credits

Nextflow is built on two great pieces of open source software, namely Groovy and GPars.

YourKit is kindly supporting this open source project with its full-featured Java Profiler. Read more at http://www.yourkit.com
