/nextflow

A DSL for data-driven computational pipelines

Primary LanguageGroovyApache License 2.0Apache-2.0

Nextflow logo

"Dataflow variables are spectacularly expressive in concurrent programming"
Henri E. Bal , Jennifer G. Steiner , Andrew S. Tanenbaum

Nextflow CI Nextflow version Chat on Gitter Nextflow Twitter Nextflow Publication install with bioconda Nextflow license

Quick overview

Nextflow is a bioinformatics workflow manager that enables the development of portable and reproducible workflows. It supports deploying workflows on a variety of execution platforms including local, HPC schedulers, AWS Batch, Google Cloud Life Sciences, and Kubernetes. Additionally, it provides support for manage your workflow dependencies through built-in support for Conda, Docker, Singularity, and Modules.

Contents

Rationale

With the rise of big data, techniques to analyse and run experiments on large datasets are increasingly necessary.

Parallelization and distributed computing are the best ways to tackle this problem, but the tools commonly available to the bioinformatics community often lack good support for these techniques, or provide a model that fits badly with the specific requirements in the bioinformatics domain and, most of the time, require the knowledge of complex tools or low-level APIs.

Nextflow framework is based on the dataflow programming model, which greatly simplifies writing parallel and distributed pipelines without adding unnecessary complexity and letting you concentrate on the flow of data, i.e. the functional logic of the application/algorithm.

It doesn't aim to be another pipeline scripting language yet, but it is built around the idea that the Linux platform is the lingua franca of data science, since it provides many simple command line and scripting tools, which by themselves are powerful, but when chained together facilitate complex data manipulations.

In practice, this means that a Nextflow script is defined by composing many different processes. Each process can execute a given bioinformatics tool or scripting language, to which is added the ability to coordinate and synchronize the processes execution by simply specifying their inputs and outputs.

Quick start

Download the package

Nextflow does not require any installation procedure, just download the distribution package by copying and pasting this command in your terminal:

curl -fsSL https://get.nextflow.io | bash

It creates the nextflow executable file in the current directory. You may want to move it to a folder accessible from your $PATH.

Download from Conda

Nextflow can also be installed from Bioconda

conda install -c bioconda nextflow 

Documentation

Nextflow documentation is available at this link http://docs.nextflow.io

HPC Schedulers

Nextflow supports common HPC schedulers, abstracting the submission of jobs from the user.

Currently the following clusters are supported:

For example to submit the execution to a SGE cluster create a file named nextflow.config, in the directory where the pipeline is going to be launched, with the following content:

process {
  executor='sge'
  queue='<your execution queue>'
}

In doing that, processes will be executed by Nextflow as SGE jobs using the qsub command. Your pipeline will behave like any other SGE job script, with the benefit that Nextflow will automatically and transparently manage the processes synchronisation, file(s) staging/un-staging, etc.

Cloud support

Nextflow also supports running workflows across various clouds and cloud technologies. Nextflow can create AWS EC2 or Google GCE clusters and deploy your workflow. Managed solutions from both Amazon and Google are also supported through AWS Batch and Google Genomics Pipelines. Additionally, Nextflow can run workflows on either on-prem or managed cloud Kubernetes clusters.

Currently supported cloud platforms:

Tool management

Containers

Nextflow has first class support for containerization. It supports both Docker and Singularity container engines. Additionally, Nextflow can easily switch between container engines enabling workflow portability.

process samtools {
  container 'biocontainers/samtools:1.3.1'

  """
  samtools --version 
  """

}

Conda environments

Conda environments provide another option for managing software packages in your workflow.

Environment Modules

Environment modules commonly found in HPC environments can also be used to manage the tools used in a Nextflow workflow.

Community

You can post questions, or report problems by using the Nextflow discussion forum or the Nextflow channel on Gitter.

Nextflow also hosts a yearly workshop showcasing researcher's workflows and advancements in the langauge. Talks from the past workshops are available on the Nextflow YouTube Channel

The nf-core project is a community effort aggregating high quality Nextflow workflows which can be used by the community.

Build from source

Required dependencies

  • Compiler Java 8 or later
  • Runtime Java 8 or later

Build from source

Nextflow is written in Groovy (a scripting language for the JVM). A pre-compiled, ready-to-run, package is available at the Github releases page, thus it is not necessary to compile it in order to use it.

If you are interested in modifying the source code, or contributing to the project, it worth knowing that the build process is based on the Gradle build automation system.

You can compile Nextflow by typing the following command in the project home directory on your computer:

make compile

The very first time you run it, it will automatically download all the libraries required by the build process. It may take some minutes to complete.

When complete, execute the program by using the launch.sh script in the project directory.

The self-contained runnable Nextflow packages can be created by using the following command:

make pack

Once compiled use the script ./launch.sh as a replacement for the usual nextflow command.

The compiled packages can be locally installed using the following command:

make install

A self-contained distribution can be created with the command: make pack. To include support of GA4GH and its dependencies in the binary, use make packGA4GH instead.

IntelliJ IDEA

Nextflow development with IntelliJ IDEA requires the latest version of the IDE (2019.1.2 or later).

If you have it installed in your computer, follow the steps below in order to use it with Nextflow:

  1. Clone the Nextflow repository to a directory in your computer.
  2. Open IntelliJ IDEA and choose "Import project" in the "File" menu bar.
  3. Select the Nextflow project root directory in your computer and click "OK".
  4. Then, choose the "Gradle" item in the "external module" list and click on "Next" button.
  5. Confirm the default import options and click on "Finish" to finalize the project configuration.
  6. When the import process complete, select the "Project structure" command in the "File" menu bar.
  7. In the showed dialog click on the "Project" item in the list of the left, and make sure that the "Project SDK" choice on the right contains Java 8.
  8. Set the code formatting options with setting provided here.

Contributing

Project contribution are more than welcome. See the CONTRIBUTING file for details.

Build servers

License

The Nextflow framework is released under the Apache 2.0 license.

Citations

If you use Nextflow in your research, please cite:

P. Di Tommaso, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017) doi:10.1038/nbt.3820

Credits

Nextflow is built on two great pieces of open source software, namely Groovy and Gpars.

YourKit is kindly supporting this open source project with its full-featured Java Profiler. Read more http://www.yourkit.com