MpGAP pipeline

A generic multi-platform genome assembly pipeline

See the documentation »

Report Bug · Request Feature

About

MpGAP is an easy-to-use, Docker-based Nextflow pipeline that adopts well-known software for de novo genome assembly of Illumina, PacBio and Oxford Nanopore sequencing data through Illumina-only, long-reads-only or hybrid modes. The pipeline wraps the following software:

| Step | Source |
| :--- | :--- |
| Assemblers | Canu, Flye, Raven, Shasta, wtdbg2, Haslr, Unicycler, Spades, Shovill |
| Polishers | Nanopolish, Medaka, gcpp, Pilon |
| Quality check | QUAST, MultiQC |

Release notes

Are you curious about changes between releases? See the changelog.

I strongly recommend using the latest version hosted in the master branch, which is what Nextflow runs by default.
- The latest version will always be supported and receive bug fixes, and it generally maintains the processes that were in previous versions (I mainly add things instead of removing them).
- If you really want to execute an earlier release, please see the instructions for that.

Versions below 2.0 are no longer supported.
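
Nextflow's `-r` option selects which revision (branch, tag or commit) of a pipeline to run, so pinning an earlier release could look roughly like the sketch below. The `run_mpgap_release` helper is hypothetical, and the tag name used in the usage line is an assumption for illustration only; check the repository's releases for the tags that actually exist.

```shell
# Hypothetical helper (not part of MpGAP): run a pinned revision of the pipeline.
# Usage: run_mpgap_release <revision> [pipeline options...]
run_mpgap_release() {
  if [ "$#" -lt 1 ]; then
    echo "usage: run_mpgap_release <revision> [pipeline options...]" >&2
    return 2
  fi
  revision="$1"
  shift
  # -r is Nextflow's option for choosing a branch, tag or commit
  nextflow run fmalmeida/mpgap -r "$revision" "$@"
}
```

For example, `run_mpgap_release v3.0 --help` would show the help message of the revision tagged `v3.0`, assuming such a tag exists.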

Feedback

In this pipeline we always try to create a workflow and execution dynamics that are as generic as possible and suit as many use cases as possible.

Therefore, feedback is very welcome. If you believe that your use case is not covered by the pipeline, have ideas for enhancements or have found a bug, please do not hesitate to open an issue to discuss it.

Requirements

This pipeline has only two dependencies: Docker and Nextflow.

Unix-like operating system (Linux, macOS, etc)
- Windows users may be able to execute it using the Windows Subsystem for Linux, as shown in:
Java 8 (or higher)
Nextflow (version 20.01 or higher)
Docker
- Image: fmalmeida/mpgap:v3.0
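
Before installing anything, it can help to see which of the requirements above are already available. The sketch below is a minimal check that uses only the POSIX `command -v` builtin; the `check_tools` function name is hypothetical. It only tests whether each command is on the PATH and does not verify versions, so Java 8+ and Nextflow 20.01+ still need to be confirmed with `java -version` and `nextflow -version`.

```shell
# Hypothetical helper: report whether each required tool is on the PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be `check_tools java docker nextflow`.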

Installation

If you don't have it already, install Docker on your computer.
- Once it is installed, download the required Docker image:
```
docker pull fmalmeida/mpgap:v3.0
```
Install Nextflow (version 20.01 or higher):
```
curl -s https://get.nextflow.io | bash
```
Give it a try:
```
nextflow run fmalmeida/mpgap --help
```

🔥 Users can keep the pipeline up to date with: nextflow pull fmalmeida/mpgap

Documentation

Explanation of hybrid strategies

Hybrid assemblies can be produced with two available strategies. Please read more about the strategies and how to set them up in the online documentation.

➡️ The strategy is chosen with the parameter --hybrid_strategy.

Strategy 1

It uses the hybrid assembly modes from Unicycler, Haslr and/or SPAdes.

Strategy 2

It produces a long-reads-only assembly and polishes it (corrects errors) with short reads using Pilon.

If polishing with Illumina paired-end reads, Pilon will be executed with the Unicycler-polish program, taking advantage of its ability to perform multiple rounds of polishing until changes are minimal.

Example:

```
# run the pipeline setting the desired hybrid strategy globally (for all samples)
nextflow run fmalmeida/mpgap \
  --output output \
  --threads 5 \
  --input "samplesheet.yml" \
  --hybrid_strategy "both"
```

🔥 This will perform, for all samples, both strategy 1 and strategy 2 hybrid assemblies. Please read more about it in the manual reference page and samplesheet reference page.

Usage

To understand pipeline usage and configuration, users must read the complete online documentation »

Using the configuration file

All parameters shown above can be, and are advised to be, set through the configuration file. When a configuration file is used, the pipeline is executed as nextflow run fmalmeida/mpgap -c ./configuration-file. Your configuration file tells the pipeline which type of data you have and which processes to execute, so it needs to be correctly configured.

To create a configuration file in your working directory:
```
nextflow run fmalmeida/mpgap --get_config
```
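
Once the configuration file has been edited, the run itself only needs Nextflow's `-c` option. The sketch below wraps that call in a hypothetical `run_with_config` helper that fails early when the file path is wrong, which avoids starting a run with default settings by accident. The file name is left to the caller because the name written by `--get_config` is not stated here.

```shell
# Hypothetical helper: launch the pipeline with a configuration file,
# stopping early if the file does not exist.
run_with_config() {
  config="$1"
  if [ ! -f "$config" ]; then
    echo "configuration file not found: $config" >&2
    return 1
  fi
  # -c is Nextflow's option for adding a configuration file to the run
  nextflow run fmalmeida/mpgap -c "$config"
}
```

Usage follows the form shown above: `run_with_config ./configuration-file`.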

Interactive graphical configuration and execution

Via NF tower launchpad (good for cloud env execution)

Nextflow has an awesome feature called NF tower. It allows users to quickly customise and set up the execution and configuration of cloud environments to execute any Nextflow pipeline from nf-core, GitHub (this one included), Bitbucket, etc. By having a compliant JSON schema for pipeline configuration, the configuration of parameters in NF tower becomes easier because the system renders an input form.

Check out more about this feature at: https://seqera.io/blog/orgs-and-launchpad/

Via nf-core launch (good for local execution)

Users can trigger a graphical and interactive pipeline configuration and execution by using the nf-core launch utility. nf-core launch will start an interactive form in your web browser or command line so you can configure the pipeline step by step and start its execution at the end.

```
# Install nf-core
pip install nf-core

# Launch the pipeline
nf-core launch fmalmeida/mpgap
```

It will result in the following:

Known issues

Whenever using Unicycler with unpaired reads, an odd platform-specific SPAdes-related crash seems to happen randomly, as can be seen in the issue discussed at rrwick/Unicycler#188.

As a workaround, Ryan says to use the --no_correct parameter, which solves the issue and does not have a negative impact on assembly quality.
Therefore, if you run into this error when using unpaired data, you can activate this workaround with:
- --unicycler_additional_parameters " --no_correct "

Sometimes, the Shovill assembler can fail and cause the pipeline to fail due to problems in estimating the genome size. This is actually super simple to solve! Instead of letting Shovill estimate the genome size, you can pass the information to it and prevent the failure:
- --shovill_additional_parameters " --gsize 3m "
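
The two workarounds above are ordinary pipeline parameters, so they are simply appended to the run command. The sketch below shows this with a hypothetical `run_with_workarounds` helper that takes the genome size as its first argument. Passing both flags at once is for illustration only; in practice, add just the one that matches the failure you hit. Note the quoting, which keeps the leading and trailing spaces inside each parameter value.

```shell
# Hypothetical helper: run the pipeline with both workarounds applied.
# Usage: run_with_workarounds <genome size, e.g. 3m> [pipeline options...]
run_with_workarounds() {
  if [ "$#" -lt 1 ]; then
    echo "usage: run_with_workarounds <genome size> [pipeline options...]" >&2
    return 2
  fi
  gsize="$1"
  shift
  nextflow run fmalmeida/mpgap "$@" \
    --unicycler_additional_parameters " --no_correct " \
    --shovill_additional_parameters " --gsize $gsize "
}
```

For example: `run_with_workarounds 3m --input "samplesheet.yml" --output output`.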

Citation

To cite this pipeline, users can use our Zenodo tag or cite it directly via the GitHub URL.

Please do not forget to cite the software used in this pipeline whenever you use its outputs. See the list of tools.

Mxrcon/MpGAP