ONTseq pipeline

This is a Snakemake workflow for processing Oxford Nanopore Technology (ONT) sequencing data (base-calling and de-multiplexing) followed by some basic quality control assessment.

While the pipeline should be universal for all types of ONT sequencing data, it has so far only been tested for data generated on the MinION flowcell FLO-MIN106 with the ligation sequencing kit SQK-LSK109. However, you can specify the correct profile for base-calling with guppy in the Snakemake file.

The workflow assumes that it is started and run on a Unix-based system (Linux, Mac OS X).

Prerequisites

Conda with Python3

In order to be able to run the workflow, a working installation of conda and Python3 is required. This is easiest way of obtaining it is to download the Miniconda3 Python3 installer from https://docs.conda.io/en/latest/miniconda.html for your operating system and to follow the instructions on the website.

Python packages

Next to conda and Python3, there are two more Python modules that have to be installed prior to usage: snakemake and pyfastx.

You can install these either using conda

conda install snakemake pyfastx

or via the Python module manager pip

pip install snakemake pyfastx

ONT guppy

Finally, the workflow uses the Oxford Nanopore Technology (ONT) software guppy, which is proprietary and only available via the ONT community website. Please download the binaries suitable for your operating system and unpack them.

Workflow configuration

This snakemake workflow uses a configuration file in JSON format that we will provide as input to snakemake using the --configfile parameter. A template with the required fields is available under ONTseq_QC-config.template.

The configuration file expects the following values:

projname: name of the project that will be used to create a sub-folder in temporary directory
datadir: directory that is used to search for Fast5 files generate by the ONT sequencer
tmpdir: directory in which the temporary output of the analysis will be stored
projdir: directory in which the final output of the analysis will be stored
expectedbarcodes: list of the barcodes, one per line, that were used during library preparation and are therefore expected to be observed, e.g. barcode01
basecalltype: base calling mode to be used in guppy: hac or fast
guppydir: directory to which the files downloaded from the ONT community website were installed to; should contain both the bin as well as the data sub-folders

Make a local copy of the template configuration file and fill in the information that fit your data.

Running the workflow

After setting up the required programs and generating the configuration file, we can start our workflow. For the execution on a local machine, we can simply run:

snakemake -s ONTseq_QC.Snakefile \
          --configfile <config file> \
          --use-conda \
          --cores <number of cores>

Please change the path of the config file to the config file that you just created and adjust the number of cores to the desired number available on your system.

The option --use-conda will force snakemake to create a conda environment for this pipeline with the name ONTseq_QC and install the additional software prerequisites on the fly. These additional tools are:

If you want to avoid having to create a new temporary conda environment over and over again, you can additional specify the path in which this conda environment is created using --conda-prefix.

In case a computing cluster with a scheduling software such as SLURM is available, we can enable the submission to the cluster by using:

snakemake -s ONTseq_QC.Snakefile \
          --configfile <config file> \
          --cluster-config ONTseq_QC-SLURM.json \
          --cluster "sbatch --mem {cluster.mem} -p {cluster.partition} -o {cluster.out} -e
          {cluster.err} -c {threads}" \
          --use-conda \
          --cores <number of cores>

The actual configuration of scheduler might likely differ due to your local set-up and can be adjusted by altering the JSON configuration file ONTseq_QC-SLURM.json.

Output

During its runtime, the workflow will write all temporary output to the temporary folder inside the sub-folder of the project name, both specified in the configuration file. The final output will be written to the project folder specified in the same file.

Here is the overview of the output that is generated in the project folder:

fastq: contains the base-called, de-multiplexed FastQ files, one file per expected barcode
logs/nreads_per_barcode.txt: a table with the number of demultiplexed reads for every barcode observed
logs/pycoqc: the HTML and JSON report generated by PycoQC across the whole sequencing run
logs/nanostat: the simple text summary of the sequencing results per barcode produced by NanoStat
logs/fastqc: the output of FastQC for each expected barcode
logs/multiqc_report.html: the summary HTML report of the FastQC results by MultiQC across all barcodes

The temporary output folder will not be deleted automatically in order to provide the possibility for inspecting the output of the individual programs. If this temporary output is not required any longer, it can be just simply deleted.

alexhbnr/ONTseq_QC

ONTseq pipeline

Prerequisites

Workflow configuration

Running the workflow

Output