This is a Snakemake workflow for processing Oxford Nanopore Technologies (ONT) sequencing data (base-calling and de-multiplexing), followed by a basic quality control assessment.
While the pipeline should work for all types of ONT sequencing data, it has so far only been tested on data generated on the MinION flowcell FLO-MIN106 with the ligation sequencing kit SQK-LSK109. For other flowcell and kit combinations, you can specify the matching guppy base-calling profile in the Snakemake file.
The workflow assumes that it is started and run on a Unix-based system (Linux, Mac OS X).
- Conda with Python3
To run the workflow, a working installation of conda and Python3 is required. The easiest way of obtaining it is to download the Miniconda3 Python3 installer for your operating system from https://docs.conda.io/en/latest/miniconda.html and to follow the instructions on the website.
- Python packages
In addition to conda and Python3, two more Python modules have to be installed prior to usage: snakemake and pyfastx.
You can install these either using conda
conda install snakemake pyfastx
or via the Python module manager pip
pip install snakemake pyfastx
- ONT guppy
Finally, the workflow uses the Oxford Nanopore Technologies (ONT) software guppy, which is proprietary and only available via the ONT community website. Please download the binaries suitable for your operating system and unpack them.
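The prerequisites above can be checked before the first run. The following is a minimal sketch, not part of the workflow itself: the function names are made up for illustration, and it assumes that the guppy download provides an executable called guppy_basecaller in its bin sub-folder.

```python
import importlib.util
import shutil
from pathlib import Path


def missing_python_modules(modules=("snakemake", "pyfastx")):
    """Return the names of the modules that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def find_guppy(guppydir=None):
    """Look for guppy_basecaller in <guppydir>/bin first, then on the PATH.

    Returns the path as a string, or None if it was not found.
    """
    if guppydir is not None:
        candidate = Path(guppydir) / "bin" / "guppy_basecaller"
        if candidate.is_file():
            return str(candidate)
    return shutil.which("guppy_basecaller")


if __name__ == "__main__":
    print("conda found:", shutil.which("conda") is not None)
    print("missing Python modules:", missing_python_modules())
    print("guppy_basecaller:", find_guppy() or "not found")
```

Pass the directory into which you unpacked guppy to find_guppy() if its bin folder is not on your PATH.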
This snakemake workflow uses a configuration file in JSON format that we will provide as input to snakemake using the --configfile parameter. A template with the required fields is available under ONTseq_QC-config.template.
The configuration file expects the following values:
- projname: name of the project that will be used to create a sub-folder in the temporary directory
- datadir: directory that is searched for Fast5 files generated by the ONT sequencer
- tmpdir: directory in which the temporary output of the analysis will be stored
- projdir: directory in which the final output of the analysis will be stored
- expectedbarcodes: list of the barcodes, one per line, that were used during library preparation and are therefore expected to be observed, e.g. barcode01
- basecalltype: base-calling mode to be used in guppy: hac or fast
- guppydir: directory into which the files downloaded from the ONT community website were unpacked; should contain both the bin and the data sub-folders
Make a local copy of the template configuration file and fill in the information that fits your data.
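Before starting a run, it can help to check the filled-in configuration file against the fields listed above. The sketch below is an illustration only and not shipped with the workflow; the function names are hypothetical. It only tests that each field is present, that basecalltype is hac or fast, and that guppydir contains the bin and data sub-folders. It deliberately does not check the format of expectedbarcodes; consult the template for that.

```python
import json
from pathlib import Path

# Field names as listed in the configuration section above.
REQUIRED_KEYS = ("projname", "datadir", "tmpdir", "projdir",
                 "expectedbarcodes", "basecalltype", "guppydir")


def check_config(config):
    """Return a list of human-readable problems; empty if none were found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]

    mode = config.get("basecalltype")
    if mode is not None and mode not in ("hac", "fast"):
        problems.append(f"basecalltype must be 'hac' or 'fast', got {mode!r}")

    guppydir = config.get("guppydir")
    if guppydir is not None:
        for sub in ("bin", "data"):
            if not (Path(guppydir) / sub).is_dir():
                problems.append(f"guppydir has no '{sub}' sub-folder")
    return problems


def load_and_check(path):
    """Read a JSON configuration file and check its content."""
    with open(path) as handle:
        return check_config(json.load(handle))
```

Calling load_and_check() with the path to your local copy of the template returns an empty list if none of these checks fail.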
After setting up the required programs and generating the configuration file, we can start our workflow. For the execution on a local machine, we can simply run:
snakemake -s ONTseq_QC.Snakefile \
--configfile <config file> \
--use-conda \
--cores <number of cores>
Replace <config file> with the path to the configuration file that you just created and <number of cores> with the number of cores that you want to use on your system.
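The option --use-conda will force snakemake to create a conda environment for this pipeline with the name ONTseq_QC and install the additional software prerequisites on the fly. These additional tools include the quality control programs whose reports are listed in the output overview: PycoQC, NanoStat, FastQC, and MultiQC.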
If you want to avoid creating a new temporary conda environment over and over again, you can additionally specify the path in which this conda environment is created using --conda-prefix.
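If you start the workflow from a script, the local invocation with an optional --conda-prefix can be assembled as follows. This is only a sketch: the helper name is made up, and the configuration file name and environment directory in the example are placeholders that you need to replace.

```python
import shlex


def build_snakemake_command(configfile, cores, conda_prefix=None,
                            snakefile="ONTseq_QC.Snakefile"):
    """Assemble the local snakemake call as an argument list."""
    cmd = ["snakemake", "-s", snakefile,
           "--configfile", str(configfile),
           "--use-conda",
           "--cores", str(cores)]
    if conda_prefix is not None:
        # Create the conda environment in this directory so it can be reused.
        cmd += ["--conda-prefix", str(conda_prefix)]
    return cmd


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cmd = build_snakemake_command("my_run.json", 4,
                                  conda_prefix="/path/to/conda_envs")
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run() to execute the workflow.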
In case a computing cluster with scheduling software such as SLURM is available, we can submit the jobs to the cluster by using:
snakemake -s ONTseq_QC.Snakefile \
--configfile <config file> \
--cluster-config ONTseq_QC-SLURM.json \
--cluster "sbatch --mem {cluster.mem} -p {cluster.partition} -o {cluster.out} -e {cluster.err} -c {threads}" \
--use-conda \
--cores <number of cores>
The actual configuration of the scheduler will likely differ depending on your local set-up and can be adjusted by altering the JSON configuration file ONTseq_QC-SLURM.json.
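The placeholders {cluster.mem}, {cluster.partition}, {cluster.out} and {cluster.err} in the command above are filled from the cluster configuration file. In Snakemake's cluster configuration format, the entry __default__ applies to all rules and entries named after a rule override it. The sketch below illustrates this lookup. All values and the rule name basecalling are hypothetical, and the file shipped with this workflow may be laid out differently.

```python
import json

# Hypothetical example content; adjust memory, partition and log paths
# to your cluster. "%j" is replaced by the job ID by SLURM.
cluster_config = {
    "__default__": {
        "mem": "8G",
        "partition": "short",
        "out": "slurm-%j.out",
        "err": "slurm-%j.err",
    },
    # Hypothetical rule name: this rule requests more memory than the default.
    "basecalling": {"mem": "32G"},
}


def resolve(cluster_config, rule):
    """Return the settings for a rule: defaults, overridden by rule entries."""
    settings = dict(cluster_config.get("__default__", {}))
    settings.update(cluster_config.get(rule, {}))
    return settings


if __name__ == "__main__":
    print(json.dumps(resolve(cluster_config, "basecalling"), indent=2))
```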
During its runtime, the workflow writes all temporary output to a sub-folder named after the project (projname) inside the temporary folder (tmpdir), both specified in the configuration file. The final output is written to the project folder (projdir) specified in the same file.
Here is the overview of the output that is generated in the project folder:
- fastq: contains the base-called, de-multiplexed FastQ files, one file per expected barcode
- logs/nreads_per_barcode.txt: a table with the number of demultiplexed reads for every barcode observed
- logs/pycoqc: the HTML and JSON report generated by PycoQC across the whole sequencing run
- logs/nanostat: the simple text summary of the sequencing results per barcode produced by NanoStat
- logs/fastqc: the output of FastQC for each expected barcode
- logs/multiqc_report.html: the summary HTML report of the FastQC results by MultiQC across all barcodes
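After a run has finished, you can check whether all of the outputs listed above are present in the project folder. A minimal sketch, with a hypothetical function name, that only tests for the existence of each path and not for its content:

```python
from pathlib import Path

# Paths relative to the project folder, as listed in the overview above.
EXPECTED_OUTPUTS = (
    "fastq",
    "logs/nreads_per_barcode.txt",
    "logs/pycoqc",
    "logs/nanostat",
    "logs/fastqc",
    "logs/multiqc_report.html",
)


def missing_outputs(projdir):
    """Return the expected outputs that do not exist below projdir."""
    return [rel for rel in EXPECTED_OUTPUTS if not (Path(projdir) / rel).exists()]
```

An empty list means that every expected file or folder was found.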
The temporary output folder is not deleted automatically, so that the output of the individual programs can be inspected. If this temporary output is no longer required, it can simply be deleted.
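The clean-up can also be scripted. The sketch below assumes, as described above, that the temporary output lives in a sub-folder named after the project inside the temporary directory. The function name and the dry_run switch are illustrative and not part of the workflow.

```python
import shutil
from pathlib import Path


def remove_temporary_output(tmpdir, projname, dry_run=True):
    """Delete <tmpdir>/<projname> and return its path, or None if absent.

    With dry_run=True (the default) nothing is deleted; the path that
    would be removed is only returned for inspection.
    """
    if not projname:
        # Guard against deleting the whole temporary directory by accident.
        raise ValueError("projname must not be empty")
    target = Path(tmpdir) / projname
    if not target.is_dir():
        return None
    if not dry_run:
        shutil.rmtree(target)
    return target
```

Run it once with the default dry_run=True to confirm the path, then again with dry_run=False to delete the folder.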