/nanocall

An Oxford Nanopore Basecaller

Primary LanguageC++MIT LicenseMIT

Nanocall: An Oxford Nanopore Basecaller

http://travis-ci.org/mateidavid/nanocall.svg?branch=master http://img.shields.io/:license-mit-blue.svg

Introduction

Nanocall is an alternative, open source, MIT licensed, basecaller for Oxford Nanopore Technologies (ONT) sequencing data. Published in Bioinformatics, 2016.

For the official ONT basecaller, see Metrichor.

Usefulness of Nanocall on recent ONT sequencing data

To understand the usefulness of Nanocall compared to Metrichor, some background is in order.

Before summer 2016: Before the summer of 2016, and the release of the R9 sequencing pore:

  • Metrichor was the only available basecaller for ONT data.
  • Metrichor’s source code was closed.
  • Metrichor was only available as a cloud service.

This state of affairs prompted us to develop Nanocall as an open-source local basecaller alternative to Metrichor.

After summer 2016: The summer of 2016 has brought along several significant developments from ONT:

As a result, Nanocall’s usefulness is now limited to:

  • a platform for developing new basecalling ideas, and
  • situations where, for various reasons, you do not have access to the official ONT basecaller(s).

If you want to use Nanocall on R9 data, Nanocall does support it directly, but its accuracy is significantly lower than that of Metrichor (unlike the case of R7.3, where the two had similar accuracy). The reason for the discrepancy is that Metrichor on R9 uses a more elaborate RNN-based approach, compared to the simple HMM-based one in Nanocall.

Levels of ONT sequencing data

Most people are only used to dealing with DNA bases. However, to understand where Nanocall fits in, we observe that there are 3 levels of ONT sequencing data:

  • Raw samples. These are direct (picoamp) current measurements, taken at preset intervals as the DNA molecule is threaded through the pore. This data is passed through the USB cable from the MinION to the controlling laptop running MinKNOW. These are stored in fast5 files at paths such as /Raw/Reads/Read_29/Signal.
  • Events. Each event is an aggregation of multiple consecutive raw samples, (ideally) corresponding to a certain DNA context found in the pore. The process of computing events from raw samples is referred to as event detection. These are stored in fast5 files at paths such as /Analyses/EventDetection_000/Reads/Read_29/Events.
  • DNA bases. These are the usual, finished product. The process of computing DNA bases from events is referred to as basecalling. These are stored in fast5 files at paths such as /Analyses/Basecall_2D_000/BaseCalled_2D/Fastq.

On R7.3, event detection was performed locally by MinKNOW, and events were passed on to, and used by Metrichor. Since Nanocall was developed as an alternative local basecaller for R7.3 data, Nanocall is designed to work with events, not with raw samples.

On (at least some versions of) R9, Metrichor would entirely redo the event detection directly from raw samples, disregarding any event detection done locally by MinKNOW. As such, it is less uncommon with R9 (than with R7.3) to see fast5 files without events. Nanocall cannot be run directly on such files. To use Nanocall on R9 data, you must either configure MinKNOW to perform local event detection, or pass the files through Metrichor to use its event detection.

Installation

Nanocall can be built from source in a classical UNIX environment, or directly under Docker. The Docker build might run under Windows, though this is not tested.

Under a Classical UNIX Environment

Nanocall uses cmake for configuration and make for building. The prerequisites needed for building are zlib and hdf5. On UNIX systems, hdf5 can be optionally built as a submodule. Example build:

mkdir /some/source/dir && cd /some/source/dir
git clone --recursive https://github.com/mateidavid/nanocall.git
cd nanocall
mkdir build && cd build
cmake ../src [-DCMAKE_INSTALL_PREFIX=/some/install/dir] [-DBUILD_HDF5=1] [-DHDF5_ROOT=/path/to/hdf5]
make
make install
/some/install/dir/bin/nanocall --version

Notes:

  • The default install prefix is /usr/local.
  • Setting BUILD_HDF5 will cause hdf5 to be downloaded and built as a submodule.
  • Setting HDF5_ROOT is only necessary if a copy of hdf5 is installed in a non-standard location. This is not needed when BUILD_HDF5 is used.

Under Docker

To avoid dealing with prerequisites, Nanocall can be conveniently built under Docker. The installation and configuration of Docker itself is outside of the scope of this document.

Simple “fat” build

The simplest way to run Nanocall under Docker is:

docker build -t nanocall https://github.com/mateidavid/nanocall.git
docker run --rm nanocall --version
docker run --rm -u $(id -u):$(id -g) -v /path/to/data:/data nanocall -t 4 . >output.fa

Howver, there are several problems with this build:

  • The docker image is “fat”, in that it contains all the build time dependencies of Nanocall, which are not needed at run time.
  • Without using -u, the image will create files with a UID of 0 on the mounted volumes of the host. To remove them, you will have to use sudo rm or sudo chown.
  • The timezone inside the image might be different from the host. This might confuse programs which depend on comparing modification times, most notably make.
Alternate “slim” build

To alleviate the problems mentioned above, you can build a “slim” Docker image as follows:

git clone --recursive --depth 1 https://github.com/mateidavid/nanocall.git
nanocall/script/build-slim-docker-image
docker run --rm nanocall --version
docker run --rm -v /path/to/data:/data nanocall -t 4 . >output.fa

Usage Examples

# Check version
nanocall --version

# Check command line parameters
nanocall --help

# Run on single file, save output and log
nanocall /path/to/file.fast5 >output.fa 2>log

# Run on directory, using 24 threads, discard log
nanocall -t 24 /path/to/data >output.fa 2>/dev/null

# Run on file-of-file-names
nanocall /path/to/files.fofn >output.fa

# Run Docker build on directory, using 4 threads
# Note: -u is not needed with the "slim" build
docker run --rm -u $(id -u):$(id -g) -v /path/to/data:/data nanocall -t 4 . >output.fa

License

Released under the MIT license.