a signal-level demultiplexer for Oxford Nanopore reads

Deepbinner

Deepbinner is a tool for demultiplexing barcoded Oxford Nanopore sequencing reads. It does this with a deep convolutional neural network classifier, using many of the architectural advances that have proven successful in image classification. Unlike other demultiplexers (e.g. Albacore and Porechop), Deepbinner identifies barcodes from the raw signal (a.k.a. squiggle), which gives it greater sensitivity and fewer unclassified reads.

  • Reasons to use Deepbinner:
    • To minimise the number of unclassified reads (use Deepbinner by itself).
    • To minimise the number of misclassified reads (use Deepbinner in conjunction with Albacore demultiplexing).
    • To run signal-level downstream analyses, like Nanopolish. Deepbinner can demultiplex the fast5 files, which makes this easier.
  • Reasons to not use Deepbinner:
    • You only have basecalled reads, not the raw fast5 files (which Deepbinner requires).
    • You have a small/slow computer. Deepbinner is more computationally intensive than Porechop.
    • You used a sequencing/barcoding kit other than the ones Deepbinner was trained on.

You can read more about Deepbinner in this preprint:
Wick RR, Judd LM, Holt KE. Deepbinner: Demultiplexing barcoded Oxford Nanopore reads with deep convolutional neural networks. bioRxiv. 2018; doi:10.1101/366526.

Requirements

Deepbinner runs on macOS and Linux and requires Python 3.5+.

Its most complex requirement is TensorFlow, which powers the neural network. TensorFlow can run on CPUs (easy to install, supported on many machines) or on NVIDIA GPUs (better performance). If you're only going to use Deepbinner to classify reads, you may not need GPU-level performance (read more here). But if you want to train your own Deepbinner neural network, then using a GPU is a necessity.

The simplest way to install TensorFlow for your CPU is with pip3 install tensorflow. Building TensorFlow from source may give slightly better performance (because it will use all instruction sets supported by your CPU) but the installation is more complex. If you are using Ubuntu and have an NVIDIA GPU, check out these instructions for installing TensorFlow with GPU support.

Deepbinner uses some other Python packages (Keras, NumPy and h5py) but these should be taken care of by pip when installing Deepbinner. It also assumes that you have gzip available on your command line. If you are going to train your own Deepbinner network, then you'll need a few more Python packages as well (see the training instructions).

Installation

Install from source

You can install Deepbinner using pip, either from a local copy:

git clone https://github.com/rrwick/Deepbinner.git
pip3 install ./Deepbinner
deepbinner --help

Or directly from GitHub:

pip3 install git+https://github.com/rrwick/Deepbinner.git
deepbinner --help

Run without installation

Deepbinner can be run directly from its repository by using the deepbinner-runner.py script, no installation required:

git clone https://github.com/rrwick/Deepbinner.git
Deepbinner/deepbinner-runner.py -h

If you run Deepbinner this way, it's up to you to make sure that all necessary Python packages are installed.
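
As a quick pre-flight check, the sketch below reports which of the packages named in the Requirements section cannot be found by your Python interpreter. The helper name `missing_packages` is made up for this example, and the import names are assumed to match the package names in lowercase. It only checks that each package is present, not that its version is compatible.

```python
import importlib.util

# Import names assumed for the packages listed under Requirements.
REQUIRED = ["tensorflow", "keras", "numpy", "h5py"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages appear to be installed.")
```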

Quick usage

Demultiplex native barcoding reads that are already basecalled:

deepbinner classify --native fast5_dir > classifications
deepbinner bin --classes classifications --reads basecalled_reads.fastq.gz --out_dir demultiplexed_reads

Demultiplex rapid barcoding reads that are already basecalled:

deepbinner classify --rapid fast5_dir > classifications
deepbinner bin --classes classifications --reads basecalled_reads.fastq.gz --out_dir demultiplexed_reads

Demultiplex native barcoding raw fast5 reads (potentially in real-time during a sequencing run):

deepbinner realtime --in_dir fast5_dir --out_dir demultiplexed_fast5s --native

Demultiplex rapid barcoding raw fast5 reads (potentially in real-time during a sequencing run):

deepbinner realtime --in_dir fast5_dir --out_dir demultiplexed_fast5s --rapid

The sample_reads.tar.gz file in this repository contains a small test set: six fast5 files and a FASTQ of their basecalled sequences. When classified with Deepbinner, you should get two reads each from barcodes 1, 2 and 3.
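
To confirm a result like this, you can tally the barcode calls in the classifications file. The following is a minimal sketch that assumes the file is tab-separated with the read ID in the first column and the barcode call in the second, possibly preceded by a header line beginning with `read_ID`. That layout is an assumption, so check it against your own output before relying on it. `count_barcodes` is a hypothetical helper, not part of Deepbinner.

```python
import collections
import csv


def count_barcodes(lines):
    """Tally reads per barcode call from tab-separated classification lines."""
    counts = collections.Counter()
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and the assumed header line.
        if len(row) < 2 or row[0] == "read_ID":
            continue
        counts[row[1]] += 1
    return counts
```

For example, `with open("classifications") as f: print(count_barcodes(f))` prints one count per barcode call.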

Available trained models

Deepbinner currently only provides pre-trained models for the EXP-NBD103 native barcoding expansion and the SQK-RBK004 rapid barcoding kit. See more details here.

If you have different data, then pre-trained models aren't available. If you have lots of existing data, you can train your own network. Alternatively, if you can share your data with me, I could train a model and make it available as part of Deepbinner. Let me know!

Using Deepbinner after basecalling

If your reads are already basecalled, then running Deepbinner is a two-step process:

  1. Classify reads using the fast5 files
  2. Organise the basecalled FASTQ reads into bins using the classifications

Step 1: classifying fast5 reads

This is accomplished using the deepbinner classify command, e.g.:

deepbinner classify --native fast5_dir > classifications

Since the native barcoding kit puts barcodes on both the start and end of reads, Deepbinner will look for both. Most reads should have a barcode at the start, but barcodes at the end are less common. If a read has conflicting barcodes at the start and end, it will be put in the unclassified bin. The --require_both option makes Deepbinner only bin reads with a matching start and end barcode, but this is very stringent and will result in far more unclassified reads. See more on the wiki: Combining start and end barcodes. None of this applies if you are using rapid barcoding reads (--rapid), as they only have a barcode at the start.
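
The rule described above can be illustrated with a short sketch. This is not Deepbinner's actual code: `combine_calls` is a hypothetical function, `None` stands for "no barcode found", and it assumes that a barcode found at only one end is accepted when `require_both` is off.

```python
def combine_calls(start, end, require_both=False):
    """Combine start and end barcode calls into a single bin name.

    `start` and `end` are barcode names, or None when no barcode was found.
    """
    if require_both:
        # Stringent mode: both ends must carry the same barcode.
        return start if start is not None and start == end else "none"
    if start is not None and end is not None:
        # Conflicting barcodes send the read to the unclassified bin.
        return start if start == end else "none"
    if start is not None:
        return start  # assumed: a start-only barcode is accepted
    if end is not None:
        return end  # assumed: an end-only barcode is accepted
    return "none"
```

With this rule, a read with only a start barcode is binned by default but becomes unclassified under `require_both=True`, which is why that option leaves many more reads unclassified.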

Here is the full usage for deepbinner classify.

Step 2: binning basecalled reads

This is accomplished using the deepbinner bin command, e.g.:

deepbinner bin --classes classifications --reads basecalled_reads.fastq.gz --out_dir demultiplexed_reads

This will leave your original basecalled reads in place, copying the sequences out to new files in your specified output directory. Both FASTA and FASTQ inputs are okay, gzipped or not. Deepbinner will gzip the binned reads at the end of the process.
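
To make the idea concrete, here is a simplified sketch of binning by classification. It is not how deepbinner bin is implemented. It assumes four-line FASTQ records, a tab-separated classifications file (read ID, then barcode call, with an optional `read_ID` header), and output files named after the barcode call. Reads without a classification go to a 'none' bin, and the sketch writes gzipped output directly rather than compressing at the end. All function names and file names are hypothetical.

```python
import gzip
import os


def open_reads(path):
    """Open a plain or gzipped text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def load_classes(path):
    """Map read ID to barcode call from a tab-separated file (assumed layout)."""
    classes = {}
    with open(path) as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2 and parts[0] != "read_ID":
                classes[parts[0]] = parts[1]
    return classes


def bin_fastq(classes, reads_path, out_dir):
    """Copy each four-line FASTQ record into a per-barcode gzipped file."""
    os.makedirs(out_dir, exist_ok=True)
    outputs = {}
    with open_reads(reads_path) as reads:
        while True:
            record = [reads.readline() for _ in range(4)]
            if not record[0].strip():
                break  # end of file (or a trailing blank line)
            # The read ID is the first word of the header, without the '@'.
            read_id = record[0][1:].split()[0]
            bin_name = classes.get(read_id, "none")
            if bin_name not in outputs:
                out_path = os.path.join(out_dir, "{}.fastq.gz".format(bin_name))
                outputs[bin_name] = gzip.open(out_path, "wt")
            outputs[bin_name].writelines(record)
    for handle in outputs.values():
        handle.close()
    return sorted(outputs)
```

The original reads file is only read, never modified, which mirrors the copy-not-move behaviour described above.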

Here is the full usage for deepbinner bin.

Using Deepbinner before basecalling

If you haven't yet basecalled your reads, you can use deepbinner realtime to bin the fast5 files, e.g.:

deepbinner realtime --in_dir fast5s --out_dir demultiplexed_fast5s --native

This command will move (not copy) fast5 files from the --in_dir directory to the --out_dir directory. As the command name suggests, this can be run in real-time – Deepbinner will watch the input directory and wait for new reads. Just set --in_dir to where MinKNOW deposits its reads. Or if you sequence on a laptop and copy the reads to a server, you can run Deepbinner on the server, watching the directory where the reads are deposited. Use Ctrl-C to stop it.

This command doesn't have to be run in real-time – it works just as well on a directory of fast5 files from a finished sequencing run.

Here is the full usage for deepbinner realtime (many of the same options as the classify command).

Using Deepbinner with Albacore demultiplexing

If you use both Deepbinner and Albacore to demultiplex reads, only keeping reads for which both tools agree on the barcode, you can achieve very low rates of misclassified reads (i.e. high precision, also known as positive predictive value), but a larger proportion of reads will not be classified (put into the 'none' bin). This is what I usually do with my sequencing runs!

The easiest way to achieve this is to follow the Using Deepbinner before basecalling instructions above. Then run Albacore separately on each of Deepbinner's output directories, with its --barcoding option on. You should find that for each bin, Albacore puts most of the reads in the same bin (the reads we want to keep), some in the unclassified bin (slightly suspect reads, likely with lower quality basecalls) and a small number in a different bin (very suspect reads).
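
The agreement rule itself is simple, and can be sketched as follows. This assumes you already have each tool's call per read as a dictionary mapping read ID to barcode name. How you extract those calls from each tool's output is not shown, and `consensus_bins` is a hypothetical helper.

```python
def consensus_bins(deepbinner_calls, albacore_calls):
    """Keep a read's barcode only when both tools agree; otherwise use 'none'."""
    consensus = {}
    for read_id, barcode in deepbinner_calls.items():
        agree = albacore_calls.get(read_id) == barcode
        consensus[read_id] = barcode if agree else "none"
    return consensus
```

Reads that Albacore leaves unclassified or assigns to a different barcode both end up in the 'none' bin here. The directory-based approach above keeps those two cases apart, so you can inspect them separately.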

Here are some instructions and Bash code to carry this out automatically.

Performance

Deepbinner lives up to its name by using a deep neural network. It's therefore not particularly fast, but should be fast enough to keep up with a typical MinION run. If you want to squeeze out a bit more performance, try adjusting the 'Performance' options. Read more here for a detailed description of these options. In my tests, it can classify about 15 reads/sec using 12 threads (the default). Giving it more threads helps a little, but not much.

Building TensorFlow from source may give better performance (because it can then use all available instruction sets on your CPU). Running TensorFlow on a GPU will definitely give better Deepbinner performance: my tests on a Tesla K80 could classify over 100 reads/sec.
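
To put those rates in context, this small calculation estimates classification time for a run of a given size. The run size of one million reads is a hypothetical figure chosen for illustration. The two rates are the ones quoted above.

```python
def hours_to_classify(read_count, reads_per_sec):
    """Estimate wall-clock hours needed to classify `read_count` reads."""
    return read_count / reads_per_sec / 3600


run_size = 1000000  # hypothetical read count, chosen only for illustration

cpu_hours = hours_to_classify(run_size, 15)   # CPU rate quoted above
gpu_hours = hours_to_classify(run_size, 100)  # Tesla K80 rate quoted above
print("CPU: {:.1f} h, GPU: {:.1f} h".format(cpu_hours, gpu_hours))
```

At these rates, one million reads would take about 18.5 hours on the CPU and under 2.8 hours on the GPU.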

Training

You can train your own neural network with Deepbinner, but you'll need two things:

  • Lots of training data using the same barcoding and sequencing kits. More is better, so ideally from more than one sequencing run.
  • A fast computer to train on, ideally with TensorFlow running on a big GPU.

If you can meet those requirements, then read on in the Deepbinner training instructions!

Contributing

As always, the wider community is welcome to contribute to Deepbinner by submitting issues or pull requests.

I also have a particular need for one kind of contribution: training reads! The lab where I work has mainly used R9.4/R9.5 flowcells with the SQK-LSK108 kit. If you have other types of reads that you can share, I'd be interested (see here for more info).

Acknowledgments

I would like to thank James Ferguson from the Garvan Institute. We met at the Nanopore Day Melbourne event in February 2018 where I saw him present on raw signal detection of barcodes. It was then that the seeds of Deepbinner were sown!

I'm also indebted to Matthew Croxen for sharing his SQK-RBK004 rapid barcoding reads with me – they were used to build Deepbinner's pre-trained model for that kit.

License

GNU General Public License, version 3