CONNET: Accurate Genome Consensus in Assembling Nanopore Sequencing Data via Deep Learning

Introduction

Single-molecule sequencing technologies produce much longer reads than next-generation sequencing, greatly improving the contiguity of de novo genome assemblies. However, the relatively high error rate of long reads makes it challenging to obtain high-quality assemblies, and a computationally intensive consensus step is needed to resolve discrepancies among the reads. Efficient consensus tools based on partial-order alignment have emerged recently. In this study, we found that the spatial relationship of the alignment pileup is crucial to high-quality consensus and developed a deep learning-based consensus tool, CONNET, which outperforms the fastest partial-order-alignment-based tools in both accuracy and speed. We tested CONNET on a 90x E. coli dataset and a 37x human dataset. In addition to achieving high-quality consensus results, CONNET can deliver phased diploid genome consensus. Diploid consensus on the above human assembly eliminated a further 12% of the consensus errors present in the haploid results.

Installation

# make sure the following tools are installed
samtools 
minimap2
parallel
python2

# make sure the following Python packages are installed
tensorflow == 1.13.1
keras == 2.2.4
numpy == 1.16.4

git clone https://github.com/HKU-BAL/CONNET.git
cd CONNET

python2 setup.py build_ext --inplace
# This compiles `parse_pileup.so` in the current folder.

export CONNET=$PWD/connet.py 
export CONNET_DIPLOID=$PWD/diploid.sh
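
Before running anything, it can help to confirm that the command-line tools listed above are reachable on PATH. The snippet below is a minimal sketch, not part of CONNET: `missing_tools` is a hypothetical helper name, and it relies on `shutil.which`, so it should be run with Python 3 (separately from the python2 environment CONNET itself uses).

```python
import shutil

# Tool names taken from the installation list above.
REQUIRED_TOOLS = ["samtools", "minimap2", "parallel", "python2"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH, in input order.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

The same check can be applied to the extra tools needed for diploid consensus by passing a different list.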

Quick demo

Step 1. Install
Step 2. Obtain sample input

bash sample_data/download.sh

Step 3. Run

mkdir ecoli_demo
cd ecoli_demo
python2 $CONNET ../models/ecoli.model1 ../models/ecoli.model2 ../sample_data/ecoli_raw_reads.fq ../sample_data/ecoli_draft_assembly.fa

Step 4. The final result is at 2.fa

By default, CONNET runs for 2 iterations.

The result from iteration 1 is at 1.fa
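
A quick sanity check on the two outputs is to compare their sequence counts and total lengths. The following is a minimal sketch, assuming 1.fa and 2.fa are plain-text FASTA files in the current directory; `fasta_stats` is a hypothetical helper, not something CONNET provides.

```python
import os


def fasta_stats(path):
    """Return (number of sequences, total bases) for a plain-text FASTA file."""
    n_seqs = 0
    n_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                n_seqs += 1  # header line starts a new sequence
            else:
                n_bases += len(line)  # sequence line
    return n_seqs, n_bases


if __name__ == "__main__":
    # Report only the iteration outputs that actually exist.
    for name in ("1.fa", "2.fa"):
        if os.path.exists(name):
            print(name, fasta_stats(name))
```

This only reports counts and lengths; it does not measure consensus accuracy.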

Pretrained Models

Included at models/

Trained on E. coli: models/ecoli.*
Trained on H. sapiens chromosome 1: models/human.chr1.*

N.B. The correction phase and the recovery phase are trained separately: *.model1 is trained for the correction phase and *.model2 for the recovery phase. The two models are not interchangeable, and both are required.
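
Because both models are required and must not be swapped, a small wrapper can derive the pair from a shared prefix such as models/ecoli. This is a minimal sketch under the assumption that each model lives at exactly `<prefix>.model1` and `<prefix>.model2`; `model_pair` and `check_model_pair` are hypothetical helpers.

```python
import os


def model_pair(prefix):
    """Return (correction_model, recovery_model) paths for a model prefix.

    Follows the naming used in models/: <prefix>.model1 is the correction-phase
    model and <prefix>.model2 is the recovery-phase model.
    """
    return prefix + ".model1", prefix + ".model2"


def check_model_pair(prefix, exists=os.path.exists):
    """Return both model paths, or raise IOError if either one is missing."""
    pair = model_pair(prefix)
    missing = [path for path in pair if not exists(path)]
    if missing:
        raise IOError("missing model file(s): " + ", ".join(missing))
    return pair
```

For example, `model_pair("models/human.chr1")` yields the two paths in the order CONNET expects them on the command line.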

General usage

Haploid consensus

# haploid consensus
mkdir new_experiment
cd new_experiment
python2 $CONNET model1 model2 raw_reads.fa draft_assembly.fa

Diploid consensus

# make sure whatshap, bgzip and tabix are installed
mkdir new_experiment
cd new_experiment
bash $CONNET_DIPLOID model1 model2 raw_reads.fa draft_assembly.fa
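
When scripting several runs, the two invocations above can be assembled from the CONNET and CONNET_DIPLOID environment variables set during installation. The function below is a minimal sketch; `connet_command` is a hypothetical helper that only mirrors the argument order shown above and adds no new options.

```python
import os


def connet_command(mode, model1, model2, reads, draft, env=None):
    """Build the argument list for a haploid or diploid CONNET run.

    Reads the script locations from the CONNET / CONNET_DIPLOID variables
    exported during installation; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    if mode == "haploid":
        return ["python2", env["CONNET"], model1, model2, reads, draft]
    if mode == "diploid":
        return ["bash", env["CONNET_DIPLOID"], model1, model2, reads, draft]
    raise ValueError("mode must be 'haploid' or 'diploid'")
```

The returned list can be passed to `subprocess.check_call` from inside the experiment directory.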

Notes

CONNET was benchmarked on a workstation with a 24-core Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz.

For machines with fewer processors, reduce T (number of threads) in connet.py.
For machines with limited memory, reduce PHASE1_BATCHSIZE and PHASE2_BATCHSIZE (in bp) in connet.py.
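
One way to choose a value for T is to derive it from the number of cores on the machine. The sketch below is a heuristic only: capping at the 24 cores of the benchmark workstation and leaving one core free are assumptions made for illustration, not CONNET defaults, and the result still has to be written into connet.py by hand.

```python
import multiprocessing

BENCHMARK_CORES = 24  # core count of the benchmark workstation noted above


def suggested_threads(available=None, reserve=1):
    """Suggest a thread count for T.

    Leaves `reserve` cores free, caps at the benchmark core count
    (an assumption, not a CONNET limit), and never returns less than 1.
    """
    if available is None:
        available = multiprocessing.cpu_count()
    return max(1, min(BENCHMARK_CORES, available - reserve))


if __name__ == "__main__":
    print("Suggested T:", suggested_threads())
```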
