The MinION device by Oxford Nanopore Technologies (ONT) is the first portable USB sequencing device, and it promises to play a unique part in the future of DNA sequencing.
Not only is it portable, but the underlying technology can also produce long reads (1 Mb), compared with the current status quo of short reads (100 to 300 bp).
However, it suffers from a high sequencing error rate.
The objective of this project is to apply deep neural network models to improve the base calling procedure. Initial base callers were based on Hidden Markov Models (HMMs); however, several deep neural network implementations have already been published: DeepNano (RNN) (Boža et al. 2017) and Chiron (CNN + RNN) (Teng et al. 2017).
The problem of base calling in computational biology runs parallel to machine translation in natural language processing (NLP) as both fields attempt to translate one sequence to another sequence.
Hence we can try to cross-pollinate methods between the two fields and see what results this experiment gives.
git clone --recursive https://github.com/etheleon/deepore.git
We are using NVIDIA's customised Docker wrapper, nvidia-docker, with an image based on the 8.0-cudnn6-runtime-ubuntu16.04 tag.
The image uses TensorFlow 1.3.0 with Python 2.7.
We modified the Docker setup from https://github.com/anurag/fastai-course-1.git
To start the container:
nvidia-docker run -it \
--entrypoint /bin/zsh \
-v /data/nanopore/new/fast5Dir/:/data \
--name nanopore \
-w /home/docker \
-p 8889:8888 \
etheleon/chiron
To start a new shell in an existing, running container:
containername="awesome_benz"
nvidia-docker exec -it $containername /bin/zsh
To train the model (E. coli):
- run preprocessing first
- run chiron_rcnn_train.py

Before training, remember to check 2 things:
- the raw file directory, containing the .signal and .label files, is set
- the logs directory, which by default points to /home/docker/out/logs. Remember to back up the contents of this folder if you're running a new model, otherwise the checkpoint data will be overwritten.
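Backing up the logs directory before a new run can be scripted. The following is a minimal sketch, not part of this repository: the function name `backup_logs` and the `<name>_backup_<timestamp>` naming scheme are assumptions made for illustration. It only uses the standard library and should run on both Python 2.7 and Python 3.

```python
import os
import shutil
import time


def backup_logs(log_dir, backup_root=None):
    """Copy log_dir to a timestamped folder and return the new path.

    Returns None when log_dir does not exist (nothing to back up).
    The naming scheme below is hypothetical, chosen for this example.
    """
    if not os.path.isdir(log_dir):
        return None
    log_dir = os.path.abspath(log_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # By default the backup is placed next to the logs directory.
    root = backup_root or os.path.dirname(log_dir)
    target = os.path.join(
        root, "{}_backup_{}".format(os.path.basename(log_dir), stamp)
    )
    # copytree raises an error if target already exists, so an earlier
    # backup is never silently overwritten.
    shutil.copytree(log_dir, target)
    return target
```

For the default location this would be called as `backup_logs("/home/docker/out/logs")` before starting a new model.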
For the ecoli dataset, the raw files are in /home/docker/ecoli/data/ecoli_raw
➜ deepore git:(master) ✗ ls ~/ecoli/data/ecoli_raw | head
nanopore2_20160728_FNFAB24462_MN17024_sequencing_run_E_coli_K12_1D_R9_SpotOn_2_40525_ch100_read381_strand1.label
nanopore2_20160728_FNFAB24462_MN17024_sequencing_run_E_coli_K12_1D_R9_SpotOn_2_40525_ch100_read381_strand1.signal
nanopore2_20160728_FNFAB24462_MN17024_sequencing_run_E_coli_K12_1D_R9_SpotOn_2_40525_ch100_read423_strand.label
nanopore2_20160728_FNFAB24462_MN17024_sequencing_run_E_coli_K12_1D_R9_SpotOn_2_40525_ch100_read423_strand.signal
export CUDA_VISIBLE_DEVICES="1"
newChiron=</path/2/new/chiron>
python $newChiron/chiron/chiron_rcnn_train.py
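Since training expects each read to have both a .signal and a .label file, it may be worth checking the raw directory for unpaired files before launching a long run. This is a small sketch under that assumption; `unpaired_reads` is a hypothetical helper, not a function provided by Chiron or this repository.

```python
import os


def unpaired_reads(raw_dir):
    """Return (signals_without_label, labels_without_signal).

    Both values are sorted lists of read names, i.e. file names
    with the .signal / .label extension removed.
    """
    signals, labels = set(), set()
    for name in os.listdir(raw_dir):
        stem, ext = os.path.splitext(name)
        if ext == ".signal":
            signals.add(stem)
        elif ext == ".label":
            labels.add(stem)
    return sorted(signals - labels), sorted(labels - signals)
```

For the E. coli dataset, `unpaired_reads("/home/docker/ecoli/data/ecoli_raw")` returning two empty lists would indicate that every read has both files.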
To run the original Chiron, the 8.0-cudnn5-runtime-ubuntu16.04 tag should be used, since TensorFlow 1.0.1 relies on cuDNN 5.
Reference sequence NC_000913
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.fna
- E. coli reads in fast5 format from Nic Loman's lab (link) [need citation]
Reference sequence NC_012532.1
The reads are amplicons from 36 primers designed to cover the whole Zika genome (Quick et al. 2017).
- A subset of 254 reads from the human genome (chromosome 12 part 9; Chiron used chromosome 23 part 3) from the Nanopore WGS Consortium [need citation]
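Once a reference such as NC_000913.fna has been downloaded, a quick sanity check is to parse the FASTA file and look at the record headers and sequence lengths. The parser below is a minimal sketch using only the standard library; `read_fasta` is a hypothetical helper written for this example and is not part of the preprocessing scripts.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    # Flush the last record in the file.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

For example, `[(h, len(s)) for h, s in read_fasta("NC_000913.fna")]` lists each record with its length, which makes a truncated download easy to spot.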
Starting from the sequence produced by the proprietary basecaller, we align the reads against the reference sequence NC_000913 to correct basecalling errors.
bash ./preprocessing/resquiggle.sh
Remember to edit the variables in resquiggle.sh before running it.
Dataset | # reads | Failed Alignment | Reference sequence |
---|---|---|---|
Ecoli | 164472 | 171 | NC_000913 |
Zika | 9608 | | NC_012532 |
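To put the failed alignments in proportion, the fraction of reads lost can be computed directly from the table. This is only illustrative arithmetic on the E. coli row (the Zika row has no failure count listed); `failed_fraction` is a hypothetical helper name.

```python
def failed_fraction(n_reads, n_failed):
    """Fraction of reads that failed to align."""
    return float(n_failed) / n_reads


# Values taken from the E. coli row of the table above.
ecoli = failed_fraction(164472, 171)
print("E. coli failed alignments: {:.3%}".format(ecoli))
```

For E. coli this comes to roughly 0.1% of reads, so the alignment step discards very little of the dataset.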
bash ./preprocessing/runraw.sh
Boža, V, Brejová, B, Vinař, T (2017). DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads. PLoS ONE, 12, 6:e0178751.
Quick, J, Grubaugh, N D, Pullan, S T, Claro, I M, Smith, A D, Gangavarapu, K, … Loman, N J (2017). Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nature Protocols, 12(6). http://doi.org/10.1038/nprot.2017.066
Teng, H, Hall, M B, Duarte, T, Cao, M D, Coin, L (2017). Chiron: Translating nanopore raw signal directly into nucleotide sequence using deep learning. bioRxiv.