DeepSimulator

Fork of lykaust15's DeepSimulator that I tweaked a bit to work on my macOS machine.

This is a fork of https://github.com/lykaust15/DeepSimulator/ with some quick and dirty hacks made to get it working on my MacBook Pro, without the need to install Anaconda.

Minimal usage instructions:

  • Download the CPU version of Guppy from https://community.nanoporetech.com/downloads and extract it to your home directory so that you have its executables in the ~/ont-guppy-cpu/bin/ directory.

  • Run the following installations with pip2:

    pip2 install tensorflow==1.2.1
    pip2 install tflearn==0.3.2
    pip2 install tqdm==4.19.4
    pip2 install scipy==0.18.1
    pip2 install h5py==2.7.1
    pip2 install numpy==1.13.1
    pip2 install scikit-learn==0.20.3
    
  • Run it. For example:

    ./deep_simulator.sh \
        -i ~/genomes/GCF_000001895.5_Rnor_6.0_genomic.fna \
        -n 10 \
        -o ~/simulated-reads
    

    This will output 10 simulated reads from the GCF_000001895.5_Rnor_6.0 genome into the ~/simulated-reads folder.
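
Before running the simulator, it may help to confirm that Guppy was extracted to the location described in the steps above. The following is a minimal sketch, not part of this repository: the function name check_guppy_bin is hypothetical, and it assumes the basecaller executable is named guppy_basecaller, as in standard Guppy distributions.

```shell
# Hypothetical helper (not part of this repository).
# Returns 0 if the given directory holds an executable guppy_basecaller.
# With no argument, it checks ~/ont-guppy-cpu/bin, the location named above.
check_guppy_bin() {
    guppy_dir="${1:-$HOME/ont-guppy-cpu/bin}"
    if [ -x "$guppy_dir/guppy_basecaller" ]; then
        echo "found: $guppy_dir/guppy_basecaller"
    else
        echo "missing: $guppy_dir/guppy_basecaller" >&2
        return 1
    fi
}
```

Calling check_guppy_bin with no argument checks the default location; passing a directory checks that directory instead.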

The README from the original repo is preserved below, but to use this fork, you should disregard it.


DeepSimulator

The first deep learning based Nanopore simulator which can simulate the process of Nanopore sequencing.

Paper: DeepSimulator: a deep simulator for Nanopore sequencing [PDF]

If you find this tool useful, please cite our work using the following reference:

@article{deepsimulator,
    author = {Li, Yu and Han, Renmin and Bi, Chongwei and Li, Mo and Wang, Sheng and Gao, Xin},
    title = {DeepSimulator: a deep simulator for Nanopore sequencing},
    journal = {Bioinformatics},
    volume = {34},
    number = {17},
    pages = {2899-2908},
    year = {2018},
    doi = {10.1093/bioinformatics/bty223},
    URL = {http://dx.doi.org/10.1093/bioinformatics/bty223},
    eprint = {/oup/backfile/content_public/journal/bioinformatics/34/17/10.1093_bioinformatics_bty223/2/bty223.pdf}
}

Overview

Here we propose a deep learning based simulator, DeepSimulator, to mimic the entire pipeline of Nanopore sequencing. Starting from a given reference genome or assembled contigs, we simulate the electrical current signals by a context-dependent deep learning model, followed by a base-calling procedure to yield simulated reads. This workflow mimics the sequencing procedure more naturally. The thorough experiments performed across four species show that the signals generated by our context-dependent model are more similar to the experimentally obtained signals than the ones generated by the official context-independent pore model. In terms of the simulated reads, we provide a parameter interface to users so that they can obtain the reads with different accuracies ranging from 83 to 97%. The reads generated by the default parameter have almost the same properties as the real data.

Install

Prerequisites

Anaconda2 (https://www.anaconda.com/distribution/) or Miniconda2 (https://conda.io/miniconda.html). For example, users may download and install the following Anaconda2 package:

wget https://repo.anaconda.com/archive/Anaconda2-2018.12-Linux-x86_64.sh
bash Anaconda2-2018.12-Linux-x86_64.sh

Download the DeepSimulator package

git clone https://github.com/lykaust15/DeepSimulator.git
cd ./DeepSimulator/

Install all required modules

./install.sh

Examples

Context-dependent pore model

./pore_model.sh example/001c577a-a502-43ef-926a-b883f94d157b.true_fasta 0

Context-independent kmer pore model (using official 6mer)

./pore_model.sh example/001c577a-a502-43ef-926a-b883f94d157b.true_fasta 1

Case study (case_study.sh file shows the three-step pipeline of our tool clearly)

./case_study.sh -f example/artificial_human_chr22.fasta

Simulate the signal and read for a given sequence

./deep_simulator.sh -i example/001c577a-a502-43ef-926a-b883f94d157b.true_fasta -n -1

Run a test to generate simulated signals and reads for a given genome

./deep_simulator.sh -i example/artificial_human_chr22.fasta

Explanation of the content in the output folder

Within the output folder, there are several folders and files. If you run

./deep_simulator.sh -i example/artificial_human_chr22.fasta

then, within the folder 'artificial_human_chr22_DeepSimu/', there are six files: 'processed_genome', 'sampled_read.fasta', 'pass.fastq', 'fail.fastq', 'mapping.paf', and 'accuracy'. There is one folder: 'fast5/'. Let us explain all of them in chronological order.

After receiving the original input genome file, we first perform some essential preprocessing, resulting in the file 'processed_genome'. We then run the first module, which samples reads from the processed genome, resulting in 'sampled_read.fasta'. Next, 'sampled_read.fasta' goes through the pore model, resulting in the 'fast5/' folder, where we store the simulated signals in FAST5 files. If option '-O 1' is specified, we create the 'align/' folder to store the repeat times for each position in each read. If option '-G 1' is specified, we create the 'signal/' folder to store the simulated signal in txt format for each read.
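
To see how many reads the sampling step produced, one can count the header lines in 'sampled_read.fasta'. This is a small sketch that relies only on the FASTA convention that each record starts with a line beginning with '>'; the function name count_fasta_records is hypothetical.

```shell
# Hypothetical helper: print the number of records in a FASTA file.
count_fasta_records() {
    [ -r "$1" ] || { echo "cannot read: $1" >&2; return 1; }
    # Each record has exactly one header line starting with '>'.
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c '^>' "$1" || true
}
```

For the run above, this would be called as: count_fasta_records artificial_human_chr22_DeepSimu/sampled_read.fasta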

Afterward, the 'fast5/' folder serves as the input of the base-caller (we use Guppy_GPU by default). We collect the results from the base-caller into two files, 'pass.fastq' and 'fail.fastq', which record the passed and failed reads. Finally, we check the accuracy using minimap2, whose output is 'mapping.paf'. The file 'accuracy' stores the accuracy for later reference.
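
As a rough cross-check on 'mapping.paf', one can compute an overall alignment identity directly from the PAF columns. In the PAF format, column 10 is the number of matching bases and column 11 is the alignment block length. The sketch below sums both over all alignment lines; it is an assumption that this is comparable to the value in 'accuracy', since the pipeline may compute it differently (for example, per read or after filtering). The function name paf_identity is hypothetical.

```shell
# Hypothetical helper: overall identity of a PAF file, computed as
# (sum of matching bases, column 10) / (sum of block lengths, column 11).
# Prints "NA" if the file has no usable alignment lines.
paf_identity() {
    awk -F'\t' '
        $11 > 0 { matches += $10; block += $11 }
        END {
            if (block > 0) printf "%.4f\n", matches / block
            else print "NA"
        }' "$1"
}
```

For the run above, this would be called as: paf_identity artificial_human_chr22_DeepSimu/mapping.paf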

Simulated VS original signal

[Figures: simulated signal; original signal]

Control the behavior of DeepSimulator

One can control the behavior of DeepSimulator, including the length distribution of the reads and the accuracy, by using different options in deep_simulator.sh. Detailed descriptions of the parameters of deep_simulator.sh are given in Section S4 of the Supplementary material of DeepSimulator.

Train customized model

Simple example

Our simulator supports training a pore model using a customized dataset. A simple example, which uses only CPU resources, looks like this:

./train_pore_model.sh -i example/customerized_data/

Within the data folder, two kinds of data should be provided: the sequence and the corresponding nanopore raw signal. Users can find an example of each file in the 'customized_data' folder. After training, a model (three files, named "model_customized.ckpt*") is generated in the folder 'pore_model/model'. The user can rename the built-in model (named "model_reg_seqs_gn179.ckpt*") to a backup name and rename the customized model to "model_reg_seqs_gn179.ckpt*" (all three files need to be renamed accordingly), so that the user does not have to change the simulator code to use the customized model.
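
The renaming step described above can be sketched as a small shell function. This is an illustration under stated assumptions, not a script shipped with the tool: the function name use_customized_model and the backup prefix "model_builtin_backup.ckpt" are hypothetical, and the sketch copies the customized files rather than renaming them, so the originals stay in place.

```shell
# Hypothetical helper: back up the built-in model and install the customized
# one under the built-in name. $1 is the model directory (default:
# pore_model/model, the folder named in the text above).
use_customized_model() {
    model_dir="${1:-pore_model/model}"
    builtin="model_reg_seqs_gn179.ckpt"
    custom="model_customized.ckpt"
    backup="model_builtin_backup.ckpt"   # hypothetical backup name

    # Refuse to run without a customized model, or if a backup already
    # exists (a second run would otherwise overwrite the real backup).
    ls "$model_dir/$custom"* >/dev/null 2>&1 || {
        echo "no $custom* files in $model_dir" >&2; return 1; }
    ls "$model_dir/$backup"* >/dev/null 2>&1 && {
        echo "backup already exists in $model_dir" >&2; return 1; }

    # Move each built-in file aside, keeping its suffix.
    for f in "$model_dir/$builtin"*; do
        [ -e "$f" ] || continue
        mv "$f" "$model_dir/$backup${f#"$model_dir/$builtin"}"
    done

    # Copy each customized file to the built-in name, keeping its suffix.
    for f in "$model_dir/$custom"*; do
        cp "$f" "$model_dir/$builtin${f#"$model_dir/$custom"}"
    done
}
```

To restore the built-in model later, the backup files would be renamed back to "model_reg_seqs_gn179.ckpt*" by hand.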

Notice: Generally, we do not recommend that users train a customized pore model, because data preparation and model training are quite time-consuming, and unexpected errors may arise from updates to Tensorflow and its dependencies, such as CUDA and cuDNN, which are notoriously troublesome. We will keep the model updated as Nanopore technology develops.

Advanced

The above example uses only the CPU, which would take years to train a model. To accelerate training and take advantage of the computational power of a GPU, users can consider using the GPU version of Tensorflow. Users should make sure the following dependencies are installed correctly before running the training code on a workstation with a GPU card.

  1. CUDA (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/#axzz4VZnqTJ2A)
  2. cuDNN (https://developer.nvidia.com/cudnn)
  3. Tensorflow-gpu (https://www.tensorflow.org/install/install_linux)

Users can refer to the Tensorflow website (https://www.tensorflow.org/) for more detailed instructions on setting up the environment.
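
A quick first check of the GPU environment is to see whether the NVIDIA command-line tools are on the PATH. The sketch below is only a partial check under stated assumptions: nvidia-smi is installed with the NVIDIA driver and nvcc with the CUDA toolkit, but neither confirms that cuDNN is present or that Tensorflow can actually use the GPU. The function name report_tool is hypothetical.

```shell
# Hypothetical helper: report whether a command is available on PATH.
# Returns non-zero if the command is not found.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
        return 1
    fi
}

# Example use (commented out so that sourcing this file has no side effects):
# for tool in nvidia-smi nvcc; do report_tool "$tool" || true; done
```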

This tool is for academic purposes and research use only. Any commercial use is subject to authorization from King Abdullah University of Science and Technology (KAUST). Please contact us at ip@kaust.edu.sa.