/NanoSim-H

NanoSim-H: a simulator of Oxford Nanopore reads; a fork of NanoSim.

Primary LanguagePythonOtherNOASSERTION

NanoSim-H

https://travis-ci.org/karel-brinda/NanoSim-H.svg?branch=master https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square

About

NanoSim-H is a simulator of Oxford Nanopore reads that captures the technology-specific features of ONT data, and allows for adjustments upon improvement of Nanopore sequencing technology. NanoSim-H has been derived from NanoSim, a software package developed by Chen Yang at Canada's Michael Smith Genome Sciences Centre. The fork was created from version 1.0.1 and the versions of NanoSim-H and NanoSim are kept synchronized.

NanoSim-H is implemented using Python uses R for model fitting. In silico reads can be simulated from a given reference genome using nanosim-h. The NanoSim-H package is distributed with several precomputed error profiles, but additional profiles can be computed using the nanosim-h-train.

The main improvements compared to NanoSim are:

  • Support for Python 3
  • Support for RNF read names
  • Installation from PyPI
  • Error profiles distributed with the main package
  • Automatic testing using Travis
  • Reproducible simulations (setting a seed for PRG)
  • Improved interface with new parameters (e.g., for merging all contigs) and a progress bar
  • Various bugs fixed

Quick example

Simulation of 100 reads from an E.coli genome.

pip install --upgrade nanosim-h
curl "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=fasta&val=545778205&sendto=on" | \
        nanosim-h -n 100 -

Installation

From BioConda (recommended):

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda install -y nanosim-h

From PyPI :

pip install --upgrade nanosim-h

From Github:

git clone https://github.com/karel-brinda/nanosim-h
cd nanosim-h
pip install --upgrade .

or

git clone https://github.com/karel-brinda/nanosim-h
cd nanosim-h
python setup.py install

Dependencies:

For read simulation:

For computing new error profiles:

  • LAST (tested with version 847)
  • R

When installed using Bioconda, all NanoSim-H dependencies get installed automatically. When installed using PIP, all dependencies for read simulation are installed automatically.

Read simulation

Simulation stage takes a reference genome and possibly a read profile as input, and outputs simulated reads in FASTA format. At this point, NanoSim-H supports uncompressed files only (no gzip).

$ nanosim-h --help
usage: nanosim-h [-h] [-v] [-p str] [-o str] [-n int] [-u float] [-m float]
                 [-i float] [-d float] [-s int] [--circular] [--perfect]
                 [--merge-contigs] [--rnf] [--rnf-add-cigar] [--max-len int]
                 [--min-len int] [--kmer-bias int]
                 <reference.fa>

Program:  NanoSim-H - a simulator of Oxford Nanopore reads.
Version:  1.1.0.4
Authors:  Chen Yang <cheny@bcgsc.ca> - author of the original software package (NanoSim)
          Karel Brinda <kbrinda@hsph.harvard.edu> - author of the NanoSim-H fork

positional arguments:
  <reference.fa>        reference genome (- for standard input)

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -p str, --profile str
                        error profile - one of precomputed profiles
                        ('ecoli_R7.3', 'ecoli_R7', 'ecoli_R9_1D',
                        'ecoli_R9_2D', 'yeast', 'ecoli_UCSC1b') or own
                        directory with an error profile [ecoli_R9_2D]
  -o str, --out-pref str
                        prefix of output file [simulated]
  -n int, --number int  number of generated reads [10000]
  -u float, --unalign-rate float
                        rate of unaligned reads [detect from the error
                        profile]
  -m float, --mis-rate float
                        mismatch rate (weight tuning) [1.0]
  -i float, --ins-rate float
                        insertion rate (weight tuning) [1.0]
  -d float, --del-rate float
                        deletion rate (weight tuning) [1.0]
  -s int, --seed int    initial seed for the pseudorandom number generator (0
                        for random) [42]
  --circular            circular simulation (linear otherwise)
  --perfect             output perfect reads, no mutations
  --merge-contigs       merge contigs from the reference
  --rnf                 use RNF format for read names
  --rnf-add-cigar       add cigar to RNF names (not fully debugged, yet)
  --max-len int         maximum read length [inf]
  --min-len int         minimum read length [50]
  --kmer-bias int       prohibits homopolymers with length >= n bases in
                        output reads [6]

Examples: nanosim-h --circular ecoli_ref.fasta
          nanosim-h --circular --perfect ecoli_ref.fasta
          nanosim-h -p yeast --kmer-bias 0 yeast_ref.fasta

Notice: the use of `max-len` and `min-len` will affect the read length distributions. If
the range between `max-len` and `min-len` is too small, the program will run slowlier accordingly.

Examples:

  1. If you want to simulate reads from E. coli genome, then circular mode should be used because it is a circular genome.

    nanosim-h --circular Ecoli_ref.fasta

  2. If you want to simulate only perfect reads, i.e. no SNPs, or indels, just simulate the read length distribution.

    nanosimh-h --circular --perfect Ecoli_ref.fasta

  3. If you want to simulate reads from a S. cerevisiae genome with no k-mer bias, then linear mode should be chosen because it is a linear genome.

    nanosimh-h -p yeast --kmer-bias 0 yeast_ref.fasta

Output files:

  1. simulated.log – Log file for simulation process.

  2. simulated.fa – FASTA file of simulated reads. Reads can contain information about how they were created either in RNF, or in the original NanoSim naming convention.

    RNF naming convention

    See the associated RNF paper and RNF specification.

    NanoSim naming convention

    Each reads has "unaligned", "aligned", or "perfect" in the header determining their error rate. "unaligned" means that the reads have an error rate over 90% and cannot be aligned. "aligned" reads have the same error rate as training reads. "perfect" reads have no errors.

    To explain the information in the header, we have two examples:

    • >ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0
      All information before the first _ are chromosome information. 468529 is the start position and unaligned suggesting it should be unaligned to the reference. The first 0 is the sequence index. F represents a forward strand. 0_3236_0 means that sequence length extracted from the reference is 3236 bases.
    • >ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2
      This is an aligned read coming from chromosome XI at position 115406. 16565 is the sequence index. R represents a reverse complement strand. 92_12710_2 means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.

    The information in the header can help users to locate the read easily.

  3. simulated.errors.txt – List of introduced errors.

    The output contains error type, position, original bases and current bases.

Error profiles

Characterization stage takes a reference and a training read set in FASTA format as input. User can also provide their own alignment file in MAF format.

Profiles distributed with NanoSim-H:

  • ecoli_R7
  • ecoli_R7.3
  • ecoli_R9_1D
  • ecoli_R9_2D (default error profile for read simulation)
  • ecoli_UCSC1b
  • yeast

New error profiles:

A new error profile can be obtained using the nanosim-h-train command.

$ nanosim-h-train --help
usage: nanosim-h-train [-h] [-v] [-i str] [-m str] [-b int] [--no-model-fit]
                       <reference.fa> <profile.dir>

Program:  NanoSim-H-Train - compute an error profile for NanoSim-H.
Version:  1.1.0.4
Authors:  Chen Yang <cheny@bcgsc.ca> - author of the original software package (NanoSim)
          Karel Brinda <kbrinda@hsph.harvard.edu> - author of the NanoSim-H fork

positional arguments:
  <reference.fa>        reference genome of the training reads
  <profile.dir>         error profile dir

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i str, --infile str  training ONT real reads, must be fasta files
  -m str, --maf str     user can provide their own alignment file, with maf
                        extension
  -b int, --num-bins int
                        number of bins (for development) [20]
  --no-model-fit        no model fitting

Files associated with an error profile:

  1. aligned_length_ecdf – Length distribution of aligned regions on aligned reads.
  2. aligned_reads_ecdf – Length distribution of aligned reads.
  3. align_ratio – Empirical distribution of align ratio of each read.
  4. besthit.maf – The best alignment of each read based on length.
  5. match.hist, mis.hist, ins.hist, del.hist – Histograms of matches, mismatches, insertions, and deletions.
  6. first_match.hist – Histogram of the first match length of each alignment.
  7. error_markov_model – Markov model of error types.
  8. ht_ratio – Empirical distribution of the head region vs total unaligned region.
  9. training.maf – The output of LAST, alignment file in MAF format.
  10. match_markov_model – Markov model of the length of matches (stretches of correct base calls).
  11. model_profile – Fitted model for errors.
  12. processed.maf – A re-formatted MAF file for user-provided alignment file.
  13. unaligned_length_ecdf – Length distribution of unaligned reads

Cite

If you use NanoSim-H in your work, please cite both the original NanoSim paper and the NanoSim-H Zenodo lineage.

[1] Chen Yang, Justin Chu, René L Warren, Inanç Birol; NanoSim: nanopore sequence read simulator based on statistical characterization. Gigascience 2017 gix010. http://doi.org/10.1093/gigascience/gix010

[2] Karel Břinda, Chen Yang. NanoSim-H (Version 1.1.0.4). Zenodo. http://doi.org/10.5281/zenodo.1341249