NanoSim: A Python repository from sailfish009

NanoSim is a fast and scalable read simulator that captures the technology-specific features of ONT data, and allows for adjustments upon improvement of nanopore sequencing technology.

The second version of NanoSim (v2.0.0) uses minimap2 as the default aligner to align long genomic ONT reads to the reference genome. This makes the alignment step much faster and reduces the overall runtime of NanoSim. We also use HTSeq, a Python package, to read SAM alignment files efficiently.

NanoSim (v2.5) is able to simulate ONT transcriptome reads (cDNA / direct RNA) as well as genomic reads. It also models features of the library preparation protocols used, including intron retention (IR) events in cDNA and directRNA reads. Further, it has stand-alone modes that profile transcript expression patterns and detect IR events in custom datasets. Additionally, we improved the homopolymer simulation option, which simulates homopolymer expansion and contraction events with respect to the chosen basecaller. A multiprocessing option allows faster runtime for large library simulation.

NanoSim (v2.6) is able to simulate ONT reads in fastq format. The base quality information is simulated with truncated log-normal distributions, learnt separately for match bases, mismatch bases, insertion bases, deletion bases, and unaligned bases, for each basecaller and read type.
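
As a rough illustration of the quality model described above, the sketch below draws base qualities from a truncated log-normal distribution by rejection sampling and encodes them as a FASTQ quality string. The function names, the per-category (mu, sigma) values and the quality bounds are placeholders chosen for this example; they are not code or parameters taken from NanoSim or its trained models.

```python
import random

# Hypothetical (mu, sigma) per base category; real values would come from
# profiles learnt in the characterization stage.
EXAMPLE_PARAMS = {
    "match": (2.8, 0.4),
    "mismatch": (2.2, 0.5),
    "insertion": (2.1, 0.5),
}


def sample_truncated_lognormal(rng, mu, sigma, low, high):
    """Draw one value from LogNormal(mu, sigma) restricted to [low, high].

    Uses simple rejection sampling: redraw until the value is in range.
    """
    while True:
        value = rng.lognormvariate(mu, sigma)
        if low <= value <= high:
            return value


def qualities_to_fastq(qualities, offset=33):
    """Encode numeric base qualities as a Phred+33 quality string."""
    return "".join(chr(int(round(q)) + offset) for q in qualities)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, sigma = EXAMPLE_PARAMS["match"]
    quals = [sample_truncated_lognormal(rng, mu, sigma, 1, 40) for _ in range(20)]
    print(qualities_to_fastq(quals))
```

Rejection sampling keeps the shape of the distribution inside the bounds, at the cost of extra draws when the bounds cut off much of the probability mass.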

We provide 6 pre-trained models in the latest release! Users can choose to download the whole package or only the scripts without models to speed up the download.

If you use NanoSim to simulate genomic reads, please cite the following manuscript:

NanoSim
Chen Yang, Justin Chu, René L Warren, and Inanç Birol; NanoSim: nanopore sequence read simulator based on statistical characterization. GigaScience, Volume 6, Issue 4, April 2017, gix010, https://doi.org/10.1093/gigascience/gix010

If you use NanoSim to simulate transcriptomic reads, please cite the following manuscript:

Trans-NanoSim
Saber Hafezqorani, Chen Yang, Theodora Lo, Ka Ming Nip, René L. Warren, and Inanç Birol; Trans-NanoSim characterizes and simulates nanopore RNA-sequencing data. GigaScience, Volume 9, Issue 6, June 2020, giaa061, https://doi.org/10.1093/gigascience/giaa061

Dependencies

Python packages:

six
numpy (Tested with version 1.10.1 or above)
HTSeq (Tested with version 0.9.1)
Pysam (Tested with version 0.13)
scipy (Tested with version 1.0.0)
scikit-learn (Tested with version 0.20.0)

minimap2 (Tested with version 2.10 and 2.17)
LAST (Tested with version 581 and 916)

Usage

NanoSim is implemented using Python for error model fitting, read length analysis, and simulation. The first step of NanoSim is read characterization, which provides a comprehensive alignment-based analysis, and generates a set of read profiles serving as the input to the next step, the simulation stage. The simulation tool uses the model built in the previous step to produce in silico reads for a given reference genome/transcriptome. It also outputs a list of introduced errors, consisting of the position on each read, error type and reference bases.

1. Characterization stage

The characterization stage runs in four modes: genome, transcriptome, quantify, and detect_ir. The general usage is shown below, followed by an explanation of each mode.

Characterization step general usage:

usage: read_analysis.py [-h] [-v]
                        {genome,transcriptome,quantify,detect_ir} ...

Read characterization step
-----------------------------------------------------------
Given raw ONT reads, reference genome and/or transcriptome,
learn read features and output error profiles

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

subcommands:
  
  There are four modes in read_analysis.
  For detailed usage of each mode:
      read_analysis.py mode -h
  -------------------------------------------------------

  {genome,transcriptome,quantify,detect_ir}
    genome              Run the simulator on genome mode
    transcriptome       Run the simulator on transcriptome mode
    quantify            Quantify expression profile of transcripts
    detect_ir           Detect Intron Retention events using the alignment
                        file

genome mode
To simulate ONT genomic reads, run the characterization stage in "genome" mode with the following options. It takes a reference genome and a training read set in FASTA or FASTQ format as input and aligns the reads to the reference using minimap2 (default) or the LAST aligner. Users can also provide their own alignment file in SAM or MAF format. If a SAM file is provided, make sure that it contains the MD tag. The output is a set of profiles that serve as input to the simulation stage.
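
Before passing your own SAM file with -ga or -ta, it can help to confirm that the MD tag is really there. The following is a minimal sketch, not part of NanoSim: the function name and the max_records limit are made up for this example, and it only inspects text SAM records. If the tag is missing, minimap2 can emit it with its --MD option, and samtools calmd can add it to an existing alignment.

```python
def sam_lines_have_md(lines, max_records=1000):
    """Return True if every mapped record inspected carries an MD:Z: tag.

    `lines` is any iterable of text SAM lines. At most `max_records` mapped
    records are checked. Returns False if no mapped record was seen.
    """
    checked = 0
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # unmapped reads carry no MD tag
        # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
        if not any(field.startswith("MD:Z:") for field in fields[11:]):
            return False
        checked += 1
        if checked >= max_records:
            break
    return checked > 0
```

A typical call would be `with open("my_alignment.sam") as handle: ok = sam_lines_have_md(handle)`, where the file name is a placeholder.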

genome mode usage:

usage: read_analysis.py genome [-h] -i READ [-rg REF_G] [-a {minimap2,LAST}]
                               [-ga G_ALNM] [-o OUTPUT] [--no_model_fit]
                               [-t NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -i READ, --read READ  Input read for training
  -rg REF_G, --ref_g REF_G
                        Reference genome, not required if genome alignment
                        file is provided
  -a {minimap2,LAST}, --aligner {minimap2,LAST}
                        The aligner to be used, minimap2 or LAST (Default =
                        minimap2)
  -ga G_ALNM, --g_alnm G_ALNM
                        Genome alignment file in sam or maf format (optional)
  -o OUTPUT, --output OUTPUT
                        The location and prefix of outputting profiles
                        (Default = training)
  --no_model_fit        Disable model fitting step
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for alignment and model fitting
                        (Default = 1)

transcriptome mode
To simulate ONT transcriptome reads (cDNA / directRNA), run the characterization stage in "transcriptome" mode with the following options. It takes a reference transcriptome, a reference genome, and a training read set in FASTA or FASTQ format as input and aligns the reads to the reference using minimap2 (default) or the LAST aligner. Users can also provide their own alignment files in SAM or MAF format. If a SAM file is provided, make sure that it contains the MD tag. The output is a set of profiles that serve as input to the simulation stage.

transcriptome mode usage:

usage: read_analysis.py transcriptome [-h] -i READ [-rg REF_G] -rt REF_T
                                      [-annot ANNOTATION] [-a {minimap2,LAST}]
                                      [-ga G_ALNM] [-ta T_ALNM] [-o OUTPUT]
                                      [--no_model_fit] [--no_intron_retention]
                                      [-t NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -i READ, --read READ  Input read for training
  -rg REF_G, --ref_g REF_G
                        Reference genome
  -rt REF_T, --ref_t REF_T
                        Reference Transcriptome
  -annot ANNOTATION, --annotation ANNOTATION
                        Annotation file in ensemble GTF/GFF formats, required
                        for intron retention detection
  -a {minimap2,LAST}, --aligner {minimap2,LAST}
                        The aligner to be used: minimap2 or LAST (Default =
                        minimap2)
  -ga G_ALNM, --g_alnm G_ALNM
                        Genome alignment file in sam or maf format (optional)
  -ta T_ALNM, --t_alnm T_ALNM
                        Transcriptome alignment file in sam or maf format
                        (optional)
  -o OUTPUT, --output OUTPUT
                        The location and prefix of outputting profiles
                        (Default = training)
  --no_model_fit        Disable model fitting step
  --no_intron_retention
                        Disable Intron Retention analysis
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for alignment and model fitting
                        (Default = 1)

quantify mode and detect_ir mode
The "transcriptome" mode of NanoSim is able to model features of the library preparation protocols used, including intron retention (IR) events in cDNA and directRNA reads. Further, it optionally profiles transcript expression patterns. However, if you only want to detect intron retention events or quantify transcript expression patterns without running the other analyses in the characterization stage, you can use the two modes introduced for this purpose: "quantify" and "detect_ir". Details are as follows:

quantify mode usage:

usage: read_analysis.py quantify [-h] -i READ -rt REF_T [-o OUTPUT]
                                 [-t NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -i READ, --read READ  Input reads for quantification
  -rt REF_T, --ref_t REF_T
                        Reference Transcriptome
  -o OUTPUT, --output OUTPUT
                        The location and prefix of outputting profile (Default
                        = expression)
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for alignment (Default = 1)

detect_ir mode usage:

usage: read_analysis.py detect_ir [-h] -annot ANNOTATION [-i READ] [-rg REF_G]
                                  [-rt REF_T] [-a {minimap2,LAST}] [-o OUTPUT]
                                  [-ga G_ALNM] [-ta T_ALNM] [-t NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -annot ANNOTATION, --annotation ANNOTATION
                        Annotation file in ensemble GTF/GFF formats
  -i READ, --read READ  Input read for training, not required if alignment
                        files are provided
  -rg REF_G, --ref_g REF_G
                        Reference genome, not required if genome alignment
                        file is provided
                        
  -rt REF_T, --ref_t REF_T
                        Reference Transcriptome, not required if transcriptome
                        alignment file is provided
  -a {minimap2,LAST}, --aligner {minimap2,LAST}
                        The aligner to be used: minimap2 or LAST (Default =
                        minimap2)
  -o OUTPUT, --output OUTPUT
                        The output name and location
  -ga G_ALNM, --g_alnm G_ALNM
                        Genome alignment file in sam or maf format (optional)
  -ta T_ALNM, --t_alnm T_ALNM
                        Transcriptome alignment file in sam or maf format
                        (optional)
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for alignment (Default = 1)

* NOTICE: The -ga/-ta options allow users to provide their own alignment files. Make sure that the names of the query sequences are the same as they appear in the FASTA files. Some FASTA headers contain spaces, and most aligners take only the part of the header before the first whitespace/tab as the query name. However, the truncated headers may not be unique when using the output of poretools. We suggest that users pre-process the FASTA files by concatenating all elements in the header with '_' before alignment and feed the processed FASTA file to NanoSim as input.
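
The header pre-processing suggested in the notice can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes a plain-text FASTA file in which header lines start with '>'.

```python
def normalize_fasta_headers(lines):
    """Yield FASTA lines with all header elements joined by '_'.

    Header lines (starting with '>') have their whitespace-separated
    elements concatenated via '_'; sequence lines pass through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield ">" + "_".join(line[1:].split()) + "\n"
        else:
            yield line


if __name__ == "__main__":
    example = [">read1 ch=5 start=10\n", "ACGTACGT\n"]
    print("".join(normalize_fasta_headers(example)), end="")
```

To process a file, read it line by line and write the yielded lines to a new FASTA file, then use that new file both for alignment and as NanoSim input so that the names match.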

Downloads

Some ONT read profiles are provided ready to use. With these profiles, users can run the simulation tool directly.

For releases before v2.2.0, we provide profiles trained on E. coli and S. cerevisiae datasets. The flowcell chemistries are R7.3 and R9, and the reads were basecalled by Metrichor. They can be downloaded from our ftp site.

For release v2.5.0 and onwards, we provide profiles trained on H. sapiens NA12878 gDNA, cDNA 1D2, and directRNA datasets, and a Mus musculus cDNA dataset. Flowcell chemistry is R9.4 for all datasets. NA12878 gDNA and directRNA were basecalled by Guppy 3.1.5; NA12878 cDNA 1D2 was basecalled by Albacore 2.3.1; mouse cDNA was basecalled by Metrichor. These models are available in the pre-trained_models folder.

2. Simulation stage

The simulation stage takes a reference genome/transcriptome and read profiles as input and outputs simulated reads in FASTA format (or FASTQ format when --fastq is given). It runs in two modes, "genome" and "transcriptome"; use whichever fits your needs.

Simulation stage general usage:

usage: simulator.py [-h] [-v] {genome,transcriptome} ...

Simulation step
-----------------------------------------------------------
Given error profiles, reference genome and/or transcriptome,
simulate ONT DNA or RNA reads

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

subcommands:
  
  There are two modes in read_analysis.
  For detailed usage of each mode:
      simulator.py mode -h
  -------------------------------------------------------

  {genome,transcriptome}
                        You may run the simulator on transcriptome or genome
                        mode.
    genome              Run the simulator on genome mode
    transcriptome       Run the simulator on transcriptome mode

genome mode
To simulate ONT genomic reads, run the simulation stage in "genome" mode with the following options.

genome mode usage:

usage: simulator.py genome [-h] -rg REF_G [-c MODEL_PREFIX] [-o OUTPUT]
                           [-n NUMBER] [-max MAX_LEN] [-min MIN_LEN]
                           [-med MEDIAN_LEN] [-sd SD_LEN] [--seed SEED]
                           [-k KMERBIAS] [-b {albacore,guppy,guppy-flipflop}]
                           [-s STRANDNESS] [-dna_type {linear,circular}]
                           [--perfect] [--fastq] [-t NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -rg REF_G, --ref_g REF_G
                        Input reference genome
  -c MODEL_PREFIX, --model_prefix MODEL_PREFIX
                        Location and prefix of error profiles generated from
                        characterization step (Default = training)
  -o OUTPUT, --output OUTPUT
                        Output location and prefix for simulated reads
                        (Default = simulated)
  -n NUMBER, --number NUMBER
                        Number of reads to be simulated (Default = 20000)
  -max MAX_LEN, --max_len MAX_LEN
                        The maximum length for simulated reads (Default =
                        Infinity)
  -min MIN_LEN, --min_len MIN_LEN
                        The minimum length for simulated reads (Default = 50)
  -med MEDIAN_LEN, --median_len MEDIAN_LEN
                        The median read length (Default = None)
  -sd SD_LEN, --sd_len SD_LEN
                        The standard deviation of read length in log scale
                        (Default = None)
  --seed SEED           Manually seeds the pseudo-random number generator
  -k KMERBIAS, --KmerBias KMERBIAS
                        Minimum homopolymer length to simulate homopolymer
                        contraction and expansion events in
  -b {albacore,guppy,guppy-flipflop}, --basecaller {albacore,guppy,guppy-flipflop}
                        Simulate homopolymers with respect to chosen
                        basecaller: albacore, guppy, or guppy-flipflop
  -s STRANDNESS, --strandness STRANDNESS
                        Proportion of sense sequences. Overrides the value
                        profiled in characterization stage. Should be between
                        0 and 1
  -dna_type {linear,circular}
                        Specify the dna type: circular OR linear (Default =
                        linear)
  --perfect             Ignore error profiles and simulate perfect reads
  --fastq               Output fastq files instead of fasta files
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for simulation (Default = 1)

transcriptome mode
To simulate ONT transcriptome reads, run the simulation stage in "transcriptome" mode with the following options.

transcriptome mode usage:

usage: simulator.py transcriptome [-h] -rt REF_T [-rg REF_G] -e EXP
                                  [-c MODEL_PREFIX] [-o OUTPUT] [-n NUMBER]
                                  [-max MAX_LEN] [-min MIN_LEN] [--seed SEED]
                                  [-k KMERBIAS] [-b {albacore,guppy}]
                                  [-r {dRNA,cDNA_1D,cDNA_1D2}] [-s STRANDNESS]
                                  [--no_model_ir] [--perfect] [--polya POLYA]
                                  [--fastq] [-t NUM_THREADS] [--uracil]

optional arguments:
  -h, --help            show this help message and exit
  -rt REF_T, --ref_t REF_T
                        Input reference transcriptome
  -rg REF_G, --ref_g REF_G
                        Input reference genome, required if intron retention
                        simulation is on
  -e EXP, --exp EXP     Expression profile in the specified format as
                        described in README
  -c MODEL_PREFIX, --model_prefix MODEL_PREFIX
                        Location and prefix of error profiles generated from
                        characterization step (Default = training)
  -o OUTPUT, --output OUTPUT
                        Output location and prefix for simulated reads
                        (Default = simulated)
  -n NUMBER, --number NUMBER
                        Number of reads to be simulated (Default = 20000)
  -max MAX_LEN, --max_len MAX_LEN
                        The maximum length for simulated reads (Default =
                        Infinity)
  -min MIN_LEN, --min_len MIN_LEN
                        The minimum length for simulated reads (Default = 50)
  --seed SEED           Manually seeds the pseudo-random number generator
  -k KMERBIAS, --KmerBias KMERBIAS
                        Enable k-mer bias simulation
  -b {albacore,guppy}, --basecaller {albacore,guppy}
                        Simulate homopolymers with respect to chosen  
                        basecaller: albacore or guppy
  -r {dRNA,cDNA_1D,cDNA_1D2}, --read_type {dRNA,cDNA_1D,cDNA_1D2}
                        Simulate homopolymers with respect to chosen read
                        type: dRNA, cDNA_1D or cDNA_1D2
  -s STRANDNESS, --strandness STRANDNESS
                        Proportion of sense sequences. Overrides the value
                        profiled in characterization stage. Should be between
                        0 and 1
  --no_model_ir         Disable simulation of intron retention events
  --perfect             Ignore profiles and simulate perfect reads
  --polya POLYA         Simulate polyA tails for given list of transcripts
  --fastq               Output fastq files instead of fasta files
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for simulation (Default = 1)
  --uracil              Converts the thymine (T) bases to uracil (U) in the
                        output fasta format

* Notice: the use of max_len and min_len in genome mode will affect the read length distributions. If the range between max_len and min_len is too small, the program will run more slowly.

* Notice: the transcript names in the expression tsv file and in the polyadenylated transcript list have to be consistent with the ones in the reference transcripts; otherwise the tool won't recognize them and won't know where to find them to extract reads for simulation.
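
A quick consistency check along the lines of the notice above might look like the sketch below. It is not part of NanoSim and rests on assumptions that should be checked against your own files: that the transcript name is the first whitespace-delimited token of each FASTA header, that the expression file is tab-separated with one header row and the transcript name in the first column, and that the helper names are free to choose.

```python
def fasta_ids(lines):
    """Collect the first whitespace-delimited token of every FASTA header."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def first_column(lines, has_header=True):
    """Return the first tab-separated field of each non-empty line."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if has_header:
        rows = rows[1:]
    return [row.split("\t")[0] for row in rows]


def missing_transcripts(names, reference_ids):
    """Names that do not occur in the reference transcriptome, sorted."""
    return sorted(set(names) - set(reference_ids))


if __name__ == "__main__":
    reference = fasta_ids([">tx1 some description\n", "ACGT\n", ">tx2\n", "GGCC\n"])
    expression = first_column(["transcript\tcount\n", "tx1\t12\n", "tx3\t4\n"])
    print(missing_transcripts(expression, reference))
```

For a polyadenylated transcript list with one name per line and no header row, `first_column(lines, has_header=False)` gives the names to compare.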

Example runs:
1 To simulate reads from the E. coli genome, choose -dna_type circular because it is a circular genome
./simulator.py genome -dna_type circular -rg Ecoli_ref.fasta -c ecoli

2 To simulate only perfect reads, i.e. no SNPs or indels, so that only the read length distribution is simulated
./simulator.py genome -dna_type circular -rg Ecoli_ref.fasta -c ecoli --perfect

3 To simulate reads from the S. cerevisiae genome with kmer bias, choose -dna_type linear because it is a linear genome
./simulator.py genome -dna_type linear -rg yeast_ref.fasta -c yeast --KmerBias

4 To simulate reads from the human genome with read lengths limited to between 1000 nt and 10000 nt
./simulator.py genome -dna_type linear -rg human_ref.fasta -c human -max 10000 -min 1000

5 To simulate reads from the human genome with a controlled median read length and standard deviation; NanoSim will use a log-normal distribution to approximate the read length distribution
./simulator.py genome -dna_type linear -rg human_ref.fasta -c human -med 5000 -sd 1.05

6 To simulate ten thousand cDNA/directRNA reads from the mouse reference transcriptome
./simulator.py transcriptome -rt Mus_musculus.GRCm38.cdna.all.fa -rg Mus_musculus.GRCm38.dna.primary_assembly.fa -c mouse_cdna -e abundance.tsv -n 10000

7 To simulate five thousand cDNA/directRNA reads from the mouse reference transcriptome without modeling intron retention
./simulator.py transcriptome -rt Mus_musculus.GRCm38.cdna.all.fa -c mouse_cdna -e abundance.tsv -n 5000 --no_model_ir

8 To simulate two thousand cDNA/directRNA reads from the human reference transcriptome with polyA tails, mimicking homopolymer bias (starting from homopolymer length >= 6), with reads in fastq format
./simulator.py transcriptome -rt Homo_sapiens.GRCh38.cdna.all.fa -c Homo_sapiens_model -e abundance.tsv -rg Homo_sapiens.GRCh38.dna.primary.assembly.fa --polya transcripts_with_polya_tails --fastq -k 6 --basecaller guppy -r dRNA -n 2000
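
To give a feel for what the -med and -sd options in genome mode describe, here is a small sketch that draws read lengths from a log-normal distribution with a given median and a standard deviation in log scale, rejecting lengths outside the min/max limits. It is an illustration under assumptions, not NanoSim's implementation: the function name is hypothetical, and rejection sampling is simply one way such limits could be enforced. It does show why a narrow range between the limits costs time, since most draws are then discarded.

```python
import math
import random


def sample_read_lengths(n, median, sd_log, min_len=50, max_len=float("inf"), seed=None):
    """Draw n read lengths from a log-normal distribution.

    The median of LogNormal(mu, sigma) is exp(mu), so mu = ln(median);
    sd_log is sigma, the standard deviation in log scale. Lengths outside
    [min_len, max_len] are rejected and redrawn.
    """
    rng = random.Random(seed)
    mu = math.log(median)
    lengths = []
    while len(lengths) < n:
        length = int(rng.lognormvariate(mu, sd_log))
        if min_len <= length <= max_len:
            lengths.append(length)
    return lengths


if __name__ == "__main__":
    lengths = sample_read_lengths(10000, median=5000, sd_log=1.05, seed=1)
    print("median length:", sorted(lengths)[len(lengths) // 2])
```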

Explanation of output files

1. Characterization stage

1.1 Characterization stage (genome)

training_aligned_region.pkl Kernel density function of aligned regions on aligned reads
training_aligned_reads.pkl Kernel density function of aligned reads
training_ht_length.pkl Kernel density function of unaligned regions on aligned reads
training_besthit.maf/sam The best alignment of each read based on length
training_match.hist/training_mis.hist/training_del.hist/training_ins.hist Histogram of match, mismatch, and indels
training_first_match.hist Histogram of the first match length of each alignment
training_error_markov_model Markov model of error types
training_ht_ratio.pkl Kernel density function of head/(head + tail) on aligned reads
training.maf/sam The alignment output
training_match_markov_model Markov model of the length of matches (stretches of correct base calls)
training_model_profile Fitted model for errors
training_processed.maf A re-formatted MAF file for user-provided alignment file
training_unaligned_length.pkl Kernel density function of unaligned reads
training_error_rate.tsv Mismatch rate, insertion rate and deletion rate
training_strandness_rate Strandness rate in input reads.

1.2 Characterization stage (transcriptome)

training_aligned_region.pkl Kernel density function of aligned regions on aligned reads
training_aligned_region_2d.pkl Two-dimensional kernel density function of aligned regions over the length of reference transcript they aligned
training_aligned_reads.pkl Kernel density function of aligned reads
training_ht_length.pkl Kernel density function of unaligned regions on aligned reads
training_besthit.maf/sam The best alignment of each read based on length
training_match.hist/training_mis.hist/training_del.hist/training_ins.hist Histogram of match, mismatch, and indels
training_first_match.hist Histogram of the first match length of each alignment
training_error_markov_model Markov model of error types
training_ht_ratio.pkl Kernel density function of head/(head + tail) on aligned reads
training.maf/sam The alignment output
training_match_markov_model Markov model of the length of matches (stretches of correct base calls)
training_model_profile Fitted model for errors
training_processed.maf A re-formatted MAF file for user-provided alignment file
training_unaligned_length.pkl Kernel density function of unaligned reads
training_error_rate.tsv Mismatch rate, insertion rate and deletion rate
training_strandness_rate Strandness rate in input reads.
training_addedintron_final.gff3 gff3 file format containing the intron coordinate information
training_IR_info List of transcripts in which there is a retained intron based on IR modeling step
training_IR_markov_model Markov model of Intron Retention events

2. Simulation stage

simulated_reads.fasta FASTA file of simulated reads. Each read has "unaligned", "aligned", or "perfect" in the header, indicating its error rate. "unaligned" means that the read has an error rate over 90% and cannot be aligned. "aligned" reads have the same error rate as the training reads. "perfect" reads have no errors.

To explain the information in the header, we have two examples:

>ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0
All information before the first _ is chromosome information. 468529 is the start position, and unaligned indicates that the read should not align to the reference. The first 0 is the sequence index. F represents the forward strand. 0_3236_0 means that the sequence extracted from the reference is 3236 bases long, with no head or tail region.
>ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2
This is an aligned read coming from chromosome XI at position 115406. 16565 is the sequence index. R represents the reverse complement strand. 92_12710_2 means that this read has a 92-base head region (which cannot be aligned), followed by a 12710-base middle region and then a 2-base tail region.

The information in the header can help users to locate the read easily.
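
The header layout explained above can be unpacked with a short helper such as the one sketched here. The function and field names are chosen for this example and are not part of NanoSim. The sketch splits from the right so that underscores inside the reference name do not matter, and it only covers the genome-style headers shown in the two examples, not the additional Retained_intron information in transcriptome headers.

```python
from collections import namedtuple

SimulatedRead = namedtuple(
    "SimulatedRead", "reference start status index strand head middle tail"
)


def parse_read_header(header):
    """Split a simulated read header into its underscore-separated fields.

    The last seven fields are start, status, index, strand, head length,
    middle length and tail length; everything before them is the reference.
    """
    name = header.lstrip(">").split()[0]
    reference, start, status, index, strand, head, middle, tail = name.rsplit("_", 7)
    return SimulatedRead(
        reference, int(start), status, int(index), strand,
        int(head), int(middle), int(tail),
    )


if __name__ == "__main__":
    read = parse_read_header(
        ">ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2"
    )
    print(read.reference, read.start, read.status, read.strand)
    print("total length:", read.head + read.middle + read.tail)
```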

Specific to transcriptome simulation: for reads that include retained introns, the header contains additional information starting from Retained_intron, with each genomic interval separated by ;.

simulated_error_profile Contains information on all errors introduced into each read, including error type, position, original bases and current bases.

Acknowledgements

Sincere thanks to our labmates and all contributors and users of this tool.
