nuMetaSim (non-uniform Metagenomic sequencing Simulation system)

Introduction

nuMetaSim is a sequencing data simulation system designed for comprehensive metegenome data simulation by Illumina platform, including nearly all possible features such as sequencing error, quality score, coverage bias, contamination source like human and unknown reads in realistic human microbiome samples.

nuMetaSim is developed with Python. Both single-end and pair-end data generation are supported using nuMetaSim.py and nuMetaSim_pe.py, respectively. Other related tools are also provided in scripts in the nuMetaSim/tools/ folder.

The author suggests running nuMetaSim in a widely-used Python distribution – Anaconda, because most of nuMetaSim’s module dependencies are satisfied in Anaconda environment. The only required missing third-party package that you need to install manually is intervaltree. It can be downloaded at https://pypi.python.org/pypi/intervaltree.

Usage

Main usage with examples

Note:nuMetaSim_pe.py usage is nearly the same with nuMetaSim.py. It only has one more parameter -g, "gap_length", which controls the insertion length in pair-end simulated data, compared with nuMetaSim.py. So here only nuMetaSim.py will be introduced.

Generate fasta data (no sequencing error allowed)

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastaFile2Write_prefix -n read_number -L read_length
  • --index: An index file with four columns containing meta information including NCBI accession number, genome length, genome name and the absolute storage path of the reference genome. This file is generated by microbe_index.py in nuMetaSim/tools/index/.
  • --propor: A tab-delimited relative abundance table of each component microbe genome with two columns including genome name extracted by microbe_index.py and abundance value. The sum of all the abundance values should be 1, otherwise program will not work and print out error message.
  • -F: The prefix name of the simulated sample to be generated.
  • -n: Total read number
  • -L: Fixed read length.

Generate fastq data with sequencing error based on a quality score profile

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastqFile2Write_prefix -n read_number -L read_length --fa_fq_flag 1 --qual quality_score_profile
  • --fa_fq_flag: A 0-1 flag to control whether to generate fasta format data or fastq format data. 0 means fasta, 1 means fastq.
  • --qual: Quality score profile extracted by extract_quality_score_distribution.py.

Generate fastq data with sequencing error and a specific sequencing error rate

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastqFile2Write_prefix -n read_number -L read_length --fa_fq_flag 1 --qual quality_score_profile --err_adj_flag 1 --err_rate 0.01
  • --err_adj_flag: A 0-1 flag to control whether to adjust the base error probability derived from quality scores. 1 means adjustment will be operated, 0 means the opposite.
  • --err_rate: Total base error rate, only works when --err_adj_flag is set to be 1.

Generate fastq data with user-input read distribution table

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastqFile2Write_prefix -n read_number -L read_length --fa_fq_flag 1 --qual quality_score_profile --dis_flag 1 --distri read_distribution_table
  • --dis_flag: A 0-1 flag to control whether to use the built-in distribution or the user-input distribution table. 0 means built-in distribution will be used, 1 means user-input distribution table will be used.
  • --distri: A user-input read distribution table representing the coverage bias.

Generate fastq data with human reads and unknown reads using even relative abundance

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastqFile2Write_prefix -n read_number -L read_length --fa_fq_flag 1 --qual quality_score_profile --h_flag 1 --h_ref_dir human_reference_folder_path --h_ratio 0.1 --unk_flag 1 --unk_ref_dir unknown_reference_folder_path unk_ratio 0.1
  • --h_flag: A 0-1 flag to control whether to add human reads to the data or not. 0 means do not add human reads, 1 means add human reads.
  • --h_ref_dir: The path of the folder containing all the human reference sequences. No other sequences except human reference sequences are allowed to be included in this folder.
  • --h_ratio: Human contamination ratio.
  • --unk_flag: A 0-1 flag to control whether to add unknown reads to the data or not. 0 means do not add unknown reads, 1 means add unknown reads.
  • --unk_ref_dir: The path of the folder containing all the unknown reference sequences. No other sequences except unknown reference sequences are allowed to be included in this folder.
  • --unk_ratio: Unknown contamination ratio.

Generate fastq data with human reads and unknown reads using user-defined relative abundance

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastqFile2Write_prefix -n read_number -L read_length --fa_fq_flag 1 --qual quality_score_profile --h_flag 1 --h_ref_dir human_reference_folder_path --h_ratio 0.1 --h_index human_reference_index --h_propor human_proportion_table --unk_flag 1 --unk_ref_dir unknown_reference_folder_path unk_ratio 0.1 --unk_index unknown_reference_index --unk_propor unknown_proportion_table
  • --h_index: Human reference sequence index, similar with microbe index, but only contains three columns, which are sequence name, sequence length and storage path.
  • --h_propor: A tab-delimited proportion table of all the human reference sequences defined by the user.
  • --unk_index: Unknown reference sequence index, similar with microbe index, but only contains three columns, which are sequence name, sequence length and storage path.
  • --unk_propor: A tab-delimited proportion table of all the unknown reference sequences defined by the user.

Other parameters

  • -v: A 0-1 flag to control whether to use a variable random seed in each run. 0 means fixed random seed, 1 means variable random seed.
  • -r: A fixed random seed set by the user. This parameter can ensure the reproducibility of the generated sample.
  • -b: Bin size. A bin is a sliding widow scanning the input reference genomes without overlap. It is the basic local genomic unit used to generate reads.
  • -g: Gap length of the pair-end data. This parameter is defined in nuMetaSim_pe.py.

Tool scripts

tools/index

microbe_index.py
python microbe_index.py -d reference_genome_folder
  • -d: The path of the folder containing all the component microbe reference genomes

The output file is named as index, storing in reference_genome_folder/../genome_index/.

humanOrUnknown_index.py
python humanOrUnknown_index.py -d reference_sequence_folder
  • -d: The path of the folder containing all the human (or unknown) reference sequences.

The output file is named as noisy_data_index, storing in reference_sequence_folder/../genome_index/.

tools/read distribution

Direct mapping
cal_genome_distribution.py
python cal_genome_distribution.py --fq_path fastq_sample_path --geo_path reference_genome_folder -o output_folder -b bin_size
  • --fq_path: The path of the input fastq format sample.
  • --geo_path: The path of the folder containing all the reference genomes to be mapped to.
  • -o: The path of the output folder for saving the generated read distribution profile
  • -b: The window size for mapping real sequencing reads.

The output file is named as distribution.txt, storing in output_folder.

GC-content-based coverage bias
cal_GC_coverage_relation.py
python cal_GC_coverage_relation.py --fq_path single_genome_sequencing_samples --geo_path corresponding_reference_genomes -o output_folder -b bin_size
  • --fq_path: The path of the folder containing all the single-genome sequencing samples.
  • --geo_path: The path of the folder containing all the corresponding reference genomes of the single-genome sequencing samples.
  • -o: The path of the output folder for saving the generated GC-content-based coverage bias file.
  • -b: The window size for mapping single-genome sequencing samples to their corresponding reference genomes.

The output file is named as GC_coverage.txt, storing in output_folder.

polyfit_GC_coverage.py
python polyfit_GC_coverage.py -f GC_content_based_coverage_bias -p polyfit_order
  • -f: The path of the GC-content-based coverage bias file generated by cal_GC_coverage_relation.py.
  • -p: The polyfit order for fitting the "GC-content-normalized read count" relationship. The recommendatory order is 2.

The output fitting coefficient file is named as polyfit_coefficient.txt, storing in output_folder. A fitting curve named as "GC_coverage.pdf" is generated in output_folder too.

cal_genome_distribution_by_GC.py
python cal_genome_distribution_by_GC.py -f GC_fitting_coefficient --geo_path reference_genome_folder -b bin_size
  • -f: The path of the fitting coefficient file generated by polyfit_GC_coverage.py.
  • --geo_path: The path of the folder containing all the reference genomes using for read distribution calculation.
  • -b: The window size for calculating the reads proportion to be generated in a local genomic region.

The output file is named as distribution.txt, storing in the same path where the fitting coefficient file is saved.

tools/unknown reference

genome_shuffle.py
python genome_shuffle.py -d reference_genome_folder
  • -d: The path of the folder containing all the genome files to be shuffled.
cal_genome_variation.py
python cal_genome_variation.py --ref reference_genome --qry query_genome -d output_folder
  • --ref: The path of the reference genome to be aligned to.
  • --qry: The path of the query genome used for two genomes alignment.
  • -d: The path of the folder for saving the variation distribution between the reference genome and query genome. The variation file contains 6 parts: 1.relative proportion of substitution/deletion/insertion; 2.relative proportion of different substitution possibilities; 3.substitution length distribution 4.deletion length distribution 5.relative proportion of different base insertion possibilities 6.insertion length distribution

The output file is named as genome_variation.txt, storing in output_folder.

gen_genome_strain.py
python gen_genome_strain.py --geo_va genome_variation_file --geo genome_file -o output_folder -m running_mode_flag -t variation_rates -n genome_number --args normal_distribution_arguments -v variable_random_seed_flag -r random_seed
  • --geo_va: The path of the genome variation file generated by cal_genome_variation.py.
  • --geo: The path of the genome file used for "relative strain" generation.
  • -o: The path of the folder for saving generated "relative strain" genome files.
  • -m: A 0-1 Runing mode flag. 0 means the user should provide a list of variation rates. The format is like "0.1,0.2,0.3,0.4,0.5". 1 means the program will use normal distribution to generate variation rates.
  • -t: Variation rates input by the user. The generated "relative strain" genome numbers is determined by the number of input variation rates. This parameter only works when -m is set to 0.
  • -n: "Relative strain" genome numbers to be generated, only works when -m is set to 1.
  • --args: Normal distribution arguments, including mean and variation, only works when -m is set to 1. The input format is "0.1,0.1".
  • -v: A 0-1 flag to control whether to use a variable random seed in each run. 0 means fixed random seed, 1 means variable random seed.
  • -r: A fixed random seed set by the user.

tools/others

extract_quality_score_distribution.py
python extract_quality_score_distribution.py -i real_sequencing_sample -o output_folder
  • -i: The path of the real sequencing sample used for quality score profile extraction.
  • -o: The path of the folder for saving the extracted quality score profile.

The output file is named as quality_score.txt, storing in output_folder.

change_read_header.py

This tool script is needed when the benchmarked software adopted by the researcher needs a distinct header in each read's header information line.

python change_read_header.py -i input_sample_file -o output_file_name -l fasta_fastq_flag
  • -i: The path of the input simulated sample generated by nuMetaSim.py or nuMetaSim_pe.py.
  • -o: The path of the folder for saving the header-changed sample file.
  • -l: A 0-1 flag representing the input file format. 0 means the input is fasta format, 1 means the input is fastq format.