The provided simulated reads were used to evaluate the ClassificationByNumbers, AlignmentByNumbers, and AssemblyByNumbers tools. The tools are described in the manuscript “The utility of data transformation for alignment, de novo assembly and classification of short read virus sequences” (https://www.preprints.org/manuscript/201904.0014/v1), which has been submitted to the journal Viruses, special issue Virus Bioinformatics. The reference genome hxb2.fasta is used to both classify and align the simulated HXB2 reads, and Refs_for_Mixed_Viruses.fasta is used to both classify and align the simulated Mixed_Viruses reads.
CuReSim (1) was used to generate 16 single-virus datasets from the HIV-HXB2 genome (K03455.1), which is included in the file hxb2.fasta. Each dataset was generated with different levels and types of variation.
WGSIM (2) was used to generate 4 mixed-virus datasets from the Norovirus genome (KM198509.1), the Ebola virus genome (KM034562.1), and the Human respiratory syncytial virus genome (KP317934.1), which are included in the file Refs_for_Mixed_Viruses.fasta. Each dataset was generated with different levels and types of variation.
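Before classifying or aligning reads, it can help to confirm that a reference file contains the expected genomes. The following is a minimal sketch, not part of the published tools: the function names read_fasta_lengths, missing_accessions, and check_reference_file are hypothetical, and the sketch assumes that each FASTA header line contains the GenBank accession listed above, which may not match the headers in the distributed files.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record ID (first token after '>') to its sequence length."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)  # sequence may span several lines
    return lengths


def missing_accessions(lengths: Dict[str, int], expected: List[str]) -> List[str]:
    """Return the expected accessions that do not appear in any record ID.

    Assumption: the accession is a substring of the record ID.
    """
    return [acc for acc in expected if not any(acc in rid for rid in lengths)]


def check_reference_file(path: str, expected: List[str]) -> List[str]:
    """Read a FASTA file, print its records, and return any missing accessions."""
    with open(path) as handle:
        lengths = read_fasta_lengths(handle)
    for record_id, length in lengths.items():
        print(f"{record_id}\t{length} bp")
    return missing_accessions(lengths, expected)
```

For example, check_reference_file("Refs_for_Mixed_Viruses.fasta", ["KM198509.1", "KM034562.1", "KP317934.1"]) would return an empty list if all three genomes are present under those accessions, and check_reference_file("hxb2.fasta", ["K03455.1"]) performs the same check for the HXB2 reference.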
1. Caboche, S.; Audebert, C.; Lemoine, Y.; Hot, D. Comparison of mapping algorithms used in high-throughput sequencing: application to Ion Torrent data. BMC Genomics 2014, 15, 264.
2. https://github.com/lh3/wgsi