The simulation is setup as follows:
- Read are simulated in fasta format (no quality scores);
- The median insert length of paired end (PE) reads is 550nt;
- All simulated reads will be PE;
- Reads are simulated error-free;
- 16 processors are used in the simulation step.
Metasim software is used to simulate metagenomics read datasets. Use of Metasim is aided by Metasim.sh script, which is intended to simulate nine datasets in a single run. These datasets represent the same community (described in the *.mprf file, see below) simulated with combination of three different read lengths (75bp, 150bp, 300bp) and three different dataset sizes (2 x 50,000, 2 x 500,000, 2 x 5,000,000).
Use of this script requires that the Metasim software is installed and the related species database is properly set up. Full documentation of Metasim is available at http://ab.inf.uni-tuebingen.de/data/software/metasim/download/V0_9_5/manual.pdf.
Metasim.sh requires metasim_simulation_table_1, a table file with rows like this:
dataset_0001 50000 75 gut_viral_metasim_1000_90sim_allEqual.mprf
dataset_0002 500000 75 gut_viral_metasim_1000_90sim_allEqual.mprf
dataset_0003 5000000 75 gut_viral_metasim_1000_90sim_allEqual.mprf
dataset_0004 50000 150 gut_viral_metasim_1000_90sim_allEqual.mprf
<dataset_name> <n_reads> <read_length> <abundance_file>
columns are described as follows:
<dataset name>
is used for naming the output folder and the simulated read (compressed) file;<n_reads>
is the number of reads to be simulated. When PE reads are simulated, this number refers to fwd and rev datasets;<read length>
is the length of reads to be simulated;<abundance_file>
. Abundance file describing the relative abundance of each species in the dataset to be simulated. Following, some sample lines:
1 gi 197261329
1 gi 431811082
1 gi 418488446
1 gi 282598847
1 gi 308814351
1 gi 77864625
<relative_abundance> <id_type> <id>
columns are described as follows:
<relative abundance>
is a relative number of arbitrary size reflecting the copy number of the genome sequences of this specific taxon in the total taxon set.<id_type>
corresponds to the column names of the database suitable to identify a genome sequence. This can be "taxid", "gi" or "name".<id>
is the corresponding value the for chosen key identifer.
#####Suggested use of Metasim.sh:
run it from within the Metasim software folder. You can specify the working directory as an argument of the script, i.e. Metasim.sh <working_dir> (please provide full path).
The next step in the simulation is the read connection, i.e. merging of overlapping ends of reads in a PE to generate a longer overlapped read. The software Cope is used to this aim. However, Cope needs two separated files for forward and reverse reads, whereas Metasim outputs simulated reads in interleaved format, e.g.
>r2.1 |SOURCES={GI=197261329,fw,29451-29526}|ERRORS={}|SOURCE_1="Pseudomonas phage LBL3" (2b16a6cd7f5bb301549640d6e87180794d8b66ec)
ACGAAATCAGAAAAGTGCTTCAGCTAAGATATGTAGCGCTGATTTCGAACGGCTCGATTGCATAT
ATCAATCGCA
>r2.2 |SOURCES={GI=197261329,bw,29732-29807}|ERRORS={}|SOURCE_1="Pseudomonas phage LBL3" (2b16a6cd7f5bb301549640d6e87180794d8b66ec)
AGCGGAGCCGTCCGGGGCCGTAGCGGACGCCGTTTCGACCGTTACCGGAAGACCGCTCGCGGCCG
GATCGACTCC
>r1.1 |SOURCES={GI=197261329,bw,32107-32182}|ERRORS={}|SOURCE_1="Pseudomonas phage LBL3" (2b16a6cd7f5bb301549640d6e87180794d8b66ec)
CAAACAGAGGGCGATTGTTGAATATGACGTTCGTGGACGCCCAGTCCCATCGCATGATTTGCGAG
CGATAAGAGC
>r1.2 |SOURCES={GI=197261329,fw,31847-31922}|ERRORS={}|SOURCE_1="Pseudomonas phage LBL3" (2b16a6cd7f5bb301549640d6e87180794d8b66ec)
CTCTGAAGCTGATGGGTCGATGGCGTTCTATTCTCGCCAGGGGTCCGCTGGTCCTACCCAGGATA
TCCTGTTCAG
The script SplitMetasimPairs.py splits a file dataset_000?-reads.fa.gz generated by the script Metasim.sh in two files: dataset_000?-reads.2.fa and dataset_000?-reads.2.fa. Please note that SplitMetasimPairs.py requires the Python library BioPython. The SplitMetasimPairs.sh script is provided to automate the execution of the SplitMetasimPairs.py on the group of nine datasets generated by a single run of Metasim.sh.
#####Suggested use of SplitMetasimPairs.sh: Put both SplitMetasimPairs.sh and SplitMetasimPairs.py in your PATH and run them in the same folder of the dataset_000? folders. Please note that SplitMetasimPairs.sh is configured to run nine array jobs.
Cope.sh iterates COPE on the group of nine datasets generated by a single run of Metasim.sh. for each dataset, it outputs three files:
- dataset_000?-reads.R1.fa;
- dataset_000?-reads.R2.fa;
- dataset_000?-reads.merged.fa
which can then be used with PathoScope.
#####Suggested use of Cope.sh: Put Cope.sh in your PATH and run them in the same folder of the dataset_000? folders.
...