Metagenomics simulation workflow

Scripts to generate and process simulated metagenomics reads with MetaSim and Cope

The simulation is setup as follows:

Read are simulated in fasta format (no quality scores);
The median insert length of paired end (PE) reads is 550nt;
All simulated reads will be PE;
Reads are simulated error-free;
16 processors are used in the simulation step.

Read simulation: Metasim

Metasim software is used to simulate metagenomics read datasets. Use of Metasim is aided by Metasim.sh script, which is intended to simulate nine datasets in a single run. These datasets represent the same community (described in the *.mprf file, see below) simulated with combination of three different read lengths (75bp, 150bp, 300bp) and three different dataset sizes (2 x 50,000, 2 x 500,000, 2 x 5,000,000).
Use of this script requires that the Metasim software is installed and the related species database is properly set up. Full documentation of Metasim is available at http://ab.inf.uni-tuebingen.de/data/software/metasim/download/V0_9_5/manual.pdf.

Metasim.sh requires metasim_simulation_table_1, a table file with rows like this:

dataset_0001	50000			75      		gut_viral_metasim_1000_90sim_allEqual.mprf  
dataset_0002    500000  		75      		gut_viral_metasim_1000_90sim_allEqual.mprf  
dataset_0003    5000000			75      		gut_viral_metasim_1000_90sim_allEqual.mprf  
dataset_0004    50000			150     		gut_viral_metasim_1000_90sim_allEqual.mprf 
<dataset_name>	<n_reads>		<read_length>	<abundance_file>

columns are described as follows:

<dataset name> is used for naming the output folder and the simulated read (compressed) file;
<n_reads> is the number of reads to be simulated. When PE reads are simulated, this number refers to fwd and rev datasets;
<read length> is the length of reads to be simulated;
<abundance_file>. Abundance file describing the relative abundance of each species in the dataset to be simulated. Following, some sample lines:

1						gi			197261329
1						gi			431811082
1						gi			418488446
1						gi			282598847
1						gi			308814351
1						gi			77864625
<relative_abundance>	<id_type>	<id>

columns are described as follows:

<relative abundance> is a relative number of arbitrary size reflecting the copy number of the genome sequences of this specific taxon in the total taxon set.
<id_type> corresponds to the column names of the database suitable to identify a genome sequence. This can be "taxid", "gi" or "name".
<id> is the corresponding value the for chosen key identifer.

#####Suggested use of Metasim.sh:
run it from within the Metasim software folder. You can specify the working directory as an argument of the script, i.e. Metasim.sh <working_dir> (please provide full path).

Read processing (1): SplitMetasimPairs.sh

The next step in the simulation is the read connection, i.e. merging of overlapping ends of reads in a PE to generate a longer overlapped read. The software Cope is used to this aim. However, Cope needs two separated files for forward and reverse reads, whereas Metasim outputs simulated reads in interleaved format, e.g.

>r2.1 |SOURCES={GI=197261329,fw,29451-29526}|ERRORS={}|SOURCE_1="Pseudomonas phage LBL3" (2b16a6cd7f5bb301549640d6e87180794d8b66ec)
ACGAAATCAGAAAAGTGCTTCAGCTAAGATATGTAGCGCTGATTTCGAACGGCTCGATTGCATAT
ATCAATCGCA
>r2.2 |SOURCES={GI=197261329,bw,29732-29807}|ERRORS={}|SOURCE_1="Pseudomonas phage LBL3" (2b16a6cd7f5bb301549640d6e87180794d8b66ec)
AGCGGAGCCGTCCGGGGCCGTAGCGGACGCCGTTTCGACCGTTACCGGAAGACCGCTCGCGGCCG
GATCGACTCC
>r1.1 |SOURCES={GI=197261329,bw,32107-32182}|ERRORS={}|SOURCE_1="Pseudomonas phage LBL3" (2b16a6cd7f5bb301549640d6e87180794d8b66ec)
CAAACAGAGGGCGATTGTTGAATATGACGTTCGTGGACGCCCAGTCCCATCGCATGATTTGCGAG
CGATAAGAGC
>r1.2 |SOURCES={GI=197261329,fw,31847-31922}|ERRORS={}|SOURCE_1="Pseudomonas phage LBL3" (2b16a6cd7f5bb301549640d6e87180794d8b66ec)
CTCTGAAGCTGATGGGTCGATGGCGTTCTATTCTCGCCAGGGGTCCGCTGGTCCTACCCAGGATA
TCCTGTTCAG

The script SplitMetasimPairs.py splits a file dataset_000?-reads.fa.gz generated by the script Metasim.sh in two files: dataset_000?-reads.2.fa and dataset_000?-reads.2.fa. Please note that SplitMetasimPairs.py requires the Python library BioPython. The SplitMetasimPairs.sh script is provided to automate the execution of the SplitMetasimPairs.py on the group of nine datasets generated by a single run of Metasim.sh.

#####Suggested use of SplitMetasimPairs.sh: Put both SplitMetasimPairs.sh and SplitMetasimPairs.py in your PATH and run them in the same folder of the dataset_000? folders. Please note that SplitMetasimPairs.sh is configured to run nine array jobs.

Read processing (2): Cope

Cope.sh iterates COPE on the group of nine datasets generated by a single run of Metasim.sh. for each dataset, it outputs three files:

dataset_000?-reads.R1.fa;
dataset_000?-reads.R2.fa;
dataset_000?-reads.merged.fa
which can then be used with PathoScope.

#####Suggested use of Cope.sh: Put Cope.sh in your PATH and run them in the same folder of the dataset_000? folders.

Hints and tips

...