Metagenomic Sequence Simulator (MeSS)

The Metagenomic Sequence Simulator (MeSS) is a Snakemake pipeline, implemented using Snaketool, for simulating Illumina, Oxford Nanopore (ONT) and Pacific Biosciences (PacBio) shotgun metagenomic samples.

🔍 Overview

MeSS takes NCBI taxa or local genome assemblies as input and generates either long (PacBio or ONT) or short (Illumina) reads. In addition to reads, MeSS can optionally generate BAM alignment files and taxonomic and sequence abundances in CAMI format.

%%{init: {'theme':'forest'}}%%
flowchart LR
input["samples.tsv
or
samples/*.tsv"] --> taxons

subgraph genome_download["genome download"]
dlchoice{download ?}
taxons["taxa or
accessions"] --> dlchoice
dlchoice -->|yes| assembly_finder
dlchoice -->|no| fasta
assembly_finder --> fasta
end
style genome_download color:#15161a

input --> distchoice
subgraph community_design["`**community design**`"]
distchoice{draw distribution ?}
distchoice -->|yes| dist["distribution
(lognormal, even)"]
dist --> abundances
distchoice -->|no| reads
distchoice -->|no| bases
distchoice -->|no| abundances
depth["coverage depth"]
reads --> depth
bases --> depth
abundances["abundances
(sequence, taxonomic)"] --> depth
end
style community_design color:#15161a

fasta --> simulator
depth --> simulator

simulator["read simulator
(art_illumina, pbsim3...)"]
simulator --> bam
simulator --> fastq
simulator --> CAMI-profile

%% subgraph color fills
classDef red fill:#faeaea,color:#fff,stroke:#333;
classDef blue fill:#eaecfa,color:#fff,stroke:#333;
class genome_download blue

class community_design red
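
The community design step in the flowchart draws abundances from a lognormal or even distribution when none are supplied. The sketch below illustrates that idea only; it is not MeSS's implementation. The function name `draw_abundances` and the `mu`, `sigma` and `seed` parameters are hypothetical and do not correspond to documented MeSS options.

```python
import random


def draw_abundances(taxa, dist="lognormal", mu=1.0, sigma=0.5, seed=42):
    """Return relative abundances (summing to 1) for a list of taxa.

    Hypothetical helper: mu, sigma and seed are illustrative parameters,
    not MeSS options.
    """
    rng = random.Random(seed)
    if dist == "even":
        # Every taxon gets the same weight.
        weights = [1.0 for _ in taxa]
    elif dist == "lognormal":
        # One lognormal draw per taxon.
        weights = [rng.lognormvariate(mu, sigma) for _ in taxa]
    else:
        raise ValueError(f"unknown distribution: {dist}")
    total = sum(weights)
    # Normalise the weights so they can be read as fractions of the community.
    return {taxon: weight / total for taxon, weight in zip(taxa, weights)}


taxa = ["487", "727", "729"]
even = draw_abundances(taxa, dist="even")
lognormal = draw_abundances(taxa, dist="lognormal")
print(even)
print(lognormal)
```

With an even distribution every taxon gets the same fraction; with a lognormal draw a few taxa tend to dominate, which is closer to what real communities look like.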

📚 Documentation

More details can be found in the documentation

⚡ Quick start

⚙️ Installation

Conda (Miniforge)

conda create -n mess mess

Docker

docker pull ghcr.io/metagenlab/mess:latest

From source

git clone https://github.com/metagenlab/MeSS.git
pip install -e MeSS

📄 Usage

➡️ Input

Let's simulate two metagenomic samples with the following taxa and read counts in samples.tsv:

sample	taxon	reads
sample1	487	174840
sample1	727	90679
sample1	729	13129
sample2	28132	147863
sample2	199	147545
sample2	729	131300
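
As a quick sanity check before running the pipeline, the table above can be parsed to get the total number of reads requested per sample and the fraction each taxon contributes. This is a minimal sketch using only the Python standard library; the helper names `read_totals` and `read_fractions` are made up for this example and are not part of MeSS.

```python
import csv
import io
from collections import defaultdict

# Contents of samples.tsv from the example above (tab-separated).
SAMPLES_TSV = (
    "sample\ttaxon\treads\n"
    "sample1\t487\t174840\n"
    "sample1\t727\t90679\n"
    "sample1\t729\t13129\n"
    "sample2\t28132\t147863\n"
    "sample2\t199\t147545\n"
    "sample2\t729\t131300\n"
)


def read_totals(tsv_text):
    """Sum the requested read counts per sample."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        totals[row["sample"]] += int(row["reads"])
    return dict(totals)


def read_fractions(tsv_text):
    """Fraction of each sample's reads assigned to each taxon."""
    totals = read_totals(tsv_text)
    fractions = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = (row["sample"], row["taxon"])
        fractions[key] = int(row["reads"]) / totals[row["sample"]]
    return fractions


totals = read_totals(SAMPLES_TSV)
fractions = read_fractions(SAMPLES_TSV)
print(totals)  # {'sample1': 278648, 'sample2': 426708}
for (sample, taxon), fraction in fractions.items():
    print(f"{sample}\t{taxon}\t{fraction:.3f}")
```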

🚀 Command

mess run -i samples.tsv

Important

Apptainer is the default and recommended dependency deployment method for maximum reproducibility.

To use conda instead, specify --sdm conda.

🗂️ Outputs

Downloaded genomes in mess_out/assembly_finder/download

┣ 📂GCF_000144405.1
┃ ┗ 📜GCF_000144405.1_ASM14440v1_genomic.fna.gz
┣ 📂GCF_001298465.1
┃ ┗ 📜GCF_001298465.1_ASM129846v1_genomic.fna.gz
┣ 📂GCF_016127215.1
┃ ┗ 📜GCF_016127215.1_ASM1612721v1_genomic.fna.gz
┣ 📂GCF_020736045.1
┃ ┗ 📜GCF_020736045.1_ASM2073604v1_genomic.fna.gz
┗ 📂GCF_022869645.1
  ┗ 📜GCF_022869645.1_ASM2286964v1_genomic.fna.gz

Simulated reads in mess_out/fastq

┣ 📜sample1_R1.fq.gz
┣ 📜sample1_R2.fq.gz
┣ 📜sample2_R1.fq.gz
┗ 📜sample2_R2.fq.gz

Tip

By default, mess outputs paired-end Illumina reads with the HiSeq 2500 error profile. Other outputs and error profiles are described here and here.

📊 Resource usage

Using samples.tsv, mess runs in under 2 min while using around 1.8 GB of physical RAM.

task_id	hash	native_id	name	status	submit	duration	realtime	%cpu	peak_rss	peak_vmem	rchar	wchar
1	fe/03c2bc	62286	MESS (1)	COMPLETED	2024-09-04 12:41:15.820	1m 50s	1m 50s	111.5%	1.8 GB	9 GB	3.5 GB	2.4 GB
1	ff/0d03b1	73355	MESS (1)	COMPLETED	2024-09-04 12:55:12.903	1m 52s	1m 52s	112.6%	1.7 GB	8.8 GB	3.5 GB	2.4 GB
1	07/d352bf	83576	MESS (1)	COMPLETED	2024-09-04 12:57:30.600	1m 50s	1m 50s	113.2%	1.7 GB	8.9 GB	3.5 GB	2.4 GB

Note

Average resource usage was measured over three runs with one CPU (using Nextflow, excluding dependency deployment time).

More details in the resource usage documentation
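
To make the averages explicit, the three trace rows above can be reduced to a mean run time and a mean peak memory. This is a small sketch with the `duration` and `peak_rss` values copied from the table; the parsing helper `to_seconds` is written for this example and only handles the hour, minute and second units that appear in durations like these.

```python
import re

# (duration, peak_rss) pairs copied from the three trace rows above.
RUNS = [
    ("1m 50s", "1.8 GB"),
    ("1m 52s", "1.7 GB"),
    ("1m 50s", "1.7 GB"),
]


def to_seconds(duration):
    """Convert a duration such as '1m 50s' to seconds (h, m and s units only)."""
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)\s*([hms])", duration)
    return sum(int(number) * units[unit] for number, unit in parts)


def to_gigabytes(size):
    """Read the numeric part of a size such as '1.8 GB' (assumes GB)."""
    value, unit = size.split()
    if unit != "GB":
        raise ValueError(f"unexpected unit: {unit}")
    return float(value)


mean_seconds = sum(to_seconds(d) for d, _ in RUNS) / len(RUNS)
mean_peak_rss = sum(to_gigabytes(m) for _, m in RUNS) / len(RUNS)
print(f"mean duration: {mean_seconds:.1f} s")
print(f"mean peak RSS: {mean_peak_rss:.2f} GB")
```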

🔥 Features

Using phage.tsv

sample	taxon	cov_sim
phage	347329	200

🧬 Multi sequencing technology

Illumina

mess run -i phage.tsv --tech illumina -o mess_out/illumina
seqkit stats --all -T -b mess_out/illumina/fastq/*

file	num_seqs	sum_len	avg_len	N50	Q20(%)	Q30(%)	AvgQual
phage_R1.fq.gz	44000	6600000	150.0	150	98.01	91.67	27.81
phage_R2.fq.gz	44000	6600000	150.0	150	97.31	89.65	26.52

Nanopore

mess run -i phage.tsv --tech nanopore -o mess_out/nanopore
seqkit stats --all -T -b mess_out/nanopore/fastq/*

file	num_seqs	sum_len	avg_len	N50	Q20(%)	Q30(%)	AvgQual
phage.fq.gz	1486	13203006	8884.9	12329	73.99	62.65	13.60

PacBio HiFi

mess run -i phage.tsv -o mess_out/pacbio --tech pacbio --error hifi
seqkit stats --all -T -b mess_out/pacbio/fastq/*

file	num_seqs	sum_len	avg_len	N50	Q20(%)	Q30(%)	AvgQual
phage.fq.gz	1430	12588621	8803.2	12666	99.92	99.78	40.51

Note

We use pbsim3 to simulate multi-pass CLR reads, which are converted to HiFi reads with ccs.

PacBio HiFi read simulations usually take longer than simulations with other error profiles.
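
The seqkit tables above can be related back to the cov_sim value of 200 in phage.tsv. The sketch below assumes cov_sim is a mean depth of coverage, so that depth = total bases / genome length. The genome length is not stated in this README, so the script only back-calculates the length implied by each run; treat the results as a consistency check, not as reference values.

```python
# Requested coverage from phage.tsv.
COV_SIM = 200

# Total bases (sum_len) per technology, copied from the seqkit tables above.
# Illumina is the sum of the R1 and R2 files.
TOTAL_BASES = {
    "illumina": 6_600_000 + 6_600_000,
    "nanopore": 13_203_006,
    "pacbio_hifi": 12_588_621,
}


def implied_genome_length(total_bases, depth):
    """Genome length implied by depth = total_bases / genome_length."""
    return total_bases / depth


implied = {
    tech: implied_genome_length(bases, COV_SIM)
    for tech, bases in TOTAL_BASES.items()
}
for tech, length in implied.items():
    print(f"{tech}\t{length:.0f} bp")

# The Illumina total also follows from the read counts:
# 44000 pairs x 2 reads x 150 bp.
illumina_bases_from_reads = 44_000 * 2 * 150
print(illumina_bases_from_reads == TOTAL_BASES["illumina"])
```

All three runs point to a genome of roughly 63 to 66 kb, which is what one would expect if each technology was simulated to the same target depth.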

⭕ Circular assemblies

Inspired by readSimulator's approach, mess can shuffle genome start points to get circular genome assemblies.

Warning

All contigs in the fasta will be circularised.
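
The idea of shuffling start points can be sketched in a few lines: a circular sequence is cut at a random position and the two pieces are swapped, so that reads simulated from the rotated copies span the original start and end. This is an illustration only, not MeSS's code. How MeSS interprets the --rotate value is not spelled out here; the sketch treats it as the number of start points, which is an assumption, and the function names are hypothetical.

```python
import random


def rotate(sequence, start):
    """Return the circular sequence re-written to begin at position start."""
    if not sequence:
        return sequence
    start %= len(sequence)
    return sequence[start:] + sequence[:start]


def random_rotations(sequence, n, seed=0):
    """Return n copies of the sequence, each with a random start point.

    Assumption: n plays the role of the --rotate value.
    """
    rng = random.Random(seed)
    return [rotate(sequence, rng.randrange(len(sequence))) for _ in range(n)]


genome = "ACGTTGCA"
print(rotate(genome, 3))  # TTGCAACG
for copy in random_rotations(genome, 3):
    print(copy)
```

Each rotated copy contains exactly the same bases as the original, only the linear start coordinate changes.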

Linear (default, --rotate 1)

mess run -i phage.tsv -o mess_out/linear

Circular (--rotate 3)

mess run -i phage.tsv --rotate 3 -o mess_out/circular

Note

Assembled using Unicycler and visualized using Bandage.

🆘 Help

All command-line options are described here.