Snakemake pipeline for simulating shotgun metagenomic samples

Primary language: Python. License: MIT.

Metagenomic Sequence Simulator (MeSS)

The Metagenomic Sequence Simulator (MeSS) is a Snakemake pipeline, implemented using Snaketool, for simulating Illumina, Oxford Nanopore (ONT) and Pacific Biosciences (PacBio) shotgun metagenomic samples.

🔍 Overview

MeSS takes NCBI taxa or local genome assemblies as input and generates either long (PacBio or ONT) or short (Illumina) reads. In addition to reads, MeSS optionally generates BAM alignment files and taxonomic and sequence abundance profiles in CAMI format.

%%{init: {'theme':'forest'}}%%
flowchart LR
input["samples.tsv
or
samples/*.tsv"] --> taxons

subgraph genome_download["genome download"]
dlchoice{download ?}
taxons["taxa or
accessions"] --> dlchoice
dlchoice -->|yes| assembly_finder
dlchoice -->|no| fasta
assembly_finder --> fasta
end
style genome_download color:#15161a

input --> distchoice
subgraph community_design["`**community design**`"]
distchoice{draw distribution ?}
distchoice -->|yes| dist["distribution
(lognormal, even)"]
dist --> abundances
distchoice -->|no| reads
distchoice -->|no| bases
distchoice -->|no| abundances
depth["coverage depth"]
reads --> depth
bases --> depth
abundances["abundances
(sequence, taxonomic)"] --> depth
end
style community_design color:#15161a

fasta --> simulator
depth --> simulator

simulator["read simulator
(art_illumina, pbsim3...)"]
simulator --> bam
simulator --> fastq
simulator --> CAMI-profile

%% subgraph color fills
classDef red fill:#faeaea,color:#fff,stroke:#333;
classDef blue fill:#eaecfa,color:#fff,stroke:#333;
class genome_download blue

class community_design red
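The "draw distribution" branch of the flowchart can be pictured with a small sketch. This is not MeSS's own code: the function name `draw_abundances`, its parameters (`mu`, `sigma`, `seed`) and their default values are assumptions made for illustration. It draws one weight per taxon, either lognormal or even, and normalises the weights so they sum to 1.

```python
import random


def draw_abundances(taxa, dist="lognormal", mu=1.0, sigma=1.0, seed=42):
    """Return {taxon: relative abundance}, with abundances summing to 1.

    Hypothetical helper: parameter names and defaults are illustrative
    assumptions, not values taken from MeSS.
    """
    rng = random.Random(seed)  # seeded so repeated calls give the same draw
    if dist == "even":
        weights = [1.0] * len(taxa)
    elif dist == "lognormal":
        weights = [rng.lognormvariate(mu, sigma) for _ in taxa]
    else:
        raise ValueError(f"unknown distribution: {dist!r}")
    total = sum(weights)
    return {taxon: weight / total for taxon, weight in zip(taxa, weights)}


if __name__ == "__main__":
    # Taxon IDs borrowed from the samples.tsv example in this README.
    for taxon, abundance in draw_abundances([487, 727, 729]).items():
        print(taxon, round(abundance, 3))
```

With `dist="even"` every taxon gets the same share, which corresponds to the "even" option named in the diagram.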

📚 Documentation

More details can be found in the documentation

⚡ Quick start

⚙️ Installation

  • Conda
conda create -n mess mess
  • Docker
docker pull ghcr.io/metagenlab/mess:latest
  • From source
git clone https://github.com/metagenlab/MeSS.git
pip install -e MeSS

📄 Usage

➡️ Input

Let's simulate two metagenomic samples with the following taxa and read counts in samples.tsv:

sample taxon reads
sample1 487 174840
sample1 727 90679
sample1 729 13129
sample2 28132 147863
sample2 199 147545
sample2 729 131300
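If you prefer to generate this table programmatically, a minimal sketch follows. It assumes the file is tab-separated with the header shown above (the .tsv extension suggests this, but check the documentation). The helper names `write_samples` and `reads_per_sample` are made up for this example.

```python
import csv
from collections import defaultdict

# Rows copied from the table above: (sample, taxon, reads).
ROWS = [
    ("sample1", 487, 174840),
    ("sample1", 727, 90679),
    ("sample1", 729, 13129),
    ("sample2", 28132, 147863),
    ("sample2", 199, 147545),
    ("sample2", 729, 131300),
]


def write_samples(path, rows=ROWS):
    """Write a tab-separated samples table with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample", "taxon", "reads"])
        writer.writerows(rows)


def reads_per_sample(path):
    """Sum the reads column per sample as a quick sanity check."""
    totals = defaultdict(int)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            totals[row["sample"]] += int(row["reads"])
    return dict(totals)


if __name__ == "__main__":
    write_samples("samples.tsv")
    print(reads_per_sample("samples.tsv"))
```

The per-sample totals give the number of reads requested for each simulated sample.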

🚀 Command

mess run -i samples.tsv

Important

Apptainer is the default and recommended dependency deployment method for maximum reproducibility!

If you would rather use conda, specify --sdm conda.
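When several runs are launched from a script, it can help to assemble the command line in one place. The sketch below only uses options that appear in this README (`-i`, `-o`, `--tech`, `--sdm`). The wrapper functions `build_mess_command` and `run_mess` are hypothetical and not part of MeSS.

```python
import shlex
import subprocess


def build_mess_command(samples, outdir=None, tech=None, sdm=None):
    """Build the argument list for `mess run`; unset options are left out."""
    cmd = ["mess", "run", "-i", str(samples)]
    if outdir is not None:
        cmd += ["-o", str(outdir)]
    if tech is not None:
        cmd += ["--tech", tech]
    if sdm is not None:
        cmd += ["--sdm", sdm]
    return cmd


def run_mess(samples, **options):
    """Print and execute the command; requires `mess` to be on the PATH."""
    cmd = build_mess_command(samples, **options)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_mess(...) to actually execute it.
    print(shlex.join(build_mess_command("samples.tsv", sdm="conda")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.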

🗂️ Outputs

  • Downloaded genomes in mess_out/assembly_finder/download
┣ 📂GCF_000144405.1
┃ ┗ 📜GCF_000144405.1_ASM14440v1_genomic.fna.gz
┣ 📂GCF_001298465.1
┃ ┗ 📜GCF_001298465.1_ASM129846v1_genomic.fna.gz
┣ 📂GCF_016127215.1
┃ ┗ 📜GCF_016127215.1_ASM1612721v1_genomic.fna.gz
┣ 📂GCF_020736045.1
┃ ┗ 📜GCF_020736045.1_ASM2073604v1_genomic.fna.gz
┣ 📂GCF_022869645.1
 ┗ 📜GCF_022869645.1_ASM2286964v1_genomic.fna.gz
  • Simulated reads in mess_out/fastq
┣ 📜sample1_R1.fq.gz
┣ 📜sample1_R2.fq.gz
┣ 📜sample2_R1.fq.gz
┗ 📜sample2_R2.fq.gz
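To check that the simulated files hold the expected number of reads without extra tools, a short standard-library sketch is given here. It assumes standard four-line FASTQ records and the `*.fq.gz` naming shown above. The function names are illustrative.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def summarize(fastq_dir):
    """Map each *.fq.gz file name in a directory to its read count."""
    return {
        path.name: count_fastq_reads(path)
        for path in sorted(Path(fastq_dir).glob("*.fq.gz"))
    }


if __name__ == "__main__":
    for name, n_reads in summarize("mess_out/fastq").items():
        print(name, n_reads)
```

For paired-end output, the R1 and R2 files of a sample should report the same count.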

Tip

By default, mess outputs paired Illumina reads with the Hiseq25k error profile. Other outputs and error profiles are described in the documentation.

📊 Resource usage

Using samples.tsv, mess runs in under 2 min while using around 1.8 GB of physical RAM.

task_id hash native_id name status exit submit duration realtime %cpu peak_rss peak_vmem rchar wchar
1 fe/03c2bc 62286 MESS (1) COMPLETED 0 2024-09-04 12:41:15.820 1m 50s 1m 50s 111.5% 1.8 GB 9 GB 3.5 GB 2.4 GB
1 ff/0d03b1 73355 MESS (1) COMPLETED 0 2024-09-04 12:55:12.903 1m 52s 1m 52s 112.6% 1.7 GB 8.8 GB 3.5 GB 2.4 GB
1 07/d352bf 83576 MESS (1) COMPLETED 0 2024-09-04 12:57:30.600 1m 50s 1m 50s 113.2% 1.7 GB 8.9 GB 3.5 GB 2.4 GB

Note

Resource usage was measured three times with one CPU (using nextflow, excluding dependency deployment time).

More details in the resource usage documentation

🔥 Features

Using phage.tsv

sample taxon cov_sim
phage 347329 200

🧬 Multi sequencing technology

  • Illumina
mess run -i phage.tsv --tech illumina -o mess_out/illumina
seqkit stats --all -T -b mess_out/illumina/fastq/*
file num_seqs sum_len avg_len N50 Q20(%) Q30(%) AvgQual
phage_R1.fq.gz 44000 6600000 150.0 150 98.01 91.67 27.81
phage_R2.fq.gz 44000 6600000 150.0 150 97.31 89.65 26.52
  • Nanopore
mess run -i phage.tsv --tech nanopore -o mess_out/nanopore
seqkit stats --all -T -b mess_out/nanopore/fastq/*
file num_seqs sum_len avg_len N50 Q20(%) Q30(%) AvgQual
phage.fq.gz 1486 13203006 8884.9 12329 73.99 62.65 13.60
  • PacBio HiFi
mess run -i phage.tsv -o mess_out/pacbio --tech pacbio --error hifi
seqkit stats --all -T -b mess_out/pacbio/fastq/*
file num_seqs sum_len avg_len N50 Q20(%) Q30(%) AvgQual
phage.fq.gz 1430 12588621 8803.2 12666 99.92 99.78 40.51
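The seqkit numbers can be related back to the requested depth (cov_sim = 200) with some simple arithmetic: total bases ≈ coverage × genome length. The sketch below assumes that cov_sim is a mean coverage depth. Its genome length of 66,000 bp is back-calculated from the Illumina table (2 × 6,600,000 bases / 200), not taken from the assembly itself.

```python
def reads_needed(coverage, genome_length, read_length):
    """Number of reads of a fixed length needed to reach a mean coverage."""
    return round(coverage * genome_length / read_length)


def mean_coverage(sum_len, genome_length):
    """Mean coverage implied by the total number of sequenced bases."""
    return sum_len / genome_length


# Assumption: genome length inferred from the Illumina run, not measured.
GENOME_LENGTH = 66_000

if __name__ == "__main__":
    total_reads = reads_needed(200, GENOME_LENGTH, 150)
    print("150 bp reads for 200x:", total_reads)    # 88000
    print("read pairs:", total_reads // 2)          # 44000, as in R1 and R2
    # sum_len values copied from the Nanopore and PacBio HiFi tables.
    print("nanopore coverage:", round(mean_coverage(13_203_006, GENOME_LENGTH), 1))
    print("pacbio coverage:", round(mean_coverage(12_588_621, GENOME_LENGTH), 1))
```

Under this assumption, the 44,000 read pairs reported for Illumina match a 200× target exactly. Long-read totals land near the target rather than on it because read lengths vary.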

Note

We use pbsim3 to simulate multi-pass CLR reads which are converted to HiFi reads with ccs.

PacBio HiFi read simulations usually take longer than simulations with other error profiles.

⭕ Circular assemblies

Inspired by readSimulator's approach, mess can shuffle genome start points to get circular genome assemblies.

Warning

All contigs in the fasta will be circularised

  • Linear (default, --rotate 1)
mess run -i phage.tsv -o mess_out/linear

  • Circular (--rotate 3)
mess run -i phage.tsv --rotate 3 -o mess_out/circular

Note

Assembled using unicycler, visualized using bandage
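The idea of shuffling genome start points can be illustrated as follows. This is a conceptual sketch, not MeSS's implementation: the helper names are invented, and the parameter `n` is illustrative and not claimed to correspond to the value passed to --rotate. Rotating a circular sequence moves its linearisation point, so reads simulated from several rotations also span the original start/end junction.

```python
import random


def rotate(sequence, start):
    """Return the sequence re-linearised so that it begins at `start`."""
    if not sequence:
        return sequence
    start %= len(sequence)
    return sequence[start:] + sequence[:start]


def random_rotations(sequence, n, seed=0):
    """Return n copies of the sequence, each rotated to a random start point."""
    rng = random.Random(seed)  # seeded for reproducible start points
    return [rotate(sequence, rng.randrange(len(sequence))) for _ in range(n)]


if __name__ == "__main__":
    print(rotate("ACGTAC", 2))  # GTACAC
    for copy in random_rotations("AACCGGTT", 3):
        print(copy)
```

Each rotated copy contains the same bases in the same circular order, which is why the approach only makes sense for contigs that really are circular.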

🆘 Help

All command-line options are described here

mess -h