McDevol: A Python repository from yazhinia

McDevol

A fast and memory-efficient metagenome binning tool

Introduction

Metagenome binning rests on two observations: (i) contigs originating from the same genome have correlated abundance profiles across samples, and (ii) k-mer (tetramer) frequency is characteristic of a microbial genome and distinguishes genomes of different genera. Contigs from the same genome therefore also show correlated k-mer frequencies. Using correlations in read-count and k-mer-count profiles, contigs from the same genome can be identified in a metagenome assembly and binned into metagenome-assembled genomes (MAGs).
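As a rough illustration of the k-mer signal described above, the following sketch computes a normalized tetramer frequency vector for a contig and the Pearson correlation between two such vectors. It is not McDevol's implementation: the function names are hypothetical, reverse complements are not merged, and windows containing characters other than A, C, G, T are simply ignored.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 tetramers in a fixed order, so vectors from different contigs align.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_frequencies(sequence):
    """Return the normalized tetramer frequency vector of a DNA sequence."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only windows made purely of A, C, G, T contribute to the total.
    total = sum(counts[kmer] for kmer in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[kmer] / total for kmer in TETRAMERS]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)
```

The same `pearson` helper could be applied to per-sample read-count vectors to compare abundance profiles of two contigs.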

Algorithm

McDevol uses a novel Bayesian statistics-based distance measure on read counts and k-mer profiles to bin metagenomic contigs. The method has two steps: (i) initial agglomerative clustering using the Bayesian distance, and (ii) density-based clustering on summed read and k-mer count profiles, which merges clusters that likely belong to the same genome into components that form the final genomic bins. An outline of the algorithm is depicted below.

Installation

McDevol requires a 64-bit Linux system with the AVX2 instruction set. Check for a 64-bit system with uname -a | grep x86_64 and for AVX2 support with grep avx2 /proc/cpuinfo in a Linux terminal. In addition, cmake>=3.21 and gcc>=10.2 should be available.

  git clone https://github.com/yazhinia/McDevol.git --recurse-submodules
  cd McDevol
  pip install -r requirements.txt
  bash setup.sh
  export PATH=$PATH:<path to McDevol>
  mcdevol.py -i test -c test/contigs.fasta -o out # test run

McDevol is now ready to use.

On a Linux cluster, the cmake and gcc modules should be loaded beforehand. For easier installation and use, we recommend working in a virtual environment (created with either venv or conda).
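The AVX2 requirement can also be checked from Python before building. The sketch below is only a convenience, not part of McDevol: `has_avx2` is a hypothetical helper that scans text in the format of /proc/cpuinfo for the avx2 CPU flag.

```python
def has_avx2(cpuinfo_text):
    """Return True if a 'flags' line of /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # On x86 Linux, CPU features are listed space-separated after 'flags'.
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False
```

On a Linux machine it could be called as `has_avx2(open("/proc/cpuinfo").read())`.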

Advantages

(i) McDevol finds contigs belonging to the same genome using a novel distance measure, defined as the posterior probability that the count profiles of contigs are drawn from the same distribution.

(ii) It applies a simple agglomerative algorithm to obtain high-purity clusters, then merges clusters of the same genomic origin through density-based clustering to increase completeness. This approach is much simpler and faster than the iterative medoid clustering and expectation-maximization algorithms used by MetaBat2 and MaxBin2, respectively.

(iii) It does not rely on a set of single-copy marker genes to refine clusters, as other existing binners do; such refinement leads to overestimated completeness and purity during CheckM evaluation.

Together, these properties make McDevol very fast, memory-efficient and less dependent on external tools.
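To give a feel for point (i), the sketch below computes a posterior probability that two count profiles were drawn from the same distribution. It is a generic textbook construction, not McDevol's actual distance: it assumes multinomial counts with a symmetric Dirichlet prior (`alpha`, an illustrative choice) and equal prior weight on the "same" and "different" hypotheses. All names are hypothetical.

```python
from math import exp, lgamma


def log_marginal(counts, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood of a count vector.

    The multinomial coefficient is omitted because it cancels in the ratio below.
    """
    total_alpha = alpha * len(counts)
    result = lgamma(total_alpha) - lgamma(total_alpha + sum(counts))
    for c in counts:
        result += lgamma(alpha + c) - lgamma(alpha)
    return result


def posterior_same_distribution(x, y, alpha=1.0):
    """Posterior probability that count profiles x and y share one distribution."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    pooled = [a + b for a, b in zip(x, y)]
    # Log Bayes factor: one shared distribution vs. two independent ones.
    log_bf = log_marginal(pooled, alpha) - log_marginal(x, alpha) - log_marginal(y, alpha)
    # Numerically stable logistic function (equal prior odds assumed).
    if log_bf >= 0:
        return 1.0 / (1.0 + exp(-log_bf))
    e = exp(log_bf)
    return e / (1.0 + e)
```

Two contigs with proportional read counts across samples, such as `[50, 5, 5]` and `[100, 10, 10]`, receive a posterior above 0.5 under this model, whereas `[50, 5, 5]` and `[5, 5, 50]` receive a posterior close to zero.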

Command line

mcdevol.py -i bamfiles -c contig.fasta

-i, --input directory in which all bamfiles are present

-c, --contigs a fasta file for contig sequences (single-sample or co-assembly)

Note: input bamfiles should be unsorted, i.e., the default output of aligners, in which alignments are grouped by read name. At present, McDevol supports bamfiles from bwa-mem and bowtie2.
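One way to spot a coordinate-sorted bamfile before running McDevol is to inspect the SO tag of the @HD header line, which the SAM specification defines with the values unknown, unsorted, queryname and coordinate. The sketch below parses header text such as that printed by samtools view -H; the helper names are hypothetical and this check is not part of McDevol.

```python
def sort_order(header_text):
    """Return the SO value of the @HD header line, or 'unknown' if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"


def is_coordinate_sorted(header_text):
    """True if the header declares coordinate sorting."""
    return sort_order(header_text) == "coordinate"
```

A header that lacks an @HD line or an SO tag is reported as "unknown", so a negative result does not prove that the file is unsorted.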

Additional options

-l, --minlength minimum length of contigs to be considered for binning [default 1000 bp]

-o, --output the name of output file [default, 'mcdevol']

-d, --outdir output directory [default, working directory]

--fasta output fasta file for each bin
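The effect of a minimum-length cutoff such as --minlength can be pictured with the following sketch, which reads FASTA records and keeps only those at least as long as a threshold. It is an illustration, not McDevol's code: the function names are hypothetical, and whether McDevol treats the cutoff as inclusive is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep records whose sequence length is >= min_length (assumed inclusive)."""
    return [(h, s) for h, s in records if len(s) >= min_length]
```

For a file on disk, `read_fasta(open("contig.fasta"))` would stream the records line by line.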

Help

mcdevol.py -h or mcdevol.py

Recommended workflow

We recommend single-sample assembly to obtain contigs, as it minimizes ambiguous assemblies of strain genomes. Map the reads of each sample to the concatenated set of contigs and run McDevol. Bins from single-sample assembly input are redundant because the same genomic region can be represented by multiple contigs assembled independently from different samples. To remove this redundancy, we recommend the following post-binning redundancy reduction steps.

Metagenome binning of contigs from sample-wise assembly

When contigs are assembled separately for each sample, perform post-binning assembly and clustering on every bin produced by McDevol. This requires plass (https://github.com/soedinglab/plass) and MMseqs2 (https://github.com/soedinglab/MMseqs2) to be installed separately. The parameters specified below for plass and mmseqs2-linclust are essential: we recommend them in order to keep long contigs after assembly (we have observed that megahit and plass nuclassemble chop contigs of nearly genome size into smaller pieces due to their assembly strategies) and to cluster only contigs of >=99.0% sequence identity.

1) post-binning assembly

  plass nuclassemble bin<0..N>.fasta bin<0..N>_assembled.fasta tmp --max-seq-len 10000000 --keep-target false --contig-output-mode 0 --min-seq-id 0.990 --chop-cycle false

2) sequence clustering

  mmseqs easy-linclust bin<0..N>.fasta output<0..N> tmp --min-seq-id 0.970 --min-aln-len 200 --cluster-mode 2 --shuffle 0 -c 0.99 --cov-mode 1 --max-seq-len 10000000
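When many bins are produced, the two commands above can be generated per bin from a small script. The sketch below only builds the argument lists, copying the flags verbatim from the commands shown; the function names and the default tmp directory argument are hypothetical, and nothing is executed until the lists are passed to subprocess.run.

```python
import shlex
from pathlib import Path


def plass_command(bin_fasta, tmp_dir="tmp"):
    """Argument list for post-binning assembly of one bin with plass."""
    stem = Path(bin_fasta).with_suffix("")  # e.g. bin0.fasta -> bin0
    return [
        "plass", "nuclassemble", bin_fasta, f"{stem}_assembled.fasta", tmp_dir,
        "--max-seq-len", "10000000", "--keep-target", "false",
        "--contig-output-mode", "0", "--min-seq-id", "0.990",
        "--chop-cycle", "false",
    ]


def linclust_command(bin_fasta, output_prefix, tmp_dir="tmp"):
    """Argument list for sequence clustering of one bin with mmseqs."""
    return [
        "mmseqs", "easy-linclust", bin_fasta, output_prefix, tmp_dir,
        "--min-seq-id", "0.970", "--min-aln-len", "200",
        "--cluster-mode", "2", "--shuffle", "0", "-c", "0.99",
        "--cov-mode", "1", "--max-seq-len", "10000000",
    ]


def preview(command):
    """Return the command as a single shell-quoted string for inspection."""
    return shlex.join(command)
```

A command built this way can be run with `subprocess.run(plass_command("bin0.fasta"), check=True)`, provided plass is on the PATH.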