A fast and memory-efficient metagenome binning tool
Metagenome binning rests on two observations: (i) contigs originating from the same genome have correlated abundance profiles across samples, and (ii) k-mer (tetramer) frequency is a characteristic of microbial genomes and distinguishes genomes of different genera, so contigs from the same genome also show correlated k-mer frequencies. Using the correlation in read-count and k-mer-count profiles, contigs from the same genome can be identified in the metagenome assembly and binned into metagenome-assembled genomes (MAGs).
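To make observation (ii) concrete, here is a minimal Python sketch of a tetramer frequency profile and a Pearson correlation between two such profiles. It is an illustration only, not McDevol's implementation: the function names are hypothetical, windows containing ambiguous bases are skipped, and reverse-complement k-mers are not merged (an assumption made here for brevity).

```python
import math
from itertools import product

K = 4
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 256 tetramers


def tetramer_profile(sequence):
    """Return relative tetramer frequencies, ordered as in ALL_KMERS."""
    counts = dict.fromkeys(ALL_KMERS, 0)
    seq = sequence.upper()
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in counts:  # skip windows with N or other ambiguity codes
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in ALL_KMERS]


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Contigs from the same genome are expected to give a higher correlation between their profiles than contigs from unrelated genomes, although short contigs yield noisy profiles.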
McDevol uses a novel Bayesian statistics-based distance measure on read-count and k-mer profiles to bin metagenomic contigs. The method has two steps: (i) initial agglomerative clustering using the Bayesian distance, and (ii) density-based clustering on summed read and k-mer count profiles, which merges clusters that likely belong to the same genome into components that form the final genomic bins. An outline of the algorithm is depicted below.
McDevol requires a 64-bit Linux system with the AVX2 instruction set. Check the architecture by running uname -a | grep x86_64 in a Linux terminal; AVX2 support can be confirmed with grep avx2 /proc/cpuinfo. In addition, cmake>=3.21 and gcc>=10.2 should be available.
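The same checks can be scripted. The sketch below is a hypothetical helper, not part of McDevol: it reports whether the machine is x86_64, whether /proc/cpuinfo lists the avx2 flag, and whether cmake and gcc are on the PATH. It does not verify the cmake and gcc version numbers.

```python
import platform
import shutil
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Collect CPU feature flags from the text of /proc/cpuinfo."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def check_prerequisites():
    """Return a dict of basic platform checks (tool versions are not checked)."""
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    return {
        "x86_64": platform.machine() == "x86_64",
        "avx2": "avx2" in cpu_flags(text),
        "cmake": shutil.which("cmake") is not None,
        "gcc": shutil.which("gcc") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```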
git clone https://github.com/yazhinia/McDevol.git --recurse-submodules
cd McDevol
pip install -r requirements.txt
bash setup.sh
export PATH=$PATH:<path to McDevol>
mcdevol.py -i test -c test/contigs.fasta -o out # test run
McDevol is now ready to use.
On a Linux cluster, the cmake and gcc modules should be pre-loaded. For easy installation and use, we recommend working in a virtual environment (created with either venv or conda).
(i) McDevol finds contigs belonging to the same genome using a novel distance measure, defined as the posterior probability that the count profiles of contigs are drawn from the same distribution.
(ii) It applies a simple agglomerative algorithm to obtain high-purity clusters, then merges clusters of the same genomic origin through density-based clustering to increase completeness. This approach is much simpler and faster than the iterative medoid clustering and expectation-maximization algorithms used by MetaBat2 and MaxBin2, respectively.
(iii) It does not rely on a set of single-copy marker genes to refine clusters, as other existing binners do; that practice leads to overestimated completeness and purity during CheckM evaluation.
Together, these properties make the tool very fast, memory-efficient and less dependent on external tools.
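The idea in (i), a posterior probability that two count profiles are drawn from the same distribution, can be illustrated with a textbook Dirichlet-multinomial model. The sketch below is an assumption-laden stand-in, not McDevol's actual distance: the symmetric Dirichlet prior (alpha), the equal prior odds (prior_same) and the function names are all chosen here for illustration.

```python
import math


def log_beta(values):
    """Log of the multivariate Beta function B(values)."""
    return sum(math.lgamma(v) for v in values) - math.lgamma(sum(values))


def same_source_posterior(x, y, alpha=1.0, prior_same=0.5):
    """Posterior probability that count vectors x and y share one multinomial.

    Compares 'one shared distribution' against 'two independent distributions',
    each with a symmetric Dirichlet(alpha) prior. Multinomial coefficients
    cancel in the ratio, so only Beta-function terms remain.
    """
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    prior = [alpha] * len(x)
    pooled = [a + xi + yi for a, xi, yi in zip(prior, x, y)]
    only_x = [a + xi for a, xi in zip(prior, x)]
    only_y = [a + yi for a, yi in zip(prior, y)]
    log_bf = log_beta(pooled) + log_beta(prior) - log_beta(only_x) - log_beta(only_y)
    logit = log_bf + math.log(prior_same / (1.0 - prior_same))
    # Numerically stable logistic function.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def bayesian_distance(x, y):
    """Turn the posterior into a distance in [0, 1] for use in clustering."""
    return 1.0 - same_source_posterior(x, y)
```

For example, read counts of two contigs across three samples, [50, 10, 5] and [100, 20, 10], have the same proportions and give a posterior above 0.5, whereas [50, 10, 5] and [5, 10, 50] give a posterior close to 0. An agglomerative step could then merge the pair of clusters with the smallest such distance and stop once the distance exceeds a chosen threshold.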
mcdevol.py -i bamfiles -c contig.fasta
-i, --input
directory in which all bamfiles are present
-c, --contigs
a fasta file for contig sequences (single-sample or co-assembly)
note: input bamfiles should be unsorted, i.e., in the default output order of aligners, where alignments are arranged by read name. As of now, McDevol supports bamfiles from the bwa-mem and bowtie2 tools.
-l, --minlength
minimum length of contigs to be considered for binning [default 1000 bp]
-o, --output
the name of output file [default, 'mcdevol']
-d, --outdir
output directory [default, working directory]
--fasta
output fasta file for each bin
mcdevol.py -h or mcdevol.py
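To illustrate what the -l, --minlength option does conceptually, the following Python sketch reads FASTA records and keeps only contigs at or above a length cutoff. It is a hypothetical helper for inspecting an assembly before binning, not the filtering code used inside McDevol.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:  # ignore sequence text before the first header
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep only records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]
```

A typical use would be to open the contig file and count how many contigs survive a given cutoff, for example len(filter_by_length(read_fasta(open("contig.fasta")), 1000)).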
We recommend single-sample assembly to obtain contigs, as it reduces ambiguous assemblies of strain genomes. Map the reads of each sample to the concatenated list of contigs and run McDevol. Bins obtained from single-sample assemblies are redundant because the same genomic region can be represented by multiple contigs assembled independently from different samples. To remove this redundancy, we recommend the following post-binning redundancy reduction steps.
When contigs are assembled separately from each sample, perform post-binning assembly and clustering on every bin produced by McDevol. This requires plass (https://github.com/soedinglab/plass) and MMseqs2 (https://github.com/soedinglab/MMseqs2) to be installed separately. The parameters specified below for plass and mmseqs2-linclust are essential: we recommend that users keep long contigs after assembly (we have observed that megahit and plass nuclassemble chop contigs of nearly genome size into smaller pieces because of their assembly strategies) and cluster only contigs of >=99.0% sequence identity.
plass nuclassemble bin<0..N>.fasta bin<0..N>_assembled.fasta tmp --max-seq-len 10000000 --keep-target false --contig-output-mode 0 --min-seq-id 0.990 --chop-cycle false
mmseqs easy-linclust bin<0..N>.fasta output<0..N> tmp --min-seq-id 0.970 --min-aln-len 200 --cluster-mode 2 --shuffle 0 -c 0.99 --cov-mode 1 --max-seq-len 10000000
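Running the two commands for every bin can be scripted. The Python sketch below builds exactly the argument lists shown above for bins 0..N and, by default, only prints them. The helper names and the assumption that bin files are named bin0.fasta, bin1.fasta, ... in the working directory are illustrative, not taken from McDevol.

```python
import shlex
import subprocess


def plass_command(i):
    """Argument list for the plass nuclassemble step on bin i."""
    return ["plass", "nuclassemble", f"bin{i}.fasta", f"bin{i}_assembled.fasta", "tmp",
            "--max-seq-len", "10000000", "--keep-target", "false",
            "--contig-output-mode", "0", "--min-seq-id", "0.990",
            "--chop-cycle", "false"]


def mmseqs_command(i):
    """Argument list for the mmseqs easy-linclust step on bin i."""
    return ["mmseqs", "easy-linclust", f"bin{i}.fasta", f"output{i}", "tmp",
            "--min-seq-id", "0.970", "--min-aln-len", "200",
            "--cluster-mode", "2", "--shuffle", "0", "-c", "0.99",
            "--cov-mode", "1", "--max-seq-len", "10000000"]


def run_all(last_bin, dry_run=True):
    """Print (dry_run=True) or execute both steps for bins 0..last_bin."""
    for i in range(last_bin + 1):
        for cmd in (plass_command(i), mmseqs_command(i)):
            if dry_run:
                print(shlex.join(cmd))
            else:
                subprocess.run(cmd, check=True)  # stop on the first failure


if __name__ == "__main__":
    run_all(last_bin=2)  # prints the commands for bin0, bin1 and bin2
```

Keeping dry_run=True first makes it easy to inspect the generated commands before plass and MMseqs2 are actually invoked.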