MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.
The MMseqs2 user guide is available as Github Wiki or as PDF file (Thanks to pandoc!)
Keep posted about MMseqs2/Linclust updates by following Martin on twitter.
05/01/2018 New version of Linclust manuscript uploadd to bioRxiv: Steinegger M and Soeding J. Clustering huge protein sequence sets in linear time. bioRxiv (2018). The combined MMseqs2/Linclust workflow (now the default when calling "mmseqs cluster" combines extreme speed and linear time complexity with BLAST-like sensitivity.
17/10/2017 MMseqs2 has just been published at Nature Biotechnology.
05/25/2017 We updated the Linclust manuscript. Linclust is now 3x faster and we added a metagenomic protein assembly application. Happy towel day. A pre-print can be downloaded here: Steinegger M and Soeding J. Linclust: clustering billions of protein sequences per day on a single server (2017).
30/01/2017 We added a new clustering workflow called "Linclust". Linclust can cluster sequences in linear time down to 50% sequence identity. The Metaclust95 and Metaclust50 database can be download at metaclust.mmseqs.com. A preprint can be downloaded here: Steinegger M and Soeding J. Linclust: clustering protein sequences in linear time (2017).
19/12/2016 MMseqs2 has a mascot now. It is the "little Marv" and was lovingly crafted by Yuna Kwon. Thank you so much.
07/12/2016 We added a new parameter called --max-accept. This parameter limits the amount of alignments that get accepted. Please do not use --max-seqs to limit your result size since it decreases the sensitivity of MMseqs2.
MMseqs can be installed by compiling the binary from source, download a statically compiled version, using Homebrew or Docker. MMseqs2 requires a 64-bit system (check with uname -a | grep x86_64
) with at least the SSE4.1 instruction set (check by executing cat /proc/cpuinfo | grep sse4_1
on Linux and sysctl -a | grep machdep.cpu.features | grep SSE4.1
on MacOS).
Compiling MMseqs2 from source has the advantage that it will be optimized to the specific system, which should improve its performance. To compile MMseqs2 git
, g++
(4.6 or higher) and cmake
(3.0 or higher) are needed. Afterwards, the MMseqs2 binary will be located in build/bin/
.
git clone https://github.com/soedinglab/MMseqs2.git
cd MMseqs2
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
make
make install
export PATH=$(pwd)/bin/:$PATH
❗ Please install and use gcc
from Homebrew, if you want to compile MMseqs2 on MacOS. The default MacOS clang
compiler does not support OpenMP and MMseqs2 will not be able to run multithreaded. Use the following cmake call:
CXX="$(brew --prefix)/bin/g++-6" cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
The following command will download the last MMseqs version, extract it and set the PATH
variable. This version runs only on linux. If you want to run it on Mac please compile it or use brew.
If your computer supports AVX2 use this (faster than SSE4.1, check by executing cat /proc/cpuinfo | grep avx2
on Linux and sysctl -a | grep machdep.cpu.leaf7_features | grep AVX2
on MacOS):
wget https://mmseqs.com/latest/mmseqs-static_avx2.tar.gz
tar xvzf mmseqs-static_avx2.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH
If your computer supports SSE4.1 use:
wget https://mmseqs.com/latest/mmseqs-static_sse41.tar.gz
tar xvzf mmseqs-static_sse41.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH
MMseqs comes with a bash command and parameter auto completion by pressing tab. The bash completion for subcommands and parameters can be installed by adding the following lines to your $HOME/.bash_profile:
if [ -f /Path to MMseqs2/util/bash-completion.sh ]; then . /Path to MMseqs2/util/bash-completion.sh fi
You can install MMseqs2 for Mac OS through Homebrew by executing the following:
brew install https://raw.githubusercontent.com/soedinglab/mmseqs2/master/Formula/mmseqs2.rb --HEAD
This will also automatically install the bash completion (you might have to execute brew install bash-completion
first). This will also work for Linuxbrew.
You can either pull the official docker image by running:
docker pull soedinglab/mmseqs2
Or build the docker image from the git repository by executing:
git clone https://github.com/soedinglab/MMseqs2.git
cd MMseqs2
docker build -t mmseqs2 .
You can use the query database "queryDB.fasta" and target database "targetDB.fasta" in the examples folder to test the search workflow. First, you need to convert the fasta files into mmseqs database format.
mmseqs createdb examples/QUERY.fasta queryDB
mmseqs createdb examples/DB.fasta targetDB
If the target database is to be used several times, it is recommended to precompute an index of the targetDB as this saves overhead computations. The index should be created on a computer that has the same amount of memory as the computer that performs the search.
Transfer of large database files via NFS quickly becomes time-limiting for MMseqs2. Therefore, ideally the database and database index file should be stored on a fast local drive.
mmseqs createindex targetDB
You need to create a temporary directory in which MMseqs2 will store intermediate results.
mkdir <fast_local_drive>/tmp
It is recommend to create this tmp on a local drive to reduce load on the NFS.
The mmseqs search
searches the queryDB
against the targetDB
. The sensitivity can be adjusted with -s
and should be adapted based on your use case (see Set sensitivity -s parameter). If you want to use alignment backtraces in later steps add the option -a
. An iterative profile search (like PSI-BLAST) can be trigged with --num-iterations
.
Please ensure that in case of large input databases tmp provides enough free space. For the disc space requirements, see the user guide. To run the search type:
mmseqs search queryDB targetDB resultDB tmp
Then convert the result database into a BLAST-tab formatted database (format: qId, tId, seqIdentity, alnLen, mismatchCnt, gapOpenCnt, qStart, qEnd, tStart, tEnd, eVal, bitScore).
mmseqs convertalis queryDB targetDB resultDB resultDB.m8
Use the option --format-mode 1
to convert the results to pairwise alignments. Make sure that you searched with the option -a
(mmseqs search ... -a
).
mmseqs convertalis queryDB targetDB resultDB resultDB.pair --format-mode 1
Before clustering, convert your database into the mmseqs database format:
mmseqs createdb examples/DB.fasta DB
Create a temporary folder for the run.
mkdir tmp
Please ensure that in case of large input databases the temporary folder tmp provides enough free space. For the disc space requirements, see the user guide.
mmseqs cluster DB clu tmp
To generate a FASTA-style formatted output file from the ffindex output file, type:
mmseqs createseqfiledb DB clu clu_seq
mmseqs result2flat DB DB clu_seq clu_seq.fasta
To generate a TSV-style formatted output file from the ffindex output file, type:
mmseqs createtsv DB DB clu clu.tsv
To extract the representative sequences from the clustering result call:
mmseqs result2repseq DB DB_clu DB_clu_rep
mmseqs result2flat DB DB DB_clu_rep DB_clu_rep.fasta --use-fasta-header
Read more about the format here.
When using MMseqs the available memory limits the size of database you will be able to compute. We recommend at least 128 GB of RAM so you can compute databases up to 30.000.000 entries:
You can calculate the memory requirements in bytes for L columns and N rows using the following formula:
M = (7 × N × L) byte + (8 × a^k) byte
MMseqs stores an index table and two auxiliary arrays, which have a total size of M byte
.
For a database containing N sequences with an average length L, the memory consumption of the index table is (7 × N × L) byte
.
Note that the memory consumption grows linearly with the number of the sequences N in the database.
The two auxiliary arrays consume (8 × a^k) byte
, with a
being the size of the amino acid alphabet (usually 21 including the unknown amino acid X) and the k-mer size k
.
MMseqs2 can run on multiple cores and servers using OpenMP (OMP) and message passing interface (MPI).
MPI assigns database splits to each servers and each server computes them using multiple cores (OMP).
Currently prefilter
, align
, result2profile
, swapresults
can take advantage of MPI.
To parallelize the time-consuming k-mer matching and gapless alignment stages prefilter
among multiple servers, two different modes are available. In the first, MMseqs2 can split the target sequence set into approximately equal-sized chunks, and each server searches all queries against its chunk. Alternatively, the query sequence set is split into equal-sized chunks and each server searches its query chunk against the entire target set. Splitting the target database is less time-efficient due to the slow, IO-limited merging of results. But it reduces the memory required on each server to (7 × N L/#chunks) byte + (a^k × 8) byte
and allows users to search through huge databases on servers with moderate memory sizes. If the number of chunks is larger than the number of servers, chunks will be distributed among servers and processed sequentially. By default, MMseqs2 automatically decides which mode to pick based on the available memory (assume that all machines have the same amount of memory).
Make sure that MMseqs2 was compiled with MPI by using the -DHAVE_MPI=1
flag (cmake -DHAVE_MPI=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=. ..
). Our precomplied static version of MMseqs2 can not use MPI.
To search with multiple server just call the search and add the RUNNER variable. The TMP folder has to be shared between all nodes (e.g. NFS)
RUNNER="mpirun -np 42" mmseqs search queryDB targetDB resultDB tmp
For clustering just call the clustering. The TMP folder has to be shared between all nodes (e.g. NFS)
RUNNER="mpirun -np 42" mmseqs cluster DB clu tmp