(C) Chengze Shen
EMMA is an ensemble use of `MAFFT --add` (particularly, MAFFT with the `-linsi` option) on large datasets. According to the MAFFT webpage, `MAFFT-linsi --add` is accurate for adding sequences to an existing alignment but is only recommended for a few hundred sequences. This project aims to scale `MAFFT-linsi --add` to run on large datasets with hundreds of thousands of sequences with similar (and sometimes better) alignment accuracy.
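For readers unfamiliar with the underlying command, the following is a minimal sketch of how a single `MAFFT-linsi --add` invocation can be assembled from Python. It is not taken from EMMA's code: the function name and file names are hypothetical, and it assumes the standard MAFFT options `--localpair --maxiterate 1000` (the L-INS-i settings), `--thread`, and `--add`. The command is only built, not executed.

```python
import shlex

def linsi_add_command(existing_alignment, new_sequences, threads=1):
    """Build a MAFFT L-INS-i --add command line (returned as a list, not run)."""
    return [
        "mafft",
        "--localpair", "--maxiterate", "1000",  # L-INS-i settings
        "--thread", str(threads),
        "--add", new_sequences,                 # unaligned sequences to add
        existing_alignment,                     # alignment that receives them
    ]

# Hypothetical file names, for illustration only.
cmd = linsi_add_command("backbone.aln.fasta", "queries.fasta", threads=4)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

MAFFT writes the resulting alignment to standard output, so a caller would typically redirect it to a file.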
- (NEW) Checkpoint system! You can now resume from any point if a previous run was interrupted (except for the HMMSearch step, whose checkpoint support is still being implemented).
- (NEW) Now automatically detects the input data type/molecule (`amino`, `dna`, or `rna`).
- (NEW) Now has a progress bar for all intermediate steps (for better progress tracking!).
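To give a sense of what automatic molecule detection can look like, here is a simple composition-based heuristic. This is an illustrative assumption, not EMMA's actual detection logic: the function name `guess_molecule` and the 0.9 threshold are hypothetical.

```python
def guess_molecule(sequences, threshold=0.9):
    """Guess 'amino', 'dna', or 'rna' from residue composition (heuristic)."""
    # Pool all residues and drop gap characters.
    residues = "".join(sequences).upper().replace("-", "").replace(".", "")
    if not residues:
        raise ValueError("no residues to inspect")
    # Fraction of characters that look like nucleotides.
    nucleotide_like = sum(residues.count(c) for c in "ACGTUN")
    if nucleotide_like / len(residues) < threshold:
        return "amino"
    # Nucleotide data: more uracil than thymine suggests RNA.
    return "rna" if residues.count("U") > residues.count("T") else "dna"
```

A heuristic like this can misclassify short or low-complexity protein sequences, so an explicit override is usually worth keeping.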
- Add checkpoint support.
- Still need HMMSearch step checkpoint system.
- Add more customizable configuration support, as in WITCH.
- Finish up the pipeline so it supports building an alignment from scratch and not relying on UPP output.
Given an existing input alignment $C$ (the backbone alignment) on a sequence set $S$ and a set of unaligned query sequences $Q$, EMMA proceeds as follows:

1. Construct a set of constraint sub-alignments from $C$: Decompose $C$ into sub-alignments using the UPP decomposition strategy, but keep only sub-alignments with $|S_i|$ sequences, where $l\leq |S_i|\leq u$ ($l,u$ are user-provided free parameters; the default values are $l=10,u=25$). This step creates a set of subsets that can overlap.
2. Define the set of sub-problems: Assign each query sequence $q\in Q$ to the best-fitting sub-alignment from Step 1. The assignment is determined by first constructing HMMs on the sub-alignments and then selecting, for each $q$, the HMM with the highest adjusted bitscore (see WITCH).
3. Run `MAFFT-linsi --add` on each sub-problem: For each sub-problem (i.e., a sub-alignment $C_i$ on set $S_i$ with assigned query sequences $Q_i$), construct an extended sub-alignment on $S_i\cup Q_i$ using `MAFFT-linsi --add`.
4. Merge extended sub-alignments using transitivity: All extended sub-alignments are consistent with each other (see proof in the main paper) and can be merged into the backbone alignment by transitivity (see SEPP/UPP). The merging produces the final alignment on $S\cup Q$.
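The bookkeeping behind Steps 1 and 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not EMMA's implementation: the UPP decomposition and the HMM scoring are not performed here, the function names are hypothetical, and the numeric scores are made-up stand-ins for adjusted bitscores.

```python
def select_subsets(subsets, lower=10, upper=25):
    """Keep only subsets whose size lies in [lower, upper] (Step 1 size filter)."""
    return {name: seqs for name, seqs in subsets.items()
            if lower <= len(seqs) <= upper}

def assign_queries(scores):
    """Map each query to the subset with the highest score (Step 2).

    scores: {query: {subset_name: score}}; the scores stand in for
    adjusted bitscores, which this sketch does not compute.
    """
    subproblems = {}
    for query, per_subset in scores.items():
        best = max(per_subset, key=per_subset.get)
        subproblems.setdefault(best, []).append(query)
    return subproblems

# Toy data: subsets A and B overlap (both contain s5..s11); C is too large.
subsets = {
    "A": [f"s{i}" for i in range(12)],
    "B": [f"s{i}" for i in range(5, 30)],
    "C": [f"s{i}" for i in range(40)],
}
kept = select_subsets(subsets)

# Made-up scores of each query against the HMMs of the kept subsets.
scores = {
    "q1": {"A": 31.2, "B": 45.0},
    "q2": {"A": 12.5, "B": 9.1},
    "q3": {"A": 7.0, "B": 7.5},
}
subproblems = assign_queries(scores)
```

Each entry of `subproblems` then corresponds to one sub-problem for Step 3: a sub-alignment together with the query sequences assigned to it.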
- Accepted at WABI 2023.
- Published in Algorithms for Molecular Biology (https://doi.org/10.1186/s13015-023-00247-x).
EMMA was tested and benchmarked on the following systems:
- Red Hat Enterprise Linux Server release 7.9 (Maipo) with Python 3.7.0
- Ubuntu 22.04 LTS with Python 3.7.12
EMMA requires MAFFT binaries. One is provided with the package (v7.490 2021/Oct/30), but MAFFT binaries found in the user's `$PATH` environment variable are prioritized. If you experience any difficulties running EMMA, please contact Chengze Shen (chengze5@illinois.edu).
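The lookup order described above (a binary on `$PATH` first, the bundled copy second) can be sketched like this. It is a minimal illustration, not EMMA's actual code: the function name `find_mafft` and the bundled path in the usage comment are hypothetical.

```python
import os
import shutil

def find_mafft(bundled_path, name="mafft"):
    """Return the MAFFT binary to use: $PATH first, bundled copy second."""
    on_path = shutil.which(name)
    if on_path is not None:
        return on_path
    # Fall back to the copy shipped with the package, if it is executable.
    if os.path.isfile(bundled_path) and os.access(bundled_path, os.X_OK):
        return bundled_path
    raise FileNotFoundError(f"no MAFFT binary on $PATH or at {bundled_path}")

# Example call with a hypothetical bundled location:
# mafft_bin = find_mafft("tools/mafft/mafft")
```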
python>=3.7
configparser>=5.0.0
dendropy>=4.5.2,<4.6.0
numpy>=1.15
psutil>=5.0
scipy>=1.1.0
tqdm>=4.0.0
# 1. Install via GitHub repo
git clone https://github.com/c5shen/EMMA.git
# 2. Install all requirements
cd EMMA
pip3 install -r requirements.txt
# 3. Use emma.py, -h to see allowed commandline parameters
python3 emma.py [-h]
Scripts for the following examples can be found in `example/run.sh`. You can run each scenario with
./run.sh [i] # i can be 1, 2, or 3
python3 emma.py -b [input alignment] -e [input tree] -q [unaligned sequences] -d [output directory] -o est.aln.fasta
python3 emma.py -b [input alignment] -q [unaligned sequences] -d [output directory] -o est.aln.fasta
# > the "backbone sequences" will be selected from inputs and aligned with default MAGUS
# > a tree will be created for the backbone alignment using FastTree2
python3 emma.py -i [input sequences] -d [output directory] -o est.aln.fasta