- Grigorjew, A., Gynter, A., Dias, F.H. et al. Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters using EMERALD. Genome Biol 24, 168 (2023). https://doi.org/10.1186/s13059-023-03008-6
All datasets, including the 396k protein sequences, DIAMOND-clusters, EMERALD-output and annotation files used in the EMERALD paper can be retrieved from figshare: https://doi.org/10.6084/m9.figshare.21720299.
The file motifs.ipynb
contains reporoducible scripts. An erratum to the paper was published.
- Emerald https://github.com/algbio/emerald
- Diamond https://github.com/bbuchfink/diamond
- Muscle http://www.drive5.com/muscle/
- ete3 https://github.com/etetoolkit/ete
- Hmmer https://github.com/EddyRivasLab/hmmer
Compile Diamond from source (version 2.0.8 does not have issues with clustering):
wget https://github.com/bbuchfink/diamond/archive/refs/tags/v2.0.8.tar.gz
tar -xvzf v2.0.8.tar.gz
cd diamond-2.0.8
mkdir bin
cd bin
module load GCC CMake # Only on computer cluster
cmake -DEXTRA=ON ..
make
Compile Hmmer and easel from source:
git clone https://github.com/EddyRivasLab/hmmer
cd hmmer
git clone https://github.com/EddyRivasLab/easel
module load Autoconf # Only on computer cluster
autoconf
./configure
make
make check # optional: run automated tests
cd easel
make
Download muscle: http://www.drive5.com/muscle/downloads.htm wget http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz tar -xvzf v2.0.8.tar.gz
ete3 and snakemake via Conda environment
cd alignment-safety/pipeline
conda env create -f environment.yaml
conda activate pipeline
(optional) Compile Raxml-ng from source:
git clone https://github.com/amkozlov/raxml-ng
cd raxml-ng
mkdir build
cd build
cmake ..
make
(optional) Follow instruction for MMseqs2 installation (compilation from source recommended for better performance): https://github.com/soedinglab/MMseqs2#installation
db_file
Path to the protein sequence database (fasta format), e.g.uniprot_sprot.fasta
.family_db
Path to the Pfam protein domain database:Pfam-A.seed
.diamond
Path to DIAMOND, e.g../../diamond-2.08
.raxml
Path to RaxML-ng, e.g../../raxml-ng
.muscle
Path to Muscle - multiple sequence alignment tool, e.g../../muscle
.hmmer
Path to the hmmer programs, e.g../../hmmer-3.3.2
.safety
Parent folder of the compiled safety-window program,./../safety-windows
min_identity
Identity threshhold-% minimum to report and alignment. Used in db clustering process. Tested 20%-90%.sensitivity
auto
, or any of the DIAMOND sensitivity options: https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#sensitivity-modesclustering_algorithm
mcl
(DIAMOND),multi-step
(DIAMOND) ormmseqs
(MMSeqs2).clustering_min_size
Treshhold of the minimum cluster size to include.
Warning:< 20
, will produce large amount of files.clustering_max_size
Treshhold of the maximum cluster size to include.cluster_number
If less than available clusters,cluster_number
amount of clusters will be chosen randomly to include.
Speeds up debugging/testing.ref_criterion
--clustering
- default, depends on clustering: mcl or multi-step--identity
- Highest mean pair-wise identity score--highlow
- Highest lowest pair-wise identity score--similarity
- Highest hmmsearch score (i.e. most similar sequence to the MSA of the cluster)--taxonomy
- Highest node in taxonomic tree
snakemake -j n separate_clusters
n
Number of parallel processes. Clusters the database and separates clusters satisfying the treshholds toWORK_DIR/
WORK_DIR/fasta/
Each fasta-file corresponds to one cluster. Name of the file as well as the first sequence in the file is the reference sequence of that cluster.WORK_DIR/clean/
Same asfasta/
, but fasta sequences are cleaned with no additional information, such as taxonomic id. Needed for some programs, such as Muscle.WORK_DIR/refs/
One file containing each cluster's reference sequence in fasta-format. Needed forphmmer
.
snakemake -j n rule
all
Runs rules:safe
,identity
,hmmsearch
,hmmscan
,phmmer
and their prerequisites.safe
Runs Safety-Window-program on all clusters. Outputs toWORK_DIR/safety/
.identity
Runsesl-alipid
-program on all clusters to calculate pairwise identities. Outputs toWORK_DIR/id/
.hmmsearch
Runshmmsearch
on all clusters. Outputs toWORK_DIR/hmmsearch/
.hmmscan
Runshmmscan
on all clusters. Outputs toWORK_DIR/hmmscan/
.phmmer
Runsphmmer
on all clusters. Outputs toWORK_DIR/phmmer/
.
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh
- install into
/proj/<username>/mamba
- Dependencies should be located in
/proj/<username>/
($PROJ) - Data such as Swissprotein and Pfam DB should be located somewhere in
/wrk-vakka/users/<username>
($WRKDIR) - Work (wrkdir) and temporary (tempdir) directory should be located in
/wrk-vakka/users/<username>
($WRKDIR)
turso/run.sh rule -j <num_of_maximum_parallel_processes>
- For clustering -j 56 should conclude in 30-60min
- Separating and changing reference -j 1-4
- To stream the progress
less +F logs/latest/progress.log
ctrl + c
thenq
exits the stream. Pipeline is still running in background.- Logs are found in
logs/<datetime>/<job_id>.out
slurm w q
to see running jobsslurm w qq
to see running jobs with resource usagescancel -M ukko2 <job_id>
to cancel jobseff -M ukko2 <job_id>
to see used resources and run time of jobsqueue -o '%A %.28R %j' -u <username>
to see if you have any jobs running
It's not possible to run rules, such as, all
, msa
, safe
, identity
, hmmsearch
, hmmscan
, phmmer
before separating clusters separate_clusters
- This is due to that Snakemake will look for files in
fasta/
andclean/
to execute these rules. Before separating clusters there are no files in those folders.
- This is due to disk quota limits on Turso. Ete3 tries to download taxonomy database into
~
which is not meant for data storage. This exceeds disk quota limit and is interrupted. - Workaround solution to fix this is to run Ete3 on your local machine and copy contents of
~/.etetoolkit
on your local machine into $WKRDIR (e.g./wrk-vakka/users/<username>/ncbi
) and make symbolic linkln -s /wrk-vakka/users/<username>/ncbi ~/.etetoolkit
- Use Diamond v2.0.8 for now
- If working on cluster download and extract somewhere in $WRKDIR
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
- Add path to parameters.yaml and/or turso/parameters.yaml
- If working on cluster download and extract somewhere in $WRKDIR
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.seed.gz
gzip -d Pfam-A.seed.gz
- Add path to parameters.yaml and/or turso/parameters.yaml
- University of Helsinki HPC user guide: https://wiki.helsinki.fi/display/it4sci/HPC+Environment+User+Guide
- Hmmer documentation: http://eddylab.org/software/hmmer/Userguide.pdf
- Guide for slurm and other useful documentation (everything might not be applicable for Turso) https://scicomp.aalto.fi/