Roary ILP Bacterial core Annotation Pipeline.
Annotate genes in your bacterial genomes with Prokka and determine a pangenome with the great Roary. The initial gene clusters found by Roary are further refined with the usage of ILPs that solve the best matching for each pairwise strain MMseqs2 comparison.
- Overview
- Quick start
- Execution examples
- Example output
- Install
- Runtime and disk space
- Limitations
- Publication
- References
A common task when you have a bunch of bacterial genomes at hand is calculating a core gene set or pangenome. We want to know, which genes are homologs and shared between a set of bacteria. However, defining homology only based an sequence similarity often underestimates the true core gene set, particularly when diverse species are compared. RIBAP combines sequence homology information from Roary with smart pairwise ILP calculations to produce a more complete core gene set - even on genus level. First, RIBAP performs annotations with Prokka, calculates the core gene set using Roary and pairwise ILPs, and finally visualizes the results in an interactive HTML table garnished with protein multiple sequence alignments and trees. RIBAP comes with Nextflow and Docker/Singularity/Conda options to resolve all necessary software dependencies for easy execution.
You just need a working nextflow
and docker
or singularity
or conda
installation, see below! We recommand the usage of containers! You have nextflow
and docker
? Give it a try:
nextflow pull hoelzer-lab/ribap
nextflow info hoelzer-lab/ribap
# select a release version, e.g. 1.0.0
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta "$HOME/.nextflow/assets/hoelzer-lab/ribap/data/*.fasta" -profile local,docker
You have nextflow
and conda
? Okay, just change the profile:
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta "$HOME/.nextflow/assets/hoelzer-lab/ribap/data/*.fasta" -profile local,conda
You need some of these dependencies? Check the end of this README!
# Get or update the workflow:
nextflow pull hoelzer-lab/ribap
# Run a specific release, check whats available:
nextflow info hoelzer-lab/ribap
# Or get newest release version automatically and show the help:
REVISION=$(nextflow info hoelzer-lab/ribap | sed 's/ [*]//' | sed 's/ //g' | sed 's/\[t\]//g' | awk 'BEGIN{FS=" "};{if($0 ~ /^ *[0-9]/){print $0}}' | sort -Vr | head -1)
nextflow run hoelzer-lab/ribap -r $REVISION --help
# Run with IQ-TREE tree calculation using all identified core genes and specified output dir.
# Attention: less than 1000 bootstrap replicates (default) can not be used.
# For intermediate files, a work dir via -w is defined.
# Note, that Nextflow build-in parameters use a single dash "-" symbol.
nextflow run hoelzer-lab/ribap -r $REVISION --fasta '*.fasta' --tree --bootstrap 1000 --outdir ~/ribap -w ribap-work
# The ILPs can take a lot (!) of space. But you can use this flag to keep them anyway. Use with caution!
nextflow run hoelzer-lab/ribap -r $REVISION --fasta '*.fasta' --outdir ~/ribap -w ribap-work --keepILPs
# Run with optional reference GenBank file to guide Prokka annotation.
# ATTENTION: this will use the additional reference file for every input genome!
nextflow run hoelzer-lab/ribap -r $REVISION --fasta '*.fasta' --reference GCF_000007205.1_ASM720v1_genomic.gbff --outdir ~/ribap -w work
# Use list parameter to provide genome FASTAs and corresponding reference GenBank files in CSV format.
nextflow run hoelzer-lab/ribap -r $REVISION --list --fasta genomes.csv --reference refs.csv --outdir ~/ribap -w work
# genomes.csv:
#
# genome1,genome1.fasta
# genome2,genome2.fasta
# genome3,genome3.fasta
# refs.csv
#
# genome1,ref.gbff
# genome2,ref.gbff
#
# Here, genome1 and genome2 will additionally use information from ref.gbff in Prokka annotation while genome3 will be annotated w/o additional reference information.
RIBAP stores all important output in --output results
. Besides CDS annotations, alignments, trees, and the core gene groups one of the main output files is an interactive HTML table that summarizes the identified RIBAP groups and can be explored by the user:
Further output examples for a small test set composed of five Chlamydia genomes can be found here.
- runs with the workflow manager
nextflow
usingdocker
orsingularity
orconda
- this means, all programs are automatically pulled via
docker
,singularity
, orconda
- only
docker
orsinularity
orconda
andnextflow
need to be installed (per defaultconda
is used)
Attention Use the below installation examples with causion and check you system requirements. Its best to refer to the specific manual of each software.
Needed in all cases (conda
, docker
, singularity
)
sudo apt-get update
sudo apt install -y default-jre
curl -s https://get.nextflow.io | bash
sudo mv nextflow /bin/
Just copy the commands and follow the installation instructions. Let the installer configure conda
for you. You need to specify -profile conda
to run the pipeline with conda
support. Note, internally the pipeline will switch to mamba
to faser resolve all environments.
cd
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
See here if you need a different installer besides Linux used above.
If you dont have experience with bioinformatic tools just copy the commands into your terminal to set everything up:
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo usermod -a -G docker $USER
- restart your computer
- try out the installation by entering the following
Docker installation details here
Attention: RIBAP is currently not intended to be used with hundreds or thousands of input genomes (see also Limitations). Also for smaller input sets, you will need quite some disk space to store the ILPs and their results. Due to that, RIBAP is per default not storing the intermediate ILP files and solutions. You can use --keepILPs
to store them, if needed. Below, we report runtime and disk space usage on a regular Linux laptop (8 CPUs used) and a HPC (SLURM, pre-configured CPU and RAM usage, runtime may be biased due to HPC work load) using varying numbers of Chlamydia psittaci genomes (~1 Mbp, originally sampled from here and also available here) as input and with and without the --keepILPs
option. Please also not that we used the default --chunks 8
value for local and HPC execution. Especially on a HPC, the runtime can be increased drastically by increasing the --chunks
value (more parallel ILP solving).
CPUs | #genomes (1 Mbp) | --keepILPs |
time | work space |
output space |
---|---|---|---|---|---|
8 | 8 | NO | 24 min | 1.2 GB | 195 MB |
8 | 8 | YES | 23 min | 5.5 GB | 195 MB |
8 | 16 | NO | 1 h 4 min | 1.5 GB | 398 MB |
8 | 16 | YES | 1 h | 18 GB | 398 MB |
8 | 32 | NO | 3 h 10 min | 2.2 GB | 788 MB |
8 | 32 | YES | 3 h 16 min | 84 GB | 789 MB |
HPC | 8 | NO | 6 min | 1.2 GB | 196 MB |
HPC | 8 | YES | 6 min | 5.5 GB | 196 MB |
HPC | 16 | NO | 17 min | 1.5 GB | 400 MB |
HPC | 16 | YES | 16 min | 21 GB | 400 MB |
HPC | 32 | NO | 1 h 1 min | 2.2 GB | 794 MB |
HPC | 32 | YES | 1 h 11 min | 83 GB | 794 MB |
Commands used (in slight variations):
# laptop
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta '*.fasta' --cores 8 --max_cores 8 -profile local,docker -w work --output ribap-results --keepILPs
# HPC
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta '*.fasta' -profile slurm,singularity -w work --output ribap-results --keepILPs
- The current version of RIBAP is not intended for use with hundreds of input genomes. You can try, but expect a long runtime and high memory requirements (see Runtime and disk space). We recommend running RIBAP with less than 100 input genomes and without the
--keepILPs
parameter to save space. - RIBAP's combination of Roary clusters and ILPs specifically calculates core gene sets for more diverse species. If your input genomes are from the same species, you can also try RIBAP, but you may be better off with tools like Roary, Panaroo, or PPanGGOLiN.
- Currently, we do not classify the final RIBAP groups into pangenome categories, such as core and accessory genomes or persistent/shell/cloud. So RIBAP is very useful to get a good idea of a comprehensive set of core genes for different species, and also of other groups that cover only subsets of the input genomes without calling them "accessory" or "shell"/"cloud." Still, tools like the ones above can help you better classify genes into these categories if needed.
If you find RIBAP useful please cite it in your work:
Lamkiewicz, K., Barf, LM. & Hölzer, M. (hopefully in 2023!). Pangenome calculation beyond the species level with RIBAP: A comprehensive bacterial core genome annotation pipeline based on Roary and pairwise ILPs . In prep!
However, RIBAP would not be possible without the help of many great open source bioinformatics tools. Kudos to all the developers and maintainers! Please cite them in your work as well!
In particular, RIBAP takes advantage of and uses the following tools:
Page, Andrew J., et al. "Roary: rapid large-scale prokaryote pan genome analysis." Bioinformatics 31.22 (2015): 3691-3693.
Seemann, Torsten. "Prokka: rapid prokaryotic genome annotation." Bioinformatics 30.14 (2014): 2068-2069.
Steinegger, Martin, and Johannes Söding. "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets." Nature biotechnology 35.11 (2017): 1026-1028.
Martinez FV, Feijão P, Braga MD, Stoye J. "On the family-free DCJ distance and similarity" Algorithms Mol Biol. 2015 Apr 1;10(1):13
Makhorin, Andrew. "GLPK (GNU linear programming kit)." http://www. gnu. org/s/glpk/glpk. html (2008).
Katoh, Kazutaka, and Daron M. Standley. "MAFFT multiple sequence alignment software version 7: improvements in performance and usability." Molecular biology and evolution 30.4 (2013): 772-780.
Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin. "FastTree 2 - approximately maximum-likelihood trees for large alignments." PloS one 5.3 (2010): e9490.
Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li. "CD-HIT: accelerated for clustering the next generation sequencing data." Bioinformatics, (2012), 28 (23): 3150-3152
Minh, Bui Quang, et al. "IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era." Molecular biology and evolution 37.5 (2020): 1530-1534.
Conway, Jake R., Alexander Lex, and Nils Gehlenborg. "UpSetR: an R package for the visualization of intersecting sets and their properties." Bioinformatics (2017).