phold - phage annotation using protein structures

phold is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology.

phold uses the ProstT5 protein language model to translate protein amino acid sequences into the 3Di token alphabet used by Foldseek. Foldseek then searches these 3Di sequences against a database of 803k protein structures, most of which were predicted using ColabFold.

Alternatively, you can specify protein structures that you have pre-computed for your phage(s) instead of using ProstT5.

Benchmarking is ongoing, but so far phold strongly outperforms Pharokka, particularly for less characterised phages such as those from metagenomic datasets.

If you have already annotated your phage(s) with Pharokka, phold accepts Pharokka's GenBank output as input, so you can easily update the annotation with more functional predictions!

Tutorial

Check out the phold tutorial at https://phold.readthedocs.io/en/latest/tutorial/.

Documentation

Check out the full documentation at https://phold.readthedocs.io.

Installation

The only way to install phold is from source for now.

PyPI and conda installations will be available soon.

The only required non-Python dependency is Foldseek. To install phold in a conda environment using mamba:

mamba create -n pholdENV -c conda-forge -c bioconda pip foldseek python=3.11
conda activate pholdENV
git clone https://github.com/gbouras13/phold.git
cd phold
pip install -e .

To use phold with a GPU, a GPU-compatible version of PyTorch must be installed.

If it is not installed automatically by the pip installation, please see the PyTorch installation instructions. If you have an older version of CUDA installed, the PyTorch instructions for previous versions may be useful.
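
A quick sanity check after installation can confirm that both executables are on the PATH and whether PyTorch can see a CUDA device. The sketch below is illustrative only: the `check_tool` helper is hypothetical, and it assumes `python3` is the interpreter of the active conda environment.

```shell
# Report whether a command is available on PATH (hypothetical helper).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: MISSING"
    fi
}

check_tool phold
check_tool foldseek

# Ask PyTorch whether a CUDA device is visible. Any failure (no python3,
# no torch, no GPU) is treated as "not available".
if python3 -c 'import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)' >/dev/null 2>&1; then
    gpu_status="available"
else
    gpu_status="not available"
fi
echo "CUDA GPU for PyTorch: $gpu_status"
```

If the GPU is reported as not available, phold can still be run with `--cpu`.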

Once phold is installed, to download and install the database run:

phold install

Note: You will need at least 8GB of free space (the phold databases including ProstT5 are 7.7GB uncompressed).
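
Free space can be checked before starting the download. This is a minimal sketch, assuming the database will be stored on the filesystem that holds the current directory; `target_dir` is a hypothetical variable that you should point at the real location.

```shell
# Directory on the filesystem where the database will be stored
# (assumption: the current directory; change this to match your setup).
target_dir="."
required_kb=$((8 * 1024 * 1024))   # 8GB expressed in 1K blocks

# df -P gives stable POSIX output; -k reports sizes in 1K blocks.
# Column 4 of the second line is the available space.
avail_kb=$(df -Pk "$target_dir" | awk 'NR==2 {print $4}')

if [ "$avail_kb" -ge "$required_kb" ]; then
    echo "OK: enough free space for the phold database"
else
    echo "WARNING: less than 8GB free in $target_dir"
fi
```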

Quick Start

phold takes a GenBank format file output from Pharokka as its input by default.
If you are running phold on a local workstation with a GPU available, using phold run is recommended. It runs both phold predict and phold compare.

phold run -i tests/test_data/NC_043029.gbk -o test_output_phold -t 8

If you do not have a GPU available, add --cpu.
phold run will run in a reasonable time for small datasets with CPU only (e.g. <5 minutes for a 50kbp phage).
However, phold predict will complete much faster if a GPU is available, and a GPU is necessary for large metagenomic datasets to run in a reasonable time.
In a cluster environment, it is most efficient to run phold in 2 steps:

Predict the 3Di sequences with ProstT5 using phold predict. This is massively accelerated if a GPU is available.

phold predict -i tests/test_data/NC_043029.gbk -o test_predictions

Compare the 3Di sequences to the phold structure database with Foldseek using phold compare. This does not utilise a GPU.

phold compare -i tests/test_data/NC_043029.gbk --predictions_dir test_predictions -o test_output_phold -t 8
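
The two steps can be chained in a small wrapper script. The sketch below uses only the options shown in the commands above. The `run_step` helper and the `DRY_RUN` variable are hypothetical conveniences, and by default the commands are printed rather than executed. On a cluster, each step would normally go in its own job (a GPU job for phold predict, a CPU job for phold compare).

```shell
input="tests/test_data/NC_043029.gbk"
predictions="test_predictions"
output="test_output_phold"
threads=8

# Print the command when DRY_RUN=1 (the default here); run it otherwise.
run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: ProstT5 3Di prediction (benefits from a GPU).
# Stop here if prediction fails, since step 2 needs its output.
run_step phold predict -i "$input" -o "$predictions" || exit 1

# Step 2: Foldseek comparison against the phold database (CPU only).
run_step phold compare -i "$input" --predictions_dir "$predictions" \
    -o "$output" -t "$threads"
```

Setting `DRY_RUN=0` in the environment runs the commands instead of printing them.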

Output

The primary outputs are:
- phold_3di.fasta containing the 3Di sequences for each CDS
- phold_per_cds_predictions.tsv containing detailed annotation information on every CDS
- phold_all_cds_functions.tsv containing counts per contig of CDS in each PHROGs category and in the VFDB, CARD, ACRDB and DefenseFinder databases (similar to the pharokka_cds_functions.tsv from Pharokka)
- phold.gbk, a GenBank format file that includes these annotations and keeps any other genomic features (tRNA, CRISPR repeats, tmRNAs) from the Pharokka GenBank input file, if provided
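
After a run, a quick check can confirm that the primary outputs were written and how many CDS have a 3Di sequence. This is a minimal sketch, assuming the files sit directly in the output directory and carry the default `phold` prefix; the `check_outputs` function is hypothetical.

```shell
# List each expected primary output and whether it exists and is non-empty.
# Returns success only if all four files are present.
check_outputs() {
    outdir="$1"
    missing=0
    for f in phold_3di.fasta phold_per_cds_predictions.tsv \
             phold_all_cds_functions.tsv phold.gbk; do
        if [ -s "$outdir/$f" ]; then
            echo "found    $f"
        else
            echo "missing  $f"
            missing=$((missing + 1))
        fi
    done
    # Each FASTA record starts with '>', so counting header lines
    # counts the 3Di sequences.
    if [ -s "$outdir/phold_3di.fasta" ]; then
        echo "3Di sequences: $(grep -c '^>' "$outdir/phold_3di.fasta")"
    fi
    [ "$missing" -eq 0 ]
}

# Example call on the Quick Start output directory.
if check_outputs test_output_phold; then
    echo "all primary outputs present"
else
    echo "some primary outputs are missing"
fi
```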

Usage

Usage: phold [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  citation          Print the citation(s) for this tool
  compare           Runs Foldseek vs phold db
  createdb          Creates foldseek DB from AA FASTA and 3Di FASTA input...
  predict           Uses ProstT5 to predict 3Di tokens - GPU recommended
  proteins-compare  Runs Foldseek vs phold db on proteins input
  proteins-predict  Runs ProstT5 on a multiFASTA input - GPU recommended
  remote            Uses foldseek API to run ProstT5 then foldseek locally
  run               phold predict then compare all in one - GPU recommended

Usage: phold run [OPTIONS]

  phold predict then compare all in one - GPU recommended

Options:
  -h, --help                Show this message and exit.
  -V, --version             Show the version and exit.
  -i, --input PATH          Path to input file in Genbank format or nucleotide
                            FASTA format  [required]
  -o, --output PATH         Output directory   [default: output_phold]
  -t, --threads INTEGER     Number of threads  [default: 1]
  -p, --prefix TEXT         Prefix for output files  [default: phold]
  -d, --database TEXT       Specific path to installed phold database
  -f, --force               Force overwrites the output directory
  --batch_size INTEGER      batch size for ProstT5. 1 is usually fastest.
                            [default: 1]
  --cpu                     Use cpus only.
  --omit_probs              Do not output 3Di probabilities from ProstT5
  --finetune                Use finetuned ProstT5 model
  --finetune_path TEXT      Path to finetuned model weights
  -e, --evalue FLOAT        Evalue threshold for Foldseek  [default: 1e-3]
  -s, --sensitivity FLOAT   sensitivity parameter for Foldseek  [default: 9.5]
  --keep_tmp_files          Keep temporary intermediate files, particularly
                            the large foldseek_results.tsv of all Foldseek
                            hits
  --split                   Split the Foldseek searches by ProstT5 probability
  --split_threshold FLOAT   ProstT5 probability to split by  [default: 60]
  --card_vfdb_evalue FLOAT  Stricter Evalue threshold for Foldseek CARD and
                            VFDB hits  [default: 1e-10]
  --separate                Output separate genbank files for every contig
  --max_seqs INTEGER        Maximum results per query sequence allowed to pass
                            the prefilter. You may want to reduce this to save
                            disk space for enormous datasets  [default: 1000]
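
As an illustration of combining these options, the sketch below loops over a directory of GenBank files and builds one phold run command per file, giving each its own output directory and prefix. The directory names, the `USE_CPU` and `DRY_RUN` variables, and the `process_one` helper are all hypothetical; commands are printed unless `DRY_RUN=0`.

```shell
indir="${INDIR:-pharokka_gbks}"      # hypothetical directory of .gbk inputs
outroot="${OUTROOT:-phold_results}"  # hypothetical parent directory for outputs
threads=8

# Build (and optionally run) the phold run command for one input file.
process_one() {
    gbk="$1"
    name=$(basename "$gbk" .gbk)
    # Use the positional parameters as an argument list so paths stay quoted.
    set -- phold run -i "$gbk" -o "$outroot/$name" -t "$threads" -p "$name"
    # Append --cpu when no GPU is available (USE_CPU=1).
    if [ "${USE_CPU:-0}" = "1" ]; then
        set -- "$@" --cpu
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

for gbk in "$indir"/*.gbk; do
    [ -e "$gbk" ] || continue   # skip when the glob matches nothing
    process_one "$gbk"
done
```

Other options from the list above, such as `-e` or `--keep_tmp_files`, could be added to the `set --` line in the same way.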

Plotting

phold plot creates Circos plots with pyCirclize for your phage(s). For example:

phold plot -i tests/test_data/NC_043029_phold_output.gbk -o NC_043029_phold_plots -t '${Stenotrophomonas}$ Phage SMA6'

Citation

phold is a work in progress and a preprint is planned. Until it is available, if you use phold please cite the GitHub repository https://github.com/gbouras13/phold.

Please be sure to cite the following core dependencies and PHROGs database:

Please also consider citing these supplementary databases where relevant:
