PHIStruct (Phage-Host Interaction Prediction with Structure-Aware Protein Embeddings)

PHIStruct is a phage-host interaction prediction tool that uses structure-aware protein embeddings to represent the receptor-binding proteins (RBPs) of phages. By incorporating structure information, it improves on sequence-only protein embeddings and feature-engineered sequence properties, especially for phages whose RBPs have low sequence similarity to those of known phages.

Preprint: https://doi.org/10.1101/2024.08.24.609479

If you find our work useful, please consider citing:

@article {PHIStruct,
    author = {Gonzales, Mark Edward M. and Ureta, Jennifer C. and Shrestha, Anish M.S.},
    title = {PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings},
    elocation-id = {2024.08.24.609479},
    year = {2024},
    doi = {10.1101/2024.08.24.609479},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2024/08/24/2024.08.24.609479},
    eprint = {https://www.biorxiv.org/content/early/2024/08/24/2024.08.24.609479.full.pdf},
    journal = {bioRxiv}
}

♾️ Run on Google Colab

You can readily run PHIStruct on Google Colab, without the need to install anything on your own computer: https://bit.ly/PHIStruct

🚀 Installation & Usage

Operating System: Windows (using WSL), Linux, or macOS

Clone the repository:

git clone https://github.com/bioinfodlsu/PHIStruct
cd PHIStruct

Create a virtual environment with the dependencies installed via Conda (we recommend using Miniconda):

conda env create -f environment.yaml

Activate this environment by running:

conda activate PHIStruct

Run the installation command that matches your operating system and CPU build (see the last column of the table below) to install and configure the remaining dependencies. You only need to do this once, at installation:

OS/Build	Command for Checking OS/Build	Installation Command
Linux AVX2 Build	`cat /proc/cpuinfo | grep avx2`	`bash init.sh avx2`
Linux SSE2 Build	`cat /proc/cpuinfo | grep sse2`	`bash init.sh sse2`
Linux ARM64 Build	`dpkg --print-architecture` or `uname -m`	`bash init.sh arm64`
macOS	–	`bash init.sh osx`

Note: Running the init.sh script may take a few minutes since it involves downloading a model (SaProt, around 5 GB) from Hugging Face.
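The table lookup above can also be sketched in Python. This is a minimal illustration and not part of the PHIStruct repository: the function names are hypothetical, and preferring AVX2 when a CPU reports both AVX2 and SSE2 is an assumption.

```python
import platform
from pathlib import Path


def pick_build(system: str, machine: str, cpu_flags: set[str]) -> str:
    """Map platform facts to the argument of `bash init.sh <build>`."""
    if system == "Darwin":
        return "osx"
    if system == "Linux":
        if machine in {"aarch64", "arm64"}:
            return "arm64"
        # Assumption: prefer AVX2 when both AVX2 and SSE2 are reported.
        if "avx2" in cpu_flags:
            return "avx2"
        if "sse2" in cpu_flags:
            return "sse2"
    raise ValueError(f"No matching build for {system}/{machine}")


def read_cpu_flags(cpuinfo_path: str = "/proc/cpuinfo") -> set[str]:
    """Collect CPU flags from /proc/cpuinfo; return an empty set if unavailable."""
    path = Path(cpuinfo_path)
    if not path.exists():
        return set()
    flags: set[str] = set()
    for line in path.read_text().splitlines():
        if line.lower().startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    return flags


def detect_build() -> str:
    """Pick the build for the machine this script runs on."""
    return pick_build(platform.system(), platform.machine(), read_cpu_flags())


# Fixed inputs so the example behaves the same on any machine.
print("bash init.sh", pick_build("Linux", "x86_64", {"sse2", "avx2"}))
```

Calling `detect_build()` inside WSL or on a Linux machine reads the real CPU flags; on macOS it returns `osx` regardless of architecture, matching the table.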

Running PHIStruct

python3 phistruct.py --input <input_dir> --model <model_joblib> --output <results_dir>

Replace <input_dir> with the path to the directory storing the PDB files describing the structures of the receptor-binding proteins. Sample PDB files are provided here.
Replace <model_joblib> with the path to the trained model (recognized format: joblib or compressed joblib, framework: scikit-learn). Download our trained model from this link. You do not need to uncompress it; doing so speeds up model loading at the cost of additional storage. Refer to this guide for the list of accepted compressed formats.
Replace <results_dir> with the path to the directory to which the results of running PHIStruct will be written. The results of running PHIStruct on the sample PDB files are provided here.
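If you prefer to launch the script from Python (for example, inside a larger pipeline), the command can be assembled as an argument list. The sketch below uses only the three flags documented above; the helper name and the paths are hypothetical placeholders.

```python
import shlex


def build_phistruct_command(input_dir: str, model_path: str, output_dir: str) -> list[str]:
    """Assemble the argument list for phistruct.py from the three documented flags."""
    return [
        "python3", "phistruct.py",
        "--input", input_dir,
        "--model", model_path,
        "--output", output_dir,
    ]


# Hypothetical paths, used only for illustration.
command = build_phistruct_command("sample_pdb", "model.joblib.gz", "results")
print(shlex.join(command))

# To execute it from the repository root, with the PHIStruct environment active:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when a path contains spaces.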

The results for each protein are written to a CSV file (without a header row). Each row contains two comma-separated values: a host genus and the corresponding prediction score (class probability). The rows are sorted in order of decreasing prediction score. Hence, the first row pertains to the top-ranked prediction.
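As a minimal sketch of how such a result file could be consumed, the snippet below parses the two-column, headerless format and picks the top-ranked host. The function names are hypothetical and the scores are made up for illustration; they are not actual PHIStruct output.

```python
import csv
import io


def read_predictions(handle) -> list[tuple[str, float]]:
    """Parse a headerless results CSV into (host genus, score) pairs."""
    return [(row[0], float(row[1])) for row in csv.reader(handle) if row]


def confident_hosts(predictions, threshold):
    """Keep the predictions whose score reaches the threshold."""
    return [(genus, score) for genus, score in predictions if score >= threshold]


# Made-up scores for illustration only.
sample = io.StringIO("Klebsiella,0.81\nEscherichia,0.12\nPseudomonas,0.07\n")
predictions = read_predictions(sample)

# Rows are already sorted by decreasing score, so the first one is the top prediction.
top_genus, top_score = predictions[0]
print(top_genus, top_score)
print(confident_hosts(predictions, threshold=0.5))
```

For a real result file, replace the `io.StringIO` object with `open(path, newline="")`.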

Under the hood, this script first converts each protein into a structure-aware protein embedding using SaProt and then passes the embedding to a multilayer perceptron trained on all the entries in our dataset with host among the ESKAPEE genera (link). If your machine has a GPU, it will automatically be used to accelerate the protein embedding generation step.

Training PHIStruct

python3 train.py --input <training_dataset>

Replace <training_dataset> with the path to the training dataset. A sample can be downloaded here.

The training dataset should be formatted as a CSV file (without a header row) where each row corresponds to a training sample. The first column is for the protein IDs, the second column is for the host genera, and the next 1,280 columns are for the components of the SaProt embeddings.
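Before training, it can help to check that every row has the expected 2 + 1,280 columns. The following is a minimal sketch of such a check, not part of train.py; the function name, protein ID, and values are made up for illustration.

```python
import csv
import io

EMBEDDING_DIM = 1280  # number of SaProt embedding components per row, as stated above


def parse_training_rows(handle):
    """Yield (protein_id, host_genus, embedding) from a headerless training CSV."""
    for line_number, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2 + EMBEDDING_DIM:
            raise ValueError(
                f"Row {line_number}: expected {2 + EMBEDDING_DIM} columns, got {len(row)}"
            )
        protein_id, host_genus, *components = row
        yield protein_id, host_genus, [float(value) for value in components]


# Synthetic row for illustration; the ID and the embedding values are made up.
fake_row = ",".join(["PROTEIN_1", "Escherichia"] + ["0.0"] * EMBEDDING_DIM)
rows = list(parse_training_rows(io.StringIO(fake_row + "\n")))
print(rows[0][0], rows[0][1], len(rows[0][2]))
```

A row with a missing or extra component raises a `ValueError` that names the offending line, which is easier to debug than a shape mismatch during training.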

This script will output a gzip-compressed, serialized version of the trained model with filename phistruct_trained.joblib.gz.

📚 Description

Motivation: Recent computational approaches for predicting phage-host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information and structure-informed signals related to host specificity.

Method: We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera.

Results: Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%. In that setting, at a relatively high confidence threshold of above 50%, PHIStruct presents a 7% to 9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5% to 6% increase over BLASTp.
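For readers unfamiliar with the metric, class-averaged (macro) F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it on toy data. How a confidence threshold is applied here (relabeling predictions whose score is not above the threshold with a fallback label) is an assumption made for illustration and may differ from the evaluation protocol in the preprint; all names and numbers are hypothetical.

```python
def macro_f1(true_labels, predicted_labels, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in classes:
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def apply_threshold(top_predictions, threshold, fallback="others"):
    """Assumed rule: keep a prediction only if its score is above the threshold."""
    return [genus if score > threshold else fallback for genus, score in top_predictions]


# Toy data for illustration only.
truth = ["Escherichia", "Klebsiella", "Escherichia", "Klebsiella"]
top = [("Escherichia", 0.9), ("Klebsiella", 0.4), ("Klebsiella", 0.7), ("Klebsiella", 0.8)]

predicted = apply_threshold(top, threshold=0.5)
score = macro_f1(truth, predicted, classes=["Escherichia", "Klebsiella"])
print(predicted)
print(round(score, 4))
```

In this toy case, the per-class F1 scores are 2/3 for Escherichia and 1/2 for Klebsiella, so the macro average is 7/12.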

🔬 Dataset of Predicted Structures of Receptor-Binding Proteins

We also release a dataset of protein structures, computationally predicted via ColabFold, of 19,081 non-redundant (i.e., with duplicates removed) receptor-binding proteins from 8,525 phages across 238 host genera. We identified these receptor-binding proteins based on GenBank annotations. For phage sequences without GenBank annotations, we employed a pipeline that uses the viral protein library PHROG and the machine learning model PhageRBPdetect.

🧪 Reproducing Our Results

Project Structure

The experiments folder contains the files and scripts for reproducing our results. Note that additional (large) files have to be downloaded (or generated) following the instructions in the Jupyter notebooks.

Directories

Directory	Description
`data`	Contains the data (including the FASTA files and embeddings)
`preprocessing`	Contains text files related to the preprocessing of host information and the identification of annotated receptor-binding proteins
`rbp_prediction`	Contains the trained model PhageRBPdetect (in JSON format), used for the computational prediction of receptor-binding proteins. Downloaded from this repository (under the MIT License)
`temp`	Contains intermediate output files during preprocessing, exploratory data analysis, and performance evaluation

Jupyter Notebooks

Notebook	Description
`1. Sequence Preprocessing.ipynb`	Preprocessing of host information and identification of annotated receptor-binding proteins
`2. RBP Computational Prediction.ipynb`	Computational prediction of receptor-binding proteins
`3.0. Data Consolidation (SaProt).ipynb` `3.1. Data Consolidation (ProstT5).ipynb` `3.2. Data Consolidation (PST).ipynb` `3.3. Data Consolidation (SaProt with Low-Confidence Masking).ipynb` `3.4. Data Consolidation (SaProt with Structure Masking).ipynb` `3.5. Data Consolidation (SaProt with Sequence Masking).ipynb`	Generation of CSV files consolidating the proteins, phage-host information, and embeddings
`4. Exploratory Data Analysis.ipynb`	Exploratory data analysis
`5.0. Classifier Building & Evaluation (SaProt).ipynb` `5.1. Benchmarking - Classifier Building & Evaluation (ProstT5).ipynb` `5.2. Benchmarking - Classifier Building & Evaluation (PST).ipynb` `5.3. Benchmarking - Classifier Building & Evaluation (ESM-1b).ipynb` `5.4. Benchmarking - Classifier Building & Evaluation (ESM-2).ipynb` `5.5. Benchmarking - Classifier Building & Evaluation (ProtT5).ipynb` `5.6. Benchmarking - Classifier Building & Evaluation (SaProt with Low-Confidence Masking).ipynb` `5.7. Benchmarking - Classifier Building & Evaluation (SaProt with Structure Masking).ipynb` `5.8. Benchmarking - Classifier Building & Evaluation (SaProt with Sequence Masking).ipynb`	Construction of phage-host interaction prediction model, benchmarking, and performance evaluation
`6.0. Comparison.ipynb` `6.1. Plotting - F1.ipynb` `6.2. Plotting - PR Curve.ipynb` `6.3. Confusion Matrix.ipynb`	Tabular and graphical comparison of the performance of our model versus benchmarks

Python Scripts

Script	Description
`ClassificationUtil.py`	Contains the utility functions for constructing the training and test sets, building the phage-host interaction prediction model, and evaluating its performance
`ConstantsUtil.py`	Contains the constants used in the notebooks and scripts
`MLPDropout.py`	Implements a multilayer perceptron with dropout in scikit-learn
`RBPPredictionUtil.py`	Contains the utility functions for the computational prediction of receptor-binding proteins
`SequenceParsingUtil.py`	Contains the utility functions for preprocessing host information and identifying annotated receptor-binding proteins
`StructureUtil.py`	Contains the utility functions for consolidating the embeddings generated via structure-aware protein language models

Folder Structure

Once you have cloned this repository and finished downloading (or generating) all the additional required files following the instructions in the Jupyter notebooks, your folder structure should be similar to the one below:

PHIStruct (root)
- experiments
  - data
    - GenomesDB (Download and unzip)
      - AB002632
      - ...
    - inphared
      - consolidated (Download and unzip)
        - rbp.csv
        - ...
      - embeddings
        - prottransbert (Download and unzip)
          - complete
          - hypothetical
          - rbp
      - fasta (Download and unzip)
        - complete
        - hypothetical
        - nucleotide
        - rbp
      - structure
        - pdb (Download and unzip)
        - rbp_saprot_embeddings (Download and unzip)
          - AAA74324.1_relaxed.r3.pdb.pt
        - rbp_saprot_mask_embeddings (Download and unzip)
          - AAA74324.1_relaxed.r3.pdb.pt
        - rbp_saprot_seq_mask_embeddings (Download and unzip)
          - AAA74324.1_relaxed.r3.pdb.pt
        - rbp_saprot_struct_mask_embeddings (Download and unzip)
          - AAA74324.1_relaxed.r3.pdb.pt
        - rbp_pst_embeddings (Download and unzip)
          - AAA74324.1_relaxed.r3.pdb.pt
        - rbp_prostt5_embeddings.h5 (Download)
        - rbp_saprot_mask_relaxed_r3.csv (Download)
        - rbp_saprot_relaxed_r3.csv (Download)
        - rbp_saprot_seq_mask_relaxed_r3.csv (Download)
        - rbp_saprot_struct_mask_relaxed_r3.csv (Download)
        - rbp_pst_relaxed_r3.csv (Download)
        - rbp_prostt5_relaxed_r3.csv (Download)
    - 3Oct2023_data_excluding_refseq.tsv
    - 3Oct2023_phages_downloaded_from_genbank.gb (Download)
  - preprocessing
  - rbp_prediction
  - temp
  - 1. Sequence Preprocessing.ipynb
  - ...
  - ClassificationUtil.py
  - ...

Dependencies

Operating System: Windows (using WSL), Linux, or macOS

Create a virtual environment with the dependencies installed via Conda (we recommend using Miniconda):

conda env create -f environment_experiments.yaml

Activate this environment by running:

conda activate PHIStruct-experiments

💻 Authors

Mark Edward M. Gonzales
gonzales.markedward@gmail.com
Ms. Jennifer C. Ureta
jennifer.ureta@gmail.com
Dr. Anish M.S. Shrestha
anish.shrestha@dlsu.edu.ph

This is a research project under the Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Philippines.

This research was partly funded by the Department of Science and Technology Philippine Council for Health Research and Development (DOST-PCHRD) under the e-Asia JRP 2021 Alternative therapeutics to tackle AMR pathogens (ATTACK-AMR) program.

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC) and with computing resources from the Machine Learning eResearch Platform (MLeRP) of Monash University, University of Queensland, and Queensland Cyber Infrastructure Foundation Ltd.