
TopoFormer


Title - Multiscale Topology-enabled Structure-to-Sequence Transformer for Protein-Ligand Interaction Predictions.

Authors - Dong Chen, Jian Liu, and Guo-wei Wei


Table of Contents

  • Introduction
  • Model Architecture
  • Getting Started
  • Datasets
  • Usage
  • Results
  • License
  • Citation
  • Acknowledgements
  • Contributors


Introduction

Topological Transformer (TopoFormer) is built by integrating NLP with a multiscale topology technique, the persistent topological hyperdigraph Laplacian (PTHL), which systematically converts intricate 3D protein-ligand complexes at various spatial scales into an NLP-admissible sequence of topological invariants and homotopic shapes. Element-specific PTHLs are further developed to embed crucial physical, chemical, and biological interactions into topological sequences. TopoFormer outperforms conventional algorithms and recent deep learning models, achieving exemplary scoring accuracy and superior performance in ranking, docking, and screening tasks on a number of benchmark datasets. The proposed topological sequences can be extracted from all kinds of structural data to facilitate various NLP models, heralding a new era in AI-driven discovery.

Keywords: Drug design, Topological sequences, Topological Transformer, Multiscale Topology, Hyperdigraph Laplacian.


Model Architecture

[Figure: TopoFormer model architecture]

See the paper for further details about the architecture and its components.


Getting Started

Prerequisites

  • transformers 4.24.0
  • numpy 1.21.5
  • scipy 1.7.3
  • pytorch 1.13.1
  • pytorch-cuda 11.7
  • scikit-learn 1.0.2
  • python 3.9.12

Installation

git clone https://github.com/WeilabMSU/TopoFormer.git
cd TopoFormer

Datasets

| Datasets | Training Set | Test Set |
|----------|--------------|----------|
| Pre-training | Combined PDBbind (19513 complexes): RowData, TopoFeature_small, TopoFeature_large | — |
| Finetuning: CASF-2007 | 1105 Label | 195 Label |
| Finetuning: CASF-2013 | 2764 Label | 195 Label |
| Finetuning: CASF-2016 | 3772 Label | 285 Label |
| Finetuning: PDB v2016 | 3767 Label | 290 Label |
| Finetuning: PDB v2020 | 18904 Label (excluding core sets) | 195 Label (CASF-2007 core set); 195 (CASF-2013 core set); 285 (CASF-2016 core set); 285 (v2016 core set) |
  • RowData: the protein-ligand complex structures, from PDBbind.
  • TopoFeature: the topological embedded features for the protein-ligand complexes. All features are saved in a dict whose keys are protein IDs and whose values are the topological embedded features of the corresponding complexes (see the loading sketch after this list). The download is a .zip file containing two files: (1) TopoFeature_large.npy, topological embedded features with a filtration parameter ranging from 0 to 10 Å in steps of 0.1 Å; (2) TopoFeature_small.npy, topological embedded features with a filtration parameter ranging from 2 to 12 Å in steps of 0.2 Å.
  • Label: a .csv file containing the protein IDs and the corresponding binding affinities in logKa units.
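The snippet below is a minimal sketch of how the downloaded files can be inspected. It assumes numpy and pandas, that each dict value is a numpy array, and borrows illustrative file names from the finetuning example further down; it is not the repository's own loading code.

import numpy as np
import pandas as pd

# The TopoFeature .npy file stores a dict: protein ID -> topological features.
features = np.load("TopoFeature_small.npy", allow_pickle=True).item()
print(len(features))  # number of complexes
pdb_id = next(iter(features))
print(pdb_id, features[pdb_id].shape)  # per-complex feature shape

# A Label .csv maps protein IDs to binding affinities in logKa units.
labels = pd.read_csv("CASF2016_refine_train_label.csv")
print(labels.head())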
| Task | Datasets | Description |
|------|----------|-------------|
| Screening | LIT-PCBA | 3D poses for all 15 targets. Download (13 GB) |
| Screening | PDBbind-v2013 | 3D poses. Download from https://weilab.math.msu.edu/AGL-Score |
| Docking | CASF-2007, CASF-2013 | 3D poses. Download from https://weilab.math.msu.edu/AGL-Score |

Usage

Preparing Topological Sequences

# get the usage
python ./code_pkg/main_potein_ligand_topo_embedding.py -h

# example run
python ./code_pkg/main_potein_ligand_topo_embedding.py --output_feature_folder "../examples/output_topo_seq_feature_result" --protein_file "../examples/protein_ligand_complex/1a1e/1a1e_pocket.pdb" --ligand_file "../examples/protein_ligand_complex/1a1e/1a1e_ligand.mol2" --dis_start 0 --dis_cutoff 5 --consider_field 20 --dis_step 0.1

Fine-Tuning Procedure for Customized Data

bs=32  # batch size
lr=0.00008  # learning rate
ms=10000  # max training steps
finetuning_python_script=./code_pkg/topt_regression_finetuning.py
model_output_dir=./outmodel_finetune_for_regression
mkdir -p $model_output_dir
pretrained_model_dir=./pretrained_model
scaler_path=./code_pkg/pretrain_data_standard_minmax_6channel_large.sav
validation_data_path=./CASF_2016_valid_feat.npy
train_data_path=./CASF_2016_train_feat.npy
validation_label_path=./CASF2016_core_test_label.csv
train_label_path=./CASF2016_refine_train_label.csv

# finetune for regression on one GPU
CUDA_VISIBLE_DEVICES=1 python $finetuning_python_script --hidden_dropout_prob 0.1 --attention_probs_dropout_prob 0.1 --num_train_epochs 100 --max_steps $ms --per_device_train_batch_size $bs --base_learning_rate $lr --output_dir $model_output_dir --model_name_or_path $pretrained_model_dir --scaler_path $scaler_path --validation_data $validation_data_path --train_data $train_data_path --validation_label $validation_label_path --train_label $train_label_path --pooler_type cls_token --random_seed 1234 --seed 1234;
# Script for the case without separate validation data and labels
# (docking and screening):
bs=32  # batch size
lr=0.0001  # learning rate
ms=5000  # max training steps
finetuning_python_script=./code_pkg/topt_regression_finetuning_docking.py
model_output_dir=./outmodel_finetune_for_docking
mkdir -p $model_output_dir
pretrained_model_dir=./pretrained_model
scaler_path=./code_pkg/pretrain_data_standard_minmax_6channel_filtration50-12.sav
train_data_path=./train_feat.npy
train_label_path=./train_label.csv
train_val_split=0.1  # 1/10 of the training data is used for validation

# finetune for docking/screening on one GPU
CUDA_VISIBLE_DEVICES=1 python $finetuning_python_script --hidden_dropout_prob 0.1 --attention_probs_dropout_prob 0.1 --num_train_epochs 100 --max_steps $ms --per_device_train_batch_size $bs --base_learning_rate $lr --output_dir $model_output_dir --model_name_or_path $pretrained_model_dir --scaler_path $scaler_path --train_data $train_data_path --train_label $train_label_path --validation_data None --validation_label None --train_val_split $train_val_split --pooler_type cls_token --random_seed 1234 --seed 1234 --specify_loss_fct 'huber';

Extract the Latent Features

# replace with the proper paths
model_path=./pretrained_model
feature_path=./topo_feature.npy  # contains the topological feature array, not the dict keyed by protein ID
scaler_path=./code_pkg/pretrain_data_standard_minmax_6channel_filtration50-12.sav
save_feature_path=./latent_feature.npy
latent_python_script=./code_pkg/final_generate_latent_features.py

python $latent_python_script --model_path $model_path --scaler_path $scaler_path --feature_path $feature_path --save_feature_path $save_feature_path --latent_type encoder_pretrain
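Since this script expects the array form of the features, here is a minimal sketch for producing it from the dict-form TopoFeature file. It assumes numpy and pandas, and the "pdbid" column name in the label .csv is a hypothetical placeholder, not a confirmed repository convention.

import numpy as np
import pandas as pd

# Load the dict form (protein ID -> features) and a label file that fixes the order.
feat_dict = np.load("TopoFeature_small.npy", allow_pickle=True).item()
ids = pd.read_csv("CASF2016_core_test_label.csv")["pdbid"]  # hypothetical column name

# Stack the per-complex features in label order into a single array.
feat_array = np.stack([feat_dict[i] for i in ids])
np.save("topo_feature.npy", feat_array)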

Results

Pretrained models

  • Pretrained TopoFormer model large. Download
  • Pretrained TopoFormer model small. Download

Finetuned models and performances

  • Scoring
| Finetuned for scoring | Training Set | Test Set | PCC | RMSE (kcal/mol) |
|-----------------------|--------------|----------|-----|-----------------|
| CASF-2007 result | 1105 | 195 | 0.837 | 1.807 |
| CASF-2007 small result | 1105 | 195 | 0.839 | 1.807 |
| CASF-2013 result | 2764 | 195 | 0.816 | 1.859 |
| CASF-2016 result | 3772 | 285 | 0.864 | 1.568 |
| PDB v2016 result | 3767 | 290 | 0.866 | 1.561 |
| PDB v2020 result | 18904 (excluding core sets) | 195 (CASF-2007 core set) | 0.853 | 1.295 |
| PDB v2020 result | 18904 (excluding core sets) | 195 (CASF-2013 core set) | 0.832 | 1.301 |
| PDB v2020 result | 18904 (excluding core sets) | 285 (CASF-2016 core set) | 0.881 | 1.095 |

Note: for each dataset, 20 TopoFormers were trained with distinct random seeds to address initialization-related errors, and 20 gradient boosting regression tree (GBRT) models were subsequently trained on the resulting sequence-based features; their predictions can be found in the results folder. Then 10 models were randomly selected from the TopoFormer models and 10 from the GBRT models, and the consensus prediction of these models was used as the final prediction. The performance shown in the table is the average over 400 repetitions of this selection process (see the sketch below).
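The following is a minimal sketch of this consensus protocol, not the repository's evaluation code. It assumes the per-model predictions have already been loaded as numpy arrays of shape (20, n_complexes) and that predictions and labels share the same units.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def consensus_metrics(topoformer_preds, gbrt_preds, y_true, n_repeats=400):
    """Average PCC/RMSE over repeated random 10-of-20 consensus ensembles."""
    pccs, rmses = [], []
    for _ in range(n_repeats):
        # Randomly select 10 of the 20 models from each family.
        t = topoformer_preds[rng.choice(20, size=10, replace=False)]
        g = gbrt_preds[rng.choice(20, size=10, replace=False)]
        # Consensus prediction: mean over the 20 selected models.
        pred = np.concatenate([t, g]).mean(axis=0)
        pccs.append(pearsonr(pred, y_true)[0])
        rmses.append(np.sqrt(np.mean((pred - y_true) ** 2)))
    # Reported performance is the average over all repeats.
    return np.mean(pccs), np.mean(rmses)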

  • Docking

| Finetuned for docking | Success rate |
|-----------------------|--------------|
| CASF-2007 result | 93.3% |
| CASF-2013 result | 91.3% |

  • Screening

| Finetuned for screening | Success rate on 1% | Success rate on 5% | Success rate on 10% | EF on 1% | EF on 5% | EF on 10% |
|-------------------------|--------------------|--------------------|---------------------|----------|----------|-----------|
| CASF-2013 | 68% | 81.5% | 87.8% | 29.6 | 9.7 | 5.6 |

Note: EF denotes the enhancement factor (see the sketch below). Each target protein has its own finetuned model; result contains all predictions.
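As a reference for the EF columns above, here is a minimal sketch of the standard enrichment-style definition. It assumes numpy arrays of docking scores (higher = better) and binary activity labels; it is illustrative, not the repository's scoring code.

import numpy as np

def enhancement_factor(scores, is_active, top_fraction=0.01):
    """Hit rate among the top-scored fraction divided by the overall hit rate."""
    n_top = max(1, int(len(scores) * top_fraction))
    top_idx = np.argsort(scores)[::-1][:n_top]  # indices of the highest scores
    return np.mean(is_active[top_idx]) / np.mean(is_active)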


License

This project is licensed under the MIT License - see the LICENSE file for details.


Citation

If you use this code or the pre-trained models in your work, please cite our work:

  • Chen, Dong, Jian Liu, and Guo-Wei Wei. "Multiscale topology-enabled structure-to-sequence transformer for protein–ligand interaction predictions." Nature Machine Intelligence (2024): 1-12. Read
# BibTeX
@article{chen2024multiscale,
  title={Multiscale topology-enabled structure-to-sequence transformer for protein--ligand interaction predictions},
  author={Chen, Dong and Liu, Jian and Wei, Guo-Wei},
  journal={Nature Machine Intelligence},
  pages={1--12},
  year={2024},
  publisher={Nature Publishing Group UK London}
}

Acknowledgements

This project has benefited from the use of the Transformers library. Portions of the code in this project have been modified from the original code found in the Transformers repository.


Contributors

TopoFormer was developed by Dong Chen and is maintained by WeiLab at MSU Math.