Multi-Scale Protein Sequence-Structure Contrasting for Compound-Protein Interaction Prediction (PSC-CPI)
This is a PyTorch implementation of Protein Sequence-Structure Contrasting for CPI Prediction (PSC-CPI). The code includes the following modules:
- Dataset loader (train/val/test)
- Four evaluation settings: Seen-Both, Unseen-Compound, Unseen-Protein, and Unseen-Both (see the split sketch after this list)
- Four evaluation metrics: CPI Pattern Prediction (AUPRC and AUROC) and CPI Strength Prediction (RMSE and PCC)
- Pre-training, fine-tuning, and inference paradigm
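For intuition, the four evaluation settings split test pairs according to whether the compound and/or the protein appeared in the training set. The sketch below is a simplified illustration that assumes each sample is identified by a (compound_id, protein_id) pair; the repository's actual splits come with the dataset and are loaded by data_loader() in dataloader.py.

```python
# Simplified illustration of the four evaluation settings; assumes each
# sample is identified by a (compound_id, protein_id) pair. The actual
# splits shipped with the dataset may be constructed differently.
def assign_setting(test_pairs, train_pairs):
    train_compounds = {c for c, _ in train_pairs}
    train_proteins = {p for _, p in train_pairs}
    settings = {"Seen-Both": [], "Unseen-Compound": [], "Unseen-Protein": [], "Unseen-Both": []}
    for c, p in test_pairs:
        seen_c, seen_p = c in train_compounds, p in train_proteins
        if seen_c and seen_p:
            settings["Seen-Both"].append((c, p))
        elif not seen_c and seen_p:
            settings["Unseen-Compound"].append((c, p))
        elif seen_c and not seen_p:
            settings["Unseen-Protein"].append((c, p))
        else:
            settings["Unseen-Both"].append((c, p))
    return settings
```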
- numpy==1.21.6
- scipy==1.7.3
- torch==1.6.0
- scikit-learn==1.0.2
The datasets used in this paper are available at:
https://drive.google.com/file/d/1_iZ8B1JZkCKmKlQNewOCr3kbnWfAIc-r/view?usp=sharing
- train.py
  - Pre-training, fine-tuning, and inference
- models.py
  - ProteinEmbed_Model() -- Learning protein sequence and structure representations
  - prot_data_aug() -- Data augmentation on proteins
  - loss_inter() -- Loss for CPI pattern prediction
  - loss_affn() -- Loss for CPI strength prediction
  - loss_contras() -- Loss for (pre-training) multi-scale contrastive learning (see the illustrative sketch after this file list)
- dataloader.py
  - data_loader() -- Load train, val, and test data (with the four evaluation data splits)
- utils.py
  - set_seed() -- Set random seeds for reproducible results
  - cal_affinity_torch() -- Use PyTorch to calculate the CPI affinity metrics (RMSE and PCC)
  - cal_interaction_torch() -- Use PyTorch to calculate the CPI pattern metrics (AUPRC and AUROC); see the metric sketch after this file list
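As a rough illustration of what the (pre-training) sequence-structure contrastive objective computed by loss_contras() could look like, the sketch below implements an InfoNCE-style loss between paired sequence and structure embeddings. It is a simplified stand-in, not the repository's exact multi-scale formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(seq_emb, str_emb, temperature=0.1):
    """InfoNCE-style contrastive loss between paired sequence/structure
    embeddings of shape (batch, dim). Matched pairs are positives and all
    other in-batch pairs are negatives. Illustrative only -- see
    loss_contras() in models.py for the actual multi-scale objective."""
    seq_emb = F.normalize(seq_emb, dim=-1)
    str_emb = F.normalize(str_emb, dim=-1)
    logits = seq_emb @ str_emb.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(seq_emb.size(0), device=seq_emb.device)
    # Symmetric cross-entropy: sequence-to-structure and structure-to-sequence
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```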
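For reference, the two metric pairs can also be computed with the scikit-learn and SciPy packages listed above. The snippet below is an equivalent NumPy-side sketch, not the PyTorch implementation used in cal_affinity_torch() and cal_interaction_torch().

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import average_precision_score, roc_auc_score

def pattern_metrics(y_true, y_score):
    # CPI pattern prediction: binary interaction labels vs. predicted scores
    # (average precision approximates AUPRC)
    return average_precision_score(y_true, y_score), roc_auc_score(y_true, y_score)

def strength_metrics(y_true, y_pred):
    # CPI strength prediction: measured affinities vs. predicted values
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    pcc = float(pearsonr(y_true, y_pred)[0])
    return rmse, pcc
```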
- Install the required dependency packages (see the example command below)
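For example, assuming the sklearn requirement above refers to the scikit-learn package on PyPI:
pip install numpy==1.21.6 scipy==1.7.3 torch==1.6.0 scikit-learn==1.0.2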
- To pre-train and fine-tune the model, please run train.py with proper hyperparameters:
python train.py --task_mode 0 --modality seq_str_linear --pre-train 1 --seq_encoder HRNN --str_encoder GAT
where
- task_mode selects one of the two CPI tasks: 0 (Strength Prediction) or 1 (Pattern Prediction);
- modality selects one of the three inference settings: 'seq_str_linear' (both modalities are provided), 'sequence' (only sequence is provided), or 'structure' (only structure is provided);
- pre-train denotes whether pre-training is conducted: 0 (w/o pre-training) or 1 (w/ pre-training);
- seq_encoder is one of the four protein sequence encoders: HRNN, LSTM, bi-LSTM, or Transformer;
- str_encoder is one of the three protein structure encoders: GCN, GAT, or SAGE.
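A few more example invocations composed from the options above (the flag values are copied verbatim from this list; check train.py for the exact strings it accepts):
python train.py --task_mode 1 --modality sequence --pre-train 1 --seq_encoder Transformer --str_encoder GCN
python train.py --task_mode 0 --modality structure --pre-train 0 --seq_encoder LSTM --str_encoder SAGE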
If you find this project useful for your research, please cite it using the following BibTeX entry.
@article{wu2024psc,
title={{PSC-CPI}: Multi-scale protein sequence-structure contrasting for efficient and generalizable compound-protein interaction prediction},
author={Wu, Lirong and Huang, Yufei and Tan, Cheng and Gao, Zhangyang and Hu, Bozhen and Lin, Haitao and Liu, Zicheng and Li, Stan Z},
journal={arXiv preprint arXiv:2402.08198},
year={2024}
}