A repo for "HiSIF-DTA: A Hierarchical Semantic Information Fusion Framework for Drug-Target Affinity Prediction".
Exploring appropriate protein representation methods and improving the abundance of protein information are critical steps toward more accurate DTA prediction. Recently, numerous deep learning-based models have been proposed that exploit the sequential or structural features of target proteins.
However, these models capture only the low-order semantics present in a single protein, while the high-order semantics abundant in biological networks are largely ignored. In this article, we propose HiSIF-DTA, a hierarchical semantic information fusion framework for DTA prediction.
In this framework, a hierarchical protein graph is constructed that includes not only the contact map as low-order structural semantics but also the protein-protein interaction (PPI) network as high-order functional semantics. In particular, two distinct hierarchical fusion strategies (i.e., Top-Down and Bottom-Up) are designed to integrate the different protein semantics, thereby contributing to a richer protein representation. Comprehensive experimental results demonstrate that HiSIF-DTA outperforms current state-of-the-art methods on the benchmark datasets of the DTA task.
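To make the two fusion directions more concrete, below is a minimal, conceptual PyTorch Geometric sketch of the Top-Down idea only: a protein embedding learned on the PPI graph (high-order semantics) is broadcast down to that protein's residue-level contact-map graph (low-order semantics) before pooling. This is an illustrative, assumption-laden sketch and not the authors' implementation; the module names, shapes, and broadcasting scheme are placeholders, and the real TDNet/BUNet models live in `models/HGCN.py`.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool


class TopDownFusionSketch(nn.Module):
    """Conceptual Top-Down fusion sketch (NOT the HiSIF-DTA implementation)."""

    def __init__(self, ppi_dim, res_dim, hidden=128):
        super().__init__()
        self.ppi_gcn = GCNConv(ppi_dim, hidden)            # high-order semantics (PPI network)
        self.res_gcn = GCNConv(res_dim + hidden, hidden)   # low-order semantics (contact map)
        self.out = nn.Linear(hidden, hidden)

    def forward(self, ppi_x, ppi_edge_index, res_x, res_edge_index, res_batch, prot_idx):
        # 1) Encode every protein as a node of the PPI graph.
        ppi_h = torch.relu(self.ppi_gcn(ppi_x, ppi_edge_index))
        # 2) Broadcast each target protein's PPI embedding to all of its residues (top-down).
        res_x = torch.cat([res_x, ppi_h[prot_idx][res_batch]], dim=1)
        # 3) Refine residue features on the contact-map graph and pool to a protein vector.
        res_h = torch.relu(self.res_gcn(res_x, res_edge_index))
        return self.out(global_mean_pool(res_h, res_batch))
```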
Download the GitHub repo of this project onto your local server:
```shell
git clone https://github.com/bixiangpeng/HiSIF-DTA
```
Create and activate the virtual environment:
```shell
conda create -n HiSIF python=3.7
conda activate HiSIF
```
Install the specified version of PyTorch:
```shell
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
```
Install other Python packages:
```shell
pip install -r requirements.txt \
  && pip install torch-scatter==2.0.6 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html \
  && pip install torch-sparse==0.6.9 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html \
  && pip install torch-spline-conv==1.2.1 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html
```
💡 Note that the operating system we used is Ubuntu 22.04 and the version of Anaconda is 23.3.1.
We also provide a Dockerfile to build the environment; please refer to the Dockerfile for more details. Make sure you have Docker installed locally, and simply run the following commands:
```shell
# Build the Docker image
sudo docker build --build-arg env_name=HiSIF -t hisif-image:v1 .
# Create and start the Docker container
sudo docker run --name hisif-con --gpus all -it hisif-image:v1 /bin/bash
# Check whether the environment deployment is successful
conda list
```
```text
HiSIF-DTA
├── baselines                  - Baseline models directory. All the baseline models we re-trained can be found in this directory.
├── data                       - Data directory. Detailed information can be found in the next section.
├── models
│   ├── HGCN.py                - Original model file, which includes both the Top-Down (TDNet) and Bottom-Up (BUNet) semantic fusion models.
│   ├── HGCN_for_CPI.py        - A model modified for datasets (Human) with large numbers of proteins.
│   └── HGCN_for_Ablation.py   - Three ablation variants we used in this study.
├── results                    - The result directory storing the experimental results and pre-trained models.
│   └── davis / kiba / Human
│       ├── pretrained_BUNet.csv    - A CSV file recording the optimal prediction results of BUNet on davis/kiba/Human.
│       ├── pretrained_BUNet.model  - A file recording the optimal model parameters of BUNet on davis/kiba/Human.
│       ├── pretrained_TDNet.csv
│       └── pretrained_TDNet.model
├── generate_contact_map.py    - A Python script used to generate contact maps from PDB files.
├── create_data.py             - A Python script used to convert the original data into the input data the model needs.
├── utils.py                   - A Python script recording the various tools needed for training.
├── training_for_DTA.py        - A Python script used to train the model on a DTA dataset (davis or kiba).
├── training_for_CPI.py        - A Python script used to train the model on the CPI dataset (Human).
├── test_for_DTA.py            - A Python script that reproduces the DTA prediction results using the pre-trained models.
├── test_for_CPI.py            - A Python script that reproduces the CPI prediction results using the pre-trained models.
├── test_for_Ablation.py       - A Python script that reproduces the ablation results using the pre-trained models.
├── grad_pre.py                - A Python script using backpropagation gradients to predict protein binding pockets.
├── requirements.txt           - A txt file recording the Python packages the model depends on.
├── Dockerfile                 - A file used to build the environment image via Docker.
└── experimental_results.ipynb - A notebook presenting the prediction results of our models and the baseline models.
```
Three benchmark datasets were adopted in this project, including two DTA datasets (Davis and KIBA) and a CPI dataset (Human).
Download processed data
The data file (`data.zip`) of these three datasets can be downloaded from this link. Uncompress this file to get a `data` folder containing all the original data and processed data.

🌳 Replace the original `data` folder with this new folder, and then you can re-train or test our proposed model on Davis, KIBA, or Human.
🌳 For clarity, the file architecture of the `data` directory is described as follows:

```text
data
├── davis / kiba                      - DTA dataset directory.
│   ├── ligands_can.txt               - A txt file recording ligand information (Original).
│   ├── proteins.txt                  - A txt file recording protein information (Original).
│   ├── Y                             - A file recording binding affinity scores (Original).
│   ├── folds
│   │   ├── test_fold_setting1.txt    - A txt file recording the test set entries (Original).
│   │   └── train_fold_setting1.txt   - A txt file recording the training set entries (Original).
│   ├── (davis/kiba)_dict.txt         - A txt file recording the corresponding UniProt ID for every protein in the dataset (Processed).
│   ├── contact_map
│   │   └── (Uniprot ID).npy          - An npy file recording the corresponding contact map for every protein in the dataset (Processed).
│   ├── PPI
│   │   └── ppi_data.pkl              - A pkl file recording the related PPI network data, including the adjacency matrix (dense),
│   │                                   the feature matrix, and the protein indices in the PPI network (Processed).
│   ├── train.csv                     - Training set data in CSV format (Processed).
│   ├── test.csv                      - Test set data in CSV format (Processed).
│   ├── mol_data.pkl                  - A pkl file recording drug graph data for all drugs in the dataset (Processed).
│   └── pro_data.pkl                  - A pkl file recording protein graph data for all proteins in the dataset (Processed).
└── Human                             - CPI dataset directory.
    ├── Human.txt                     - A txt file recording the information of interacting drugs and proteins (Original).
    ├── contact_map
    │   └── (XXXXX).npy
    ├── PPI
    │   └── ppi_data.pkl
    ├── Human_dict.txt
    ├── train(fold).csv               - 5-fold training set data in CSV format (Processed).
    ├── test(fold).csv                - 5-fold test set data in CSV format (Processed).
    ├── mol_data.pkl
    └── pro_data.pkl
```
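If you want to inspect the processed files, a minimal sketch such as the following can be used. Note that the exact keys and object layouts inside the pickle files are assumptions here, so print the loaded objects to confirm their structure.

```python
import pickle
import numpy as np

# Paths follow the directory layout above (adjust the dataset name as needed).
with open("data/davis/mol_data.pkl", "rb") as f:
    mol_data = pickle.load(f)      # drug graph data (structure assumed; inspect it)
with open("data/davis/pro_data.pkl", "rb") as f:
    pro_data = pickle.load(f)      # protein graph data (structure assumed; inspect it)
with open("data/davis/PPI/ppi_data.pkl", "rb") as f:
    ppi_data = pickle.load(f)      # dense adjacency matrix, feature matrix, protein indices

contact_map = np.load("data/davis/contact_map/Q2M2I8.npy")  # example UniProt ID from this README
print(type(mol_data), type(pro_data), type(ppi_data), contact_map.shape)
```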
Customize your data
You might want to test the model on more DTA or CPI datasets. If this is the case, please add your data to the `data` folder and process it into the input format our model needs. We provide a detailed processing script, `create_data.py`, for converting original data into model inputs. The processing steps are as follows:

- Split the raw dataset into training and test sets and convert them into CSV format (i.e., `train.csv` and `test.csv`). The content of the CSV files can be organized as follows (a loading sketch follows this step):

  | compound_iso_smiles | target_sequence | affinity |
  | --- | --- | --- |
  | C#Cc1cccc(Nc2ncnc3cc(OCCOC)c(OCCOC)cc23)c1 | MAAVILESIFLKRSQQKKKTSPLNFKKRLFLLTVHKLSYYEYDFERGRRGSKKGSIDVEKITCVETVVPEKNPPPERQIPRRGEESSEMEQISIIERFPYPFQVVYDEGP | 5.568636236 |
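  As a quick sanity check, the processed CSV files can be loaded as follows. This is a minimal sketch assuming pandas and the column names shown above; the dataset path is a placeholder.

  ```python
  import pandas as pd

  # Column names follow the example above; the path assumes a hypothetical dataset name.
  train = pd.read_csv("data/your_dataset_name/train.csv")
  test = pd.read_csv("data/your_dataset_name/test.csv")

  expected = {"compound_iso_smiles", "target_sequence", "affinity"}
  assert expected.issubset(train.columns) and expected.issubset(test.columns)
  print(len(train), len(test), train["affinity"].describe())
  ```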
- Collect the UniProt IDs of all proteins in the dataset from the UniProt database (https://www.uniprot.org/) and record them in a txt file, such as `davis_dict.txt`:

  ```text
  >MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILM... Q2M2I8
  >PFWKILNPLLERGTYYYFMGQQPGKVLGDQRRPSLPALHFIKGAGKKESSRHGGPHCNVFVEHEALQRPVASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSITKGEKLR... P00519
  ```
- Download the corresponding protein structure files from the PDB (https://www.rcsb.org/) or AlphaFold2 (https://alphafold.com/) database according to the UniProt IDs. Then you can get the contact map files by running the following script (a conceptual sketch of the underlying computation is shown below):

  ```shell
  python generate_contact_map.py --input_path '...data/your_dataset_name/your_pdb_dir/' --output_path '...data/your_dataset_name/your_contact_map_dir/' --chain_id 'A'
  ```
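  A contact map of the kind produced above is typically a binary residue-residue matrix derived from Cα-Cα distances. Below is a minimal illustrative sketch of that computation, assuming Biopython is available and an 8 Å threshold; it is not necessarily identical to what `generate_contact_map.py` does.

  ```python
  import numpy as np
  from Bio.PDB import PDBParser  # assumption: Biopython is installed

  def contact_map_from_pdb(pdb_path, chain_id="A", threshold=8.0):
      """Binary contact map from Cα-Cα distances (illustrative sketch only)."""
      structure = PDBParser(QUIET=True).get_structure("protein", pdb_path)
      chain = structure[0][chain_id]
      # Collect Cα coordinates of residues that actually have a Cα atom.
      coords = np.array([res["CA"].coord for res in chain if "CA" in res])
      # Pairwise Euclidean distances, thresholded into a 0/1 matrix.
      dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
      return (dist < threshold).astype(np.int8)

  # Hypothetical usage:
  # np.save("your_contact_map_dir/Q2M2I8.npy", contact_map_from_pdb("your_pdb_dir/Q2M2I8.pdb"))
  ```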
- Construct the graph data for drugs and proteins. Assuming that you already have the files from the previous steps (1-3) in your `data/your_dataset_name/` folder, you can simply run the following script:

  ```shell
  python create_data.py --path '..data/' --dataset 'your_dataset_name' --output_path '..data/'
  ```
- Finally, upload the UniProt IDs of all proteins in your dataset to the STRING database (https://string-db.org/) to obtain the PPI network data; the feature descriptors of the proteins in the PPI network that we used are available from InterPro (https://www.ebi.ac.uk/interpro/). A sketch of how such data could be assembled is shown after these steps.
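As an illustration only, the following sketch assembles a STRING edge list and a per-protein feature matrix into a dense adjacency matrix and pickles them in the spirit of `ppi_data.pkl`. The file names, column layout, and the exact object structure expected by the model are assumptions; please compare against the provided `ppi_data.pkl` before using it.

```python
import pickle
import numpy as np
import pandas as pd

# Assumed inputs: a STRING export with columns protein1 / protein2, and a CSV of
# per-protein feature descriptors (e.g., derived from InterPro annotations), indexed by UniProt ID.
edges = pd.read_csv("string_interactions.tsv", sep="\t")
features = pd.read_csv("protein_features.csv", index_col=0)

proteins = list(features.index)
index_of = {pid: i for i, pid in enumerate(proteins)}

# Dense, symmetric adjacency matrix over the proteins of the dataset.
adj = np.zeros((len(proteins), len(proteins)), dtype=np.float32)
for p1, p2 in zip(edges["protein1"], edges["protein2"]):
    if p1 in index_of and p2 in index_of:
        adj[index_of[p1], index_of[p2]] = 1.0
        adj[index_of[p2], index_of[p1]] = 1.0

ppi_data = {
    "adj": adj,                                        # dense adjacency matrix
    "features": features.values.astype(np.float32),    # feature matrix
    "protein_index": index_of,                         # protein -> PPI node index
}
with open("data/your_dataset_name/PPI/ppi_data.pkl", "wb") as f:
    pickle.dump(ppi_data, f)
```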
💡 Note that the above is just a description of the general steps, and you may need to make some modifications to the original scripts for different datasets.

💡 We have therefore provided detailed comments on the functionality of each function in the scripts, hoping they will be helpful.
After processing the data, you can retrain the model from scratch with the following command:
```shell
python training_for_DTA.py --model TDNet --epochs 2000 --batch 512 --LR 0.0005 --log_interval 20 --device 0 --dataset davis --num_workers 6
```
or
```shell
python training_for_CPI.py --model BUNet --epochs 2000 --batch 512 --LR 0.0005 --log_interval 20 --device 0 --dataset Human --num_workers 6
```
Here is a detailed introduction of the optional parameters when running `training_for_DTA.py` / `training_for_CPI.py`:
- `--model`: The model name, specifying the backbone to be used. There are two options, BUNet and TDNet.
- `--epochs`: The number of epochs, i.e., the number of iterations for training the model on the entire dataset.
- `--batch`: The batch size, specifying the number of samples in each training batch.
- `--LR`: The learning rate, controlling the rate at which model parameters are updated.
- `--log_interval`: The log interval, specifying the interval for printing logs during training.
- `--device`: The device, specifying the GPU device number used for training.
- `--dataset`: The dataset name, specifying the dataset used for model training.
- `--num_workers`: An optional value of the DataLoader; when greater than 0, it enables multiprocess data loading.
🌳 We provide an additional training file (`training_for_CPI.py`) specifically for conducting five-fold cross-validation training on the Human dataset.

🌳 Additionally, due to the larger number of proteins in the Human dataset, we have modified the original architecture to reduce the memory requirements. For detailed changes, please refer to the file `HGCN_for_CPI.py`.
If you don't want to re-train the model, we provide pre-trained model parameters as shown below.

| Datasets | Pre-trained models | Description |
| --- | --- | --- |
| Davis | BUNet, TDNet | The pre-trained model parameters on the Davis dataset. |
| KIBA | BUNet, TDNet | The pre-trained model parameters on the KIBA dataset. |
| Human | BUNet, TDNet | The pre-trained model parameters on the Human five-fold dataset. |

Based on these pre-trained models, you can perform DTA predictions by simply running the following command:
```shell
python test_for_DTA.py --model TDNet --dataset davis
```
or
```shell
python test_for_CPI.py --model BUNet --dataset Human
```
💡 Note that before making predictions, in addition to placing the pre-trained model parameter files in the correct location, you also need to place the required data files mentioned in the previous section in the appropriate locations.
We have designed a protein semantic information fusion framework based on the concept of a hierarchical graph to enrich protein representations. Meanwhile, we propose two different strategies for semantic information fusion (Top-Down and Bottom-Up) and evaluate their performance on different datasets, as follows:
Performance on the Davis dataset

| Backbone | MSE | CI |
| --- | --- | --- |
| TDNet (Top-Down) | 0.193 | 0.907 |
| BUNet (Bottom-Up) | 0.191 | 0.906 |
Performance on the KIBA dataset

| Backbone | MSE | CI |
| --- | --- | --- |
| TDNet (Top-Down) | 0.120 | 0.904 |
| BUNet (Bottom-Up) | 0.121 | 0.904 |
Performance on the Human dataset

| Backbone | AUROC | Precision | Recall |
| --- | --- | --- | --- |
| TDNet (Top-Down) | 0.988 | 0.945 | 0.952 |
| BUNet (Bottom-Up) | 0.986 | 0.947 | 0.947 |
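For reference, the MSE and CI values above can be recomputed from a predictions file with a short script like the one below. This is a minimal sketch: the column names (`label`, `prediction`) are assumptions about the result CSVs, so adjust them to match the actual files (or simply use `experimental_results.ipynb`).

```python
import numpy as np
import pandas as pd

def mse(y, p):
    """Mean squared error between true affinities y and predictions p."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean((y - p) ** 2))

def concordance_index(y, p):
    """CI: fraction of comparable pairs whose ordering is preserved (ties count 0.5)."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    num, den = 0.0, 0.0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue                  # pair with equal labels is not comparable
            den += 1.0
            diff = (p[i] - p[j]) * (y[i] - y[j])
            num += 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)
    return num / den if den > 0 else 0.0

# Hypothetical usage with one of the result files described in this repo:
df = pd.read_csv("results/davis/pretrained_TDNet.csv")
print(mse(df["label"], df["prediction"]), concordance_index(df["label"], df["prediction"]))
```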
🌳 The performance of the baseline models can be found in `experimental_results.ipynb` or in the `baselines` directory.
To facilitate the reproducibility of our experimental results, we provide a Docker image-based solution that allows our experimental results to be reproduced on multiple datasets with a single command. You can easily try this out with the following command:
```shell
sudo docker run --name hisif-con --gpus all --shm-size=2g -v /your/local/path/HiSIF-DTA/:/media/HiSIF-DTA -it hisif-image:v1

# docker run      : Create and start a new container based on the specified image.
# --name          : Specifies the name ("hisif-con") for the container being created. You can use this name to reference and manage the container later.
# --gpus          : Enables GPU support within the container and assigns all available GPUs to it, allowing the container to use GPU resources for computation.
# -v              : Maps local files into the container, in the format `-v /your/local/path/HiSIF-DTA:/mapped/container/path/HiSIF-DTA`.
# -it             : Allocates a pseudo-TTY and enables interactive mode, allowing the user to interact with the container's command line.
# hisif-image:v1  : The Docker image built from the Dockerfile. For detailed build instructions, please refer to the `Requirements` section.
```
💡 Please note that the above one-click run is only applicable to the inference process and requires you to place all the necessary processed data and pre-trained models in the correct locations on your local machine beforehand. If you want to train the model in the created Docker container, please follow the instructions below:
```shell
sudo docker run --name hisif-con --gpus all --shm-size=16g -v /your/local/path/HiSIF-DTA/:/media/HiSIF-DTA -it hisif-image:v1 /bin/bash
cd /media/HiSIF-DTA
python training_for_DTA.py --dataset davis --model TDNet
```
To demonstrate the superiority of the proposed model, we conduct experiments to compare our approach with the following state-of-the-art (SOTA) models:
DTA:
- DeepDTA : Repo Link
- AttentionDTA : Repo Link
- GraphDTA : Repo Link
- MGraphDTA : Repo Link
- DGraphDTA : Repo Link
CPI:
🌳 The above links point to the GitHub repositories of the baseline models. To ensure a fair comparison, we re-trained these baseline models with the same experimental setup as our proposed model. The detailed re-training code and results can be found in the `baselines` directory.
To ensure the transparency of the experimental results, the prediction results of all models (including our proposed model and the baseline models) have been uploaded to Zenodo (Link). Additionally, to present the experimental results more intuitively, we provide a comprehensive Jupyter notebook in our repo (`experimental_results.ipynb`), in which we load all prediction result files, recalculate the experimental metrics, and present them as statistical charts and tables.
We welcome you to contact us (email: bixiangpeng@stu.ouc.edu.cn) with any questions or for cooperation.