
Improving binding affinity prediction by emphasizing local features of drug and protein


LLF (Leveraging Local Features)

Abstract

Binding affinity prediction is a fundamental task in drug discovery. Although much effort has gone into predicting the binding affinity of a protein sequence paired with a drug, providing valuable insights for developing machine-learning-based predictors, the components or layers of the models proposed in prior work are not designed to preserve comprehensive features of the local structures of a drug and a target protein sequence. In this paper, we propose a deep learning model that concentrates more on the local structures of both a drug and a target protein sequence. To this end, the proposed model employs two modules, R-CNN and R-GCN, which are responsible for extracting comprehensive features from subsequences of a target protein sequence and from subgraphs of a drug, respectively. With multiple streams that have different numbers of layers, both modules not only compute comprehensive features with multiple CNN and GCN layers but also preserve the local features computed by a single layer. Based on an evaluation with two popular datasets, Davis and KIBA, we demonstrate that the proposed model achieves competitive performance on both datasets and that keeping local features can play a significant role in binding affinity prediction.
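As a rough illustration of the multi-stream idea described above, the same input can be run through a single-layer stream and a deeper stream, with both outputs concatenated so the single-layer (local) features survive alongside the deep features. The snippet below is a toy sketch in plain Python with made-up names and a toy 1-D "convolution"; the actual streams are the CNN/GCN layers defined in gcn.py.

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution over a list of numbers (toy stand-in for a CNN layer)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k)) for i in range(len(seq) - k + 1)]

def multi_stream(seq, kernel, depths=(1, 3)):
    """Run the input through streams of different depths and concatenate the results.

    The depth-1 stream preserves local features; the deeper stream covers a
    wider receptive field. (Toy sketch of the idea, not the model itself.)
    """
    outputs = []
    for depth in depths:
        x = seq
        for _ in range(depth):
            x = conv1d(x, kernel)
        outputs.extend(x)
    return outputs

# Shallow features (length 4) followed by deep features (length 2).
features = multi_stream([1.0, 2.0, 3.0, 4.0, 5.0], [0.5, 0.5])
```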

Model Figure


Dataset


Result Table


The components of our model

data_creation.py : converting datasets into PyTorch Geometric format. It is used for data preprocessing in binding affinity prediction tasks.
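To illustrate what "PyTorch Geometric format" means here: a molecular graph is stored as node features plus an edge_index that lists both directions of every bond. The sketch below builds that layout in plain Python from a hypothetical bond list; data_creation.py constructs the real graphs from SMILES strings.

```python
def bonds_to_edge_index(bonds):
    """Convert undirected bonds (pairs of atom indices) into the 2 x 2E
    edge_index layout PyTorch Geometric expects, listing each bond twice."""
    sources, targets = [], []
    for a, b in bonds:
        sources += [a, b]  # add the bond in both directions
        targets += [b, a]
    return [sources, targets]

# A 3-atom chain, e.g. the carbons of propane (SMILES "CCC"): bonds 0-1 and 1-2.
edge_index = bonds_to_edge_index([(0, 1), (1, 2)])
```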

emetrics.py : calculating various evaluation metrics such as Concordance Index, Mean Squared Error, R-squared, Pearson Correlation, and Area Under the Precision-Recall Curve (AUPR).
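For reference, two of these metrics can be sketched in plain Python. This is a simplified illustration of the standard definitions; the implementations in emetrics.py may differ in tie handling or vectorization.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs (different true affinities) whose
    predictions are ordered the same way; tied predictions count 0.5."""
    concordant, comparable = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # equal true affinities: pair is not comparable
            comparable += 1
            same_order = (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j])
            if same_order > 0:
                concordant += 1.0
            elif same_order == 0:
                concordant += 0.5
    return concordant / comparable
```

A concordance index of 1.0 means the predictions rank every comparable pair of affinities correctly; 0.5 corresponds to random ordering.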

gcn.py : consisting of graph convolutional layers for processing molecular graphs (SMILES), convolutional layers for processing protein sequences, and fully connected layers for combining the features from both branches and making predictions.

training.py : training and testing a model on a given dataset using PyTorch.

utils.py : preprocessing the input data (SMILES strings, target sequences, and affinities) into a format suitable for model training.
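The kind of sequence preprocessing done here can be illustrated with a label-encoding sketch for protein sequences: each amino acid becomes an integer and every sequence is cut or padded to a fixed length. The alphabet, padding value, and max length below are illustrative assumptions; the actual encoding in utils.py may differ.

```python
# Map each amino-acid character to an integer label; 0 is reserved for padding
# and for characters outside the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHAR_TO_LABEL = {ch: i + 1 for i, ch in enumerate(AMINO_ACIDS)}

def encode_sequence(seq, max_len=8):
    """Turn a protein sequence into a fixed-length list of integer labels,
    truncating long sequences and zero-padding short ones."""
    labels = [CHAR_TO_LABEL.get(ch, 0) for ch in seq[:max_len]]
    return labels + [0] * (max_len - len(labels))

encoded = encode_sequence("MKVL")
```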

How to use our codes

All the requirements are listed in the requirements.txt file.

Step 1: Download the file that matches the dataset you want to use from the Dataset download links.

Step 2: Use the data_creation.py file for data preprocessing.

Step 3: Use the training.py file to train the model on the preprocessed data.

Step 4: You can check the model performance at a specific epoch using the score functions defined in emetrics.py.

Dataset download links

Dataset       Download link
davis_train   Link
davis_test    Link
kiba_train    Link
kiba_test     Link

You can download each dataset (.csv) from Google Drive via the links above.