Interaction-Based Inductive Bias in Graph Neural Networks: Enhancing Protein-Ligand Binding Affinity Predictions from 3D Structures

This repository contains the source code for structure-based virtual screening (SBVS). For protein-ligand affinity (PLA) predictions, please refer to our dedicated repository at EHIGN_PLA on GitHub.

Dataset

The LIT-PCBA dataset is publicly available at the following locations:

Original LIT-PCBA [1]: LIT-PCBA
Docked Data (with 3D structures for small compounds) [2]: 3D Structures

Preprocessed data (molecular graphs) can be downloaded from Graphs 1 and Graphs 2.

Requirements

The following Python packages are required:
dgl==0.9.0
networkx==2.5
numpy==1.19.2
pandas==1.1.5
pymol==0.1.0
rdkit==2022.3.5
scikit_learn==1.1.2
scipy==1.5.2
torch==1.10.2
tqdm==4.63.0
openbabel==3.1.1 (conda install -c conda-forge openbabel)

Alternatively, install the environment using the provided YAML file at ./environment.yaml.
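
As a quick sanity check before training, the pinned versions above can be compared with what is actually installed. The following is a minimal sketch, not part of the repository: the helper name check_pins is hypothetical, it assumes Python 3.8+ (for importlib.metadata), and it assumes each package registers metadata under the name used in the list. pymol and openbabel are left out because conda installs of them may not expose metadata under those names.

```python
from importlib import metadata

# Pinned versions copied from the requirements list above.
# pymol and openbabel are omitted (assumption: conda builds may not
# register metadata under these names).
PINNED = {
    "dgl": "0.9.0",
    "networkx": "2.5",
    "numpy": "1.19.2",
    "pandas": "1.1.5",
    "rdkit": "2022.3.5",
    "scikit_learn": "1.1.2",
    "scipy": "1.5.2",
    "torch": "1.10.2",
    "tqdm": "4.63.0",
}


def check_pins(pinned, get_version=metadata.version):
    """Return {name: (expected, found)} for missing or mismatched packages."""
    problems = {}
    for name, expected in pinned.items():
        try:
            # Drop a local build suffix such as "+cu113" before comparing.
            found = get_version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed package matches its pin. The get_version parameter exists only so the check can be exercised without touching the real environment.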

Structure and Descriptions

Directories

./config: Parameters used in EHIGN.
./log: Logger.
./model: Contains several trained models for reproducing results.

Files

CIGConv.py, NIGConv.py, EHIGN.py: Implementations of CIGConv, NIGConv, and EHIGN.
HGC.py: Heterogeneous graph neural network implementation (modified from dgl source code).
preprocess_complex.py: Prepare input complexes.
graph_constructor.py: Convert protein-ligand complexes into heterogeneous graphs.
train.py: Train the EHIGN model.
test.py: Use the models in the ./model directory for prediction.

Step-by-step Running

Organize the Data

Download processed data from Graphs 1 and Graphs 2.
Organize the data as follows:

-docking_poses
  -ALDH1_4x4l
    -train
      -ALDH1_4x4l_decoys_22407376-EHIGN.dgl
      ...
    -val
      ...
  -FEN1_5fv7
    ...
  -GBA_2v3e
    ...
  ...
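
Before launching a run, it can help to confirm that the data really follows the layout above. The sketch below is an illustration rather than repository code: summarize_layout is a hypothetical helper, and it assumes that each target directory holds split subdirectories (only train and val are shown above) containing *.dgl files.

```python
from pathlib import Path


def summarize_layout(data_root, splits=("train", "val")):
    """Count *.dgl files per target and split under data_root.

    A count of None marks a split directory that does not exist.
    """
    root = Path(data_root)
    summary = {}
    for target_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts = {}
        for split in splits:
            split_dir = target_dir / split
            if split_dir.is_dir():
                counts[split] = len(list(split_dir.glob("*.dgl")))
            else:
                counts[split] = None
        summary[target_dir.name] = counts
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python check_layout.py your_own_data_path/docking_poses
    for target, counts in summarize_layout(sys.argv[1]).items():
        print(target, counts)
```

A target that reports None or 0 for a split is worth checking by hand before passing the directory to --data_root.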

Reproduce Results

The ./model directory contains seven trained models for reproducing results.

Train the Model

Run python train.py --data_root your_own_data_path/docking_poses

Test the Model

Run python test.py --data_root your_own_data_path/docking_poses
By default, the seven trained models in the ./model directory are used.

Process Raw Data

First, run python preprocess_complex.py --data_root your_own_data_path/docking_poses to prepare the input complexes.
Then, run python graph_constructor.py --data_root your_own_data_path/docking_poses to generate the graphs.
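
The two preprocessing steps must run in order, so a small wrapper can chain them and stop at the first failure. This is a sketch under assumptions: build_commands and run_pipeline are hypothetical names, the scripts are assumed to sit in the current working directory, and only the --data_root flag shown above is used.

```python
import subprocess
import sys

# Script names and the --data_root flag come from the steps above.
PREPROCESS_STEPS = ("preprocess_complex.py", "graph_constructor.py")


def build_commands(data_root, scripts=PREPROCESS_STEPS, python=sys.executable):
    """Return one argument list per script, in execution order."""
    return [[python, script, "--data_root", str(data_root)] for script in scripts]


def run_pipeline(data_root, scripts=PREPROCESS_STEPS):
    """Run each script in turn; check=True raises on the first failing step."""
    for cmd in build_commands(data_root, scripts):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python run_steps.py your_own_data_path/docking_poses
    run_pipeline(sys.argv[1])
```

Passing scripts=("train.py", "test.py") would reuse the same wrapper for the training and testing commands, since they take the same flag.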

Reference

[1] Tran-Nguyen V K, Jacquemard C, Rognan D. LIT-PCBA: an unbiased data set for machine learning and virtual screening. Journal of Chemical Information and Modeling, 2020, 60(9): 4263-4273.
[2] Shen C, Weng G, Zhang X, et al. Accuracy or novelty: what can we gain from target-specific machine-learning-based scoring functions in virtual screening? Briefings in Bioinformatics, 2021, 22(5): bbaa410.