EHIGN_SBVS

Primary language: Python. License: MIT.

Interaction-Based Inductive Bias in Graph Neural Networks: Enhancing Protein-Ligand Binding Affinity Predictions from 3D Structures

This repository contains the source code for structure-based virtual screening (SBVS). For protein-ligand affinity (PLA) predictions, please refer to our dedicated repository at EHIGN_PLA on GitHub.

Dataset

The LIT-PCBA dataset is publicly available at the following locations:

  • Original LIT-PCBA [1]: LIT-PCBA
  • Docked Data (with 3D structures for small compounds) [2]: 3D Structures

Preprocessed data (molecular graphs) can be downloaded from Graphs 1 and Graphs 2.

Requirements

The following Python packages are required:
dgl==0.9.0
networkx==2.5
numpy==1.19.2
pandas==1.1.5
pymol==0.1.0
rdkit==2022.3.5
scikit_learn==1.1.2
scipy==1.5.2
torch==1.10.2
tqdm==4.63.0
openbabel==3.3.1 (conda install -c conda-forge openbabel)

Alternatively, install the environment using the provided YAML file at ./environment.yaml.
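
To confirm that an existing environment matches the pinned versions, a small check along the following lines may help. This is only a sketch: the helper name check_pins is hypothetical, and the distribution names are assumed to match the entries listed above (scikit_learn is looked up as scikit-learn; conda packages such as openbabel or rdkit may be registered under different names in your environment).

```python
from importlib import metadata

# Pins copied from the requirements list above.
# Assumption: each name is also the installed distribution name.
PINNED = {
    "dgl": "0.9.0",
    "networkx": "2.5",
    "numpy": "1.19.2",
    "pandas": "1.1.5",
    "pymol": "0.1.0",
    "rdkit": "2022.3.5",
    "scikit-learn": "1.1.2",  # listed as scikit_learn above
    "scipy": "1.5.2",
    "torch": "1.10.2",
    "tqdm": "4.63.0",
    "openbabel": "3.3.1",
}


def check_pins(pins, get_version=metadata.version):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package was found at the expected version.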

Structure and Descriptions

Directories

  • ./config: Parameters used in EHIGN.
  • ./log: Logger.
  • ./model: Contains several trained models for reproducing results.

Files

  • CIGConv.py, NIGConv.py, EHIGN.py: Implementations of CIGConv, NIGConv, and EHIGN.
  • HGC.py: Heterogeneous graph neural network implementation (modified from dgl source code).
  • preprocess_complex.py: Prepare input complexes.
  • graph_constructor.py: Convert protein-ligand complexes into heterogeneous graphs.
  • train.py: Train the EHIGN model.
  • test.py: Run prediction with the trained models in the ./model directory.
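
For readers unfamiliar with heterogeneous graphs, the toy sketch below illustrates the general idea of typed nodes and typed edges for a protein-ligand complex. It is not the code in graph_constructor.py: the function name, the node and edge type names, and the 5.0 Å distance cutoff are all assumptions chosen for illustration. The edge dictionary is keyed by (source type, edge type, destination type) triplets, similar in shape to the data dictionary that DGL heterographs are built from.

```python
from math import dist


def build_hetero_edges(ligand_xyz, pocket_xyz, cutoff=5.0):
    """Toy example: connect ligand and pocket atoms closer than `cutoff`.

    All type names and the cutoff value are hypothetical.
    Returns {(src_type, edge_type, dst_type): (src_ids, dst_ids)}.
    """
    l2p = ("ligand", "near", "pocket")
    p2l = ("pocket", "near", "ligand")
    edges = {l2p: ([], []), p2l: ([], [])}
    for i, lig_atom in enumerate(ligand_xyz):
        for j, poc_atom in enumerate(pocket_xyz):
            if dist(lig_atom, poc_atom) <= cutoff:
                # One directed edge per direction, each with its own type.
                edges[l2p][0].append(i)
                edges[l2p][1].append(j)
                edges[p2l][0].append(j)
                edges[p2l][1].append(i)
    return edges


if __name__ == "__main__":
    ligand = [(0.0, 0.0, 0.0)]
    pocket = [(3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    print(build_hetero_edges(ligand, pocket))
```

Keeping the two directions as separate edge types lets a model learn different message functions for ligand-to-pocket and pocket-to-ligand interactions.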

Step-by-step Running

Organize the Data

Download the processed data from Graphs 1 and Graphs 2, then organize it as follows:

-docking_poses
  -ALDH1_4x4l
     -train
       -ALDH1_4x4l_decoys_22407376-EHIGN.dgl
       ...
     -val
       ...
  -FEN1_5fv7
     ...
  -GBA_2v3e
     ...
...
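
Before training, it can be useful to check that the downloaded files ended up in the layout shown above. The following is a minimal sketch, assuming only that each target directory contains split subdirectories holding .dgl files; the helper name summarize_layout is hypothetical and not part of this repository.

```python
from pathlib import Path


def summarize_layout(data_root):
    """Count *.dgl files for each (target, split) directory under data_root."""
    counts = {}
    for target_dir in sorted(Path(data_root).iterdir()):
        if not target_dir.is_dir():
            continue
        for split_dir in sorted(target_dir.iterdir()):
            if split_dir.is_dir():
                n_graphs = sum(1 for _ in split_dir.glob("*.dgl"))
                counts[(target_dir.name, split_dir.name)] = n_graphs
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python check_layout.py your_own_data_path/docking_poses
    if len(sys.argv) > 1:
        for (target, split), n in summarize_layout(sys.argv[1]).items():
            print(f"{target}/{split}: {n} graphs")
```

A split that reports zero graphs usually indicates an incomplete download or a misplaced directory.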

Reproduce Results

The ./model directory contains seven trained models for reproducing results.

Train the Model

Run: python train.py --data_root your_own_data_path/docking_poses

Test the Model

Run: python test.py --data_root your_own_data_path/docking_poses
By default, this uses the seven trained models in the ./model directory.

Process Raw Data

First, prepare the input complexes: python preprocess_complex.py --data_root your_own_data_path/docking_poses
Then, generate the graphs: python graph_constructor.py --data_root your_own_data_path/docking_poses
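
Because the second step depends on the output of the first, the two commands can be chained so that graph construction only starts if preprocessing succeeds. The wrapper below is a sketch, not part of the repository: the function names are hypothetical, and it assumes it is run from the repository root using only the --data_root flag shown above.

```python
import subprocess
import sys


def preprocessing_commands(data_root, python=sys.executable):
    """Return the two preprocessing commands in the order they must run."""
    return [
        [python, "preprocess_complex.py", "--data_root", data_root],
        [python, "graph_constructor.py", "--data_root", data_root],
    ]


def run_preprocessing(data_root):
    """Run both steps; check=True stops at the first step that fails."""
    for cmd in preprocessing_commands(data_root):
        subprocess.run(cmd, check=True)


# Example (not executed here):
# run_preprocessing("your_own_data_path/docking_poses")
```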

References

[1] Tran-Nguyen V K, Jacquemard C, Rognan D. LIT-PCBA: an unbiased data set for machine learning and virtual screening. Journal of Chemical Information and Modeling, 2020, 60(9): 4263-4273.
[2] Shen C, Weng G, Zhang X, et al. Accuracy or novelty: what can we gain from target-specific machine-learning-based scoring functions in virtual screening? Briefings in Bioinformatics, 2021, 22(5): bbaa410.