This repository contains a reference implementation to process the data, build the graph, and train and apply the graph neural network model introduced in the paper "G-PLIP: Knowledge graph neural network for structure-free protein-ligand affinity prediction" by Simon J. Crouzet, Anja Maria Lieberherr, Kenneth Atz, Tobias Nilsson, Lisa Sach-Peltason, Alex T. Müller, Matteo Dal Peraro, and Jitao David Zhang.
Create and activate the conda environment:
conda env create -f requirements.yml
conda activate g-plip
Because the model is small, it runs on CPU, so you do not need to install CUDA-related packages.
Thanks to its minimal architecture, the model trains on CPUs in a relatively short time. You will, however, need a substantial amount of available RAM.
The data/ directory needs:
- The PDBBind v2020 data, which can be retrieved from the PDBBind website. You need both the "general set minus refined set (2)" and the "refined set (3)"; unzip them and merge the two into the folder data/v2020-complete-PL
- The PDBBind CASF 2016 Benchmark set, which can be retrieved from the PDBBind website. You need the "CASF-2016" data, unzipped in the folder data/CASF-2016
- The Mapping_HumanProteins_20221006.tsv file, provided, which maps between GeneID, UniProtID and EnsemblID for human proteins
- The CASF_Mapping_HumanProteins_20230627.tsv file, provided, which maps between PDB Code and UniProtID for human proteins in the CASF Core Set. Please note that this mapping has been manually refined by double-checking the PDB entries.
- A PPI database giving pairs of interacting GeneIDs. By default, we used the database from the publication "Large-scale analysis of disease pathways in the human interactome" by M. Agrawal, M. Zitnik & J. Leskovec
- Gene expression data, retrieved by default from the Human Protein Atlas, where we took the "RNA consensus tissue gene data (4)"
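Merging the two unzipped PDBBind sets into data/v2020-complete-PL can be scripted. The following is a minimal sketch and not part of the repository: the function name merge_pdbbind_sets is hypothetical, and the source folders are whatever paths you unzipped the archives to. It simply copies every top-level entry of each source folder into the target, merging folders that appear in both sets.

```python
import shutil
from pathlib import Path


def merge_pdbbind_sets(sources, target):
    """Copy every top-level entry of each source folder into target.

    Hypothetical helper (not provided by G-PLIP). Returns the number of
    top-level entries copied across all sources.
    """
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    copied = 0
    for source in map(Path, sources):
        if not source.is_dir():
            raise FileNotFoundError(f"Missing unzipped set: {source}")
        for entry in sorted(source.iterdir()):
            destination = target / entry.name
            if entry.is_dir():
                # dirs_exist_ok (Python 3.8+) merges folders present in both sets.
                shutil.copytree(entry, destination, dirs_exist_ok=True)
            else:
                shutil.copy2(entry, destination)
            copied += 1
    return copied
```

For example, assuming the archives were unzipped under data/, calling merge_pdbbind_sets(["data/v2020-other-PL", "data/v2020-refined-PL"], "data/v2020-complete-PL") would populate the merged folder. Copying rather than moving keeps the separate subsets available.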
We provide the prebuilt graph, built from the PDBBind data using the default settings in configs/config.ini (please keep the name intact).
N.B.: You still need the raw data, because G-PLIP performs a file check first.
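Before launching anything, it can be useful to confirm that the raw data is where it is expected. The sketch below is a hypothetical pre-flight check, not G-PLIP's own file check, which may test different or additional files. It only covers the paths named in this README, and it assumes the two provided TSV files sit directly under data/.

```python
from pathlib import Path

# Paths named in this README. Placing the two TSV files directly under
# data/ is an assumption of this sketch.
EXPECTED_PATHS = (
    "v2020-complete-PL",
    "CASF-2016",
    "Mapping_HumanProteins_20221006.tsv",
    "CASF_Mapping_HumanProteins_20230627.tsv",
)


def missing_raw_data(data_dir="data", expected=EXPECTED_PATHS):
    """Return the expected entries that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).exists()]
```

An empty list from missing_raw_data() means every listed entry was found. The PPI database and gene expression files are left out on purpose, since their file names are not stated here.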
With the raw data correctly imported, you can proceed to the graph building with:
python -m scripts.construct_pdbbind
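Since the graph is built from the settings in configs/config.ini, it may help to review them first. The following is a minimal sketch using Python's standard configparser. The function name load_settings is hypothetical, and no section or option names are assumed: it just returns whatever the file contains.

```python
import configparser


def load_settings(path="configs/config.ini"):
    """Return {section: {option: value}} for an INI file.

    Hypothetical helper for inspection only; values are returned as strings.
    """
    # interpolation=None keeps raw values, so '%' characters are not expanded.
    parser = configparser.ConfigParser(interpolation=None)
    # read() silently skips missing files, so check what it actually loaded.
    if not parser.read(path):
        raise FileNotFoundError(path)
    return {section: dict(parser[section]) for section in parser.sections()}
```

Printing the returned dictionary section by section gives a quick overview of the settings in effect before a long graph construction run.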
The model can be trained on the PDBBind dataset using:
python -m scripts.pdbbind_pipeline [args]
- By default, evaluation is done on the 2016 CASF Core Set. For an evaluation on the 2019 Temporal Hold-out Set, use --split_type temporal
- To use the refined or other subset of PDBBind rather than the complete set, use --pdbbind_set refined or --pdbbind_set other. For that, you will need the corresponding data in the folders data/v2020-refined-PL and data/v2020-other-PL, as present by default in the PDBBind zip files.
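When running several configurations, the options above can be assembled programmatically. This is a minimal sketch under stated assumptions: the helper names build_pipeline_command and required_folder are hypothetical, only the two flags documented above are used, and no other arguments of the pipeline are assumed.

```python
import sys

# Folders required for each documented --pdbbind_set value, as listed above.
PDBBIND_FOLDERS = {
    "refined": "data/v2020-refined-PL",
    "other": "data/v2020-other-PL",
}


def required_folder(pdbbind_set=None):
    """Return the data folder needed for the chosen PDBBind subset."""
    if pdbbind_set is None:
        # Default case: the merged, complete set.
        return "data/v2020-complete-PL"
    return PDBBIND_FOLDERS[pdbbind_set]


def build_pipeline_command(split_type=None, pdbbind_set=None):
    """Return the argv list for python -m scripts.pdbbind_pipeline."""
    command = [sys.executable, "-m", "scripts.pdbbind_pipeline"]
    if split_type is not None:
        command += ["--split_type", split_type]
    if pdbbind_set is not None:
        command += ["--pdbbind_set", pdbbind_set]
    return command
```

The resulting list can be passed to subprocess.run(..., check=True) from the repository root. For instance, build_pipeline_command(split_type="temporal", pdbbind_set="refined") corresponds to a temporal hold-out evaluation on the refined subset.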
The software was developed at F. Hoffmann - La Roche Ltd. and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0), as described in LICENSE.
@misc{crouzet_pliprediction_2023,
title = {G-{PLIP}: {Knowledge} graph neural network for structure-free protein-ligand affinity prediction},
doi = {10.1101/2023.09.01.555977},
publisher = {bioRxiv},
author = {Crouzet, Simon J. and Lieberherr, Anja Maria and Atz, Kenneth and Nilsson, Tobias and Sach-Peltason, Lisa and M{\"u}ller, Alex T. and Peraro, Matteo Dal and Zhang, Jitao David},
month = sep,
year = {2023},
}