LigCare: Processing, comparison and alignment of protein pharmacophoric points to ligand pharmacophores
The goal of this project is to investigate how comparable protein pockets, represented as negative images, are to ligands.
Machine learning models were developed to automatically learn important features and simplify the cloud of points representing a protein binding site, without a priori knowledge of the ligand binding area. Alignment of ligands to protein sites was performed by point cloud registration or by searching for graph isomorphism.
Please note that this repository is a beta version.
- IChem
- Conda
- Conda environment: ligcare.yml
- Dataset: sc-PDB v.2022 (to be published data)
- scripts: Python code to generate and process data sets.
- data: input and output data organization with examples. The full data sets can be generated by accessing sc-PDB v.2022 and using the commands below.
- models: https://zenodo.org/record/7488034, examples of all trained models (binary) in models.tgz archive.
Run <script>.py --help for options.
To signal issues: https://github.com/kimeguida/LigCare/issues
python scripts/cavity_descriptors.py -c data/cavities/2rh1_1_cavityALL.mol2 -p data/proteins/2rh1_1_protein.mol2 -o data/desc/2rh1_1_desc.npy -obsa bsa/2rh1_1_bsa.tsv
Computing the buriedness (BSA) is computationally costly. If BSA data (.bsa files) are already available:
python scripts/cavity_descriptors.py -c data/cavities/2rh1_1_cavityALL.mol2 -p data/proteins/2rh1_1_protein.mol2 -o data/desc/2rh1_1_desc.npy -ibsa bsa/2rh1_1_bsa.tsv
Ligand pharmacophoric features (ph4) and interactions with the target:
python scripts/ligand_to_ph4.py --database --duplicate -l data/ligands.list -idir data/ligands -odir data/ligph4
IChem ints data/proteins/2rh1_1_protein.mol2 data/ligands/2rh1_1_ligand.mol2
Training labels of cavity points for classification:
0: non-interacting
1: interacting
python scripts/cavity_labels.py -c data/cavities/2rh1_1_cavityALL.mol2 -lph data/ligph4/2rh1_1_ligph4.mol2 -int data/ints/2rh1_1_ints.mol2 -o data/labels/2rh1_1_labels.npy
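To illustrate what the 0/1 labels encode, the sketch below marks a cavity point as interacting when it lies within a cutoff distance of any ligand ph4 feature. This is an assumption made for illustration only: the actual criteria applied by cavity_labels.py (which also reads the IChem interaction file) are not described here, and the function name `label_cavity_points`, the 1.5 Å cutoff and the coordinates are all hypothetical.

```python
import math


def label_cavity_points(cavity_xyz, ligand_ph4_xyz, cutoff=1.5):
    """Return one label per cavity point: 1 = interacting, 0 = non-interacting.

    Hypothetical rule: a point is 'interacting' if any ligand ph4 feature
    lies within `cutoff` angstroms. The real script may use other criteria.
    """
    labels = []
    for point in cavity_xyz:
        near = any(math.dist(point, feat) <= cutoff for feat in ligand_ph4_xyz)
        labels.append(1 if near else 0)
    return labels


# Made-up coordinates, not taken from the 2rh1 example files.
cavity = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
ligand_ph4 = [(0.5, 0.0, 0.0), (1.0, 2.0, 0.0)]
labels = label_cavity_points(cavity, ligand_ph4)
print(labels)
```

A real labeling scheme would likely also check that the ph4 types of the cavity point and the ligand feature are compatible, which this sketch ignores.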
After generation of descriptors and labels for the entire database, split into application, balanced training and test sets:
python scripts/prepare_data.py -f data/scpdb_2022.list -d data/desc -l data/labels -a data/scpdb_2022_annotation.tsv --split random > prepare_data.log
outputs:
- features_split.json: training and test sets of points
- external_test_pdb.json: entire left-out cavities, for application
- prepare_data.log: statistics of data splitting into positive, negative and application sets
Several splitting schemes are implemented for left-out cavities: random, GPCR, time, kinase, protease, nuclear receptor.
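The random scheme can be pictured as two steps: set aside whole cavities for application, then balance the interacting and non-interacting points of the remaining cavities. The sketch below is a minimal illustration under stated assumptions, not the logic of prepare_data.py: the function names, the 20% left-out fraction, the fixed seed and the use of undersampling for balancing are all hypothetical.

```python
import random


def split_left_out(cavity_ids, left_out_fraction=0.2, seed=0):
    """Set aside whole cavities for application; the rest feed training/testing."""
    rng = random.Random(seed)
    shuffled = list(cavity_ids)
    rng.shuffle(shuffled)
    n_left_out = int(len(shuffled) * left_out_fraction)
    return shuffled[n_left_out:], shuffled[:n_left_out]


def balance(points, seed=0):
    """Undersample the majority class so both labels are equally represented.

    `points` is a list of (point_id, label) pairs with label in {0, 1}.
    """
    rng = random.Random(seed)
    positives = [p for p in points if p[1] == 1]
    negatives = [p for p in points if p[1] == 0]
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)


cavities = [f"cavity_{i}" for i in range(10)]  # placeholder identifiers
kept, left_out = split_left_out(cavities)

points = [(i, 1 if i < 3 else 0) for i in range(12)]  # 3 positives, 9 negatives
balanced = balance(points)
print(len(kept), len(left_out), len(balanced))
```

Splitting at the cavity level rather than the point level keeps all points of a left-out cavity unseen during training. The family-based and time-based schemes would replace the random shuffle with a selection on the annotation file.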
Example for random left-out. Random Forest (rf) or XGBoost (xgb) classifiers were tested. A separate model is trained for each of the seven ph4 types (CA hydrophobic, CZ aromatic, O h-bond acceptor, OG h-bond acceptor and donor, OD1 negative ionizable, N h-bond donor, NZ positive ionizable).
python scripts/train.py -t data/split_train_test/random/features_split.json -d data/descriptors -clf rf
outputs:
- <ph4>.report: statistics of cross-validation, training and external tests
- <ph4>.model: trained model binary, can be loaded with pickle
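Loading a .model file with pickle could look like the sketch below. So that it runs without the trained models, it first writes a stand-in dictionary to a temporary CA.model file. The helper name `load_model` and the stand-in content are hypothetical. A real file presumably holds the trained classifier object, in which case the library that produced it must be importable when unpickling.

```python
import os
import pickle
import tempfile


def load_model(path):
    """Load a pickled model. Only unpickle files from a source you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in object so the sketch runs without the trained models.
stand_in = {"ph4": "CA", "classifier": "rf"}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "CA.model")
    with open(model_path, "wb") as handle:
        pickle.dump(stand_in, handle)
    model = load_model(model_path)

print(model)
```

Because unpickling can execute arbitrary code, only load model files obtained from a trusted source such as the archive listed above.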
Further analyses are needed to assess the models.
Example for random left-out.
python scripts/application.py -v data/split_train_test/random/external_test_pdb.json -c data/scpdb_2022/cavities --models models/rf/random/CA.model models/rf/random/CZ.model models/rf/random/O.model models/rf/random/OG.model models/rf/random/OD1.model models/rf/random/N.model models/rf/random/NZ.model --modelnames CA CZ O OG OD1 N NZ -d data/descriptors -o data/pred_pharm/rf/random/
For all left-out.
for f in random kinase gpcr nuclear_receptor protease time; do echo predicting for $f............; python scripts/application.py -v data/split_train_test/$f/external_test_pdb.json -c data/scpdb_2022/cavities --models models/rf/$f/CA.model models/rf/$f/CZ.model models/rf/$f/O.model models/rf/$f/OG.model models/rf/$f/OD1.model models/rf/$f/N.model models/rf/$f/NZ.model --modelnames CA CZ O OG OD1 N NZ -d data/descriptors -o data/pred_pharm/rf/$f/; done
outputs:
- pruned cavities: only points predicted to be interacting are kept in the pruned cavities.
- figures: balanced accuracy, specificity, sensitivity, degree of pruning
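The reported metrics follow from a confusion matrix of true and predicted point labels, as in the minimal sketch below. Sensitivity, specificity and balanced accuracy use their standard definitions. The degree of pruning is assumed here to be the fraction of cavity points removed, which may differ from the definition used by the scripts. The function name and the label vectors are hypothetical.

```python
def pruning_metrics(y_true, y_pred):
    """Compute per-cavity metrics from 0/1 true labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        # Assumption: pruning degree = fraction of points predicted non-interacting.
        "pruning_degree": 1 - sum(y_pred) / len(y_pred),
    }


y_true = [1, 1, 0, 0, 0, 1, 0, 0]  # made-up labels
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]  # made-up predictions
metrics = pruning_metrics(y_true, y_pred)
print(metrics)
```

Balanced accuracy is the mean of sensitivity and specificity, so it is not inflated when non-interacting points greatly outnumber interacting ones.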
Example for the prediction of GPCR left-out cavities
python scripts/statistics_application.py -c data/cavities -pr data/pred_pharm/rf/gpcr/ -ap data/split_train_test/gpcr/external_test_pdb.json -d ../data/descriptors -l ../data/labels -an scpdb_2022_annotation.tsv
Transformation of ligands/molecules into pharmacophoric (ph4) features:
python scripts/ligand_to_ph4.py --database --duplicate -l data/ligands.list -idir data/ligands -odir data/ligph4
Align by searching common substructures in the ph4 features:
- Using IChem VolSite grid-sampled cavities
python scripts/ligcare_graph.py -c data/cavities/2rh1_1_cavityALL.mol2 -p data/proteins/2rh1_1_protein.mol2 -l data/ligph4/2rh1_1_ligph4.mol2 -lig data/ligands/2rh1_1_ligand.mol2
- Using projected irregular protein features
Compute protph4 features...
python scripts/protein_to_ph4.py -p data/proteins/2rh1_1_protein.mol2 -c data/cavities/2rh1_1_cavityALL.mol2 -o data/protph4/2rh1_1_photph4.mol2
... and compare
python scripts/ligcare_graph.py -c data/protph4/2rh1_1_photph4.mol2 -p data/proteins/2rh1_1_protein.mol2 -l data/ligph4/2rh1_1_ligph4.mol2 -lig data/ligands/2rh1_1_ligand.mol2
outputs:
- rot_<ligand_name>_x.mol2: ligand poses, x = {1, 2, 3, etc.}
- rot_<ligph4_name>_x.mol2: ligand ph4 poses, x = {1, 2, 3, etc.}
By default, the top 20 solutions ranked by best RMSE of the aligned feature points are output.
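Ranking candidate alignments by RMSE can be sketched as follows. The example assumes each solution already provides matched pairs of aligned ligand feature points and cavity points. The names `rmse` and `top_solutions`, the data layout and the coordinates are hypothetical and do not reflect how ligcare_graph.py stores its solutions.

```python
import math


def rmse(points_a, points_b):
    """Root-mean-square error between matched pairs of 3D points."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("point lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(points_a, points_b))
    return math.sqrt(squared / len(points_a))


def top_solutions(solutions, n=20):
    """Return the n lowest-RMSE solutions as (rmse, name) tuples.

    `solutions` is a list of (name, aligned_ligand_points, cavity_points).
    """
    scored = [(rmse(lig, cav), name) for name, lig, cav in solutions]
    scored.sort()
    return scored[:n]


target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # made-up cavity feature points
solutions = [
    ("pose_c", [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)], target),
    ("pose_a", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], target),
    ("pose_b", [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)], target),
]
best = top_solutions(solutions, n=2)
print(best)
```

A lower RMSE means the matched feature points superimpose more closely, so the best solution comes first.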
Compute the ligvoxelplus representation of the ligand/molecule:
python scripts/ligvoxelplus.py -i data/ligands/2rh1_1_ligand.mol2 -o data/ligvoxelplus/2rh1_1_ligvoxelplus.mol2
Align by point cloud registration with ProCare (see requirements and usage) and apply roto/translation to the ligand atoms:
(procare) $ python procare_launcher.py -t data/cavities/2rh1_1_cavityALL.mol2 -s data/ligvoxelplus/2rh1_1_ligvoxelplus.mol2 --transform --ligandtransform data/ligands/2rh1_1_ligand.mol2
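Applying a roto/translation to ligand atoms amounts to multiplying each coordinate by a rotation matrix and adding a translation vector. The sketch below does this with a 4x4 homogeneous matrix. How ProCare represents or outputs the transformation is not described here, so the matrix layout, the function name `apply_transform` and the coordinates are assumptions.

```python
def apply_transform(matrix, coords):
    """Apply a 4x4 homogeneous rigid transform to a list of (x, y, z) tuples."""
    moved = []
    for x, y, z in coords:
        moved.append(tuple(
            matrix[i][0] * x + matrix[i][1] * y + matrix[i][2] * z + matrix[i][3]
            for i in range(3)
        ))
    return moved


# Example transform: 90-degree rotation about the z axis, then shift by +1 along x.
transform = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
atoms = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.5)]  # made-up ligand atom coordinates
moved = apply_transform(transform, atoms)
print(moved)
```

The same matrix found when registering the ligvoxelplus points to the cavity is applied unchanged to the ligand atoms, which is what places the ligand in the binding site.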
- Eguida, M. Comparison of Protein Cavities by Point Cloud Processing: Principles and Applications in Drug Design. PhD thesis, 2022.
- Desaphy, J.; Azdimousa, K.; Kellenberger, E.; Rognan, D. Comparison and Druggability Prediction of Protein–Ligand Binding Sites from Pharmacophore-Annotated Cavity Shapes. J. Chem. Inf. Model. 2012, 52 (8), 2287–2299. https://doi.org/10.1021/ci300184x
- Da Silva, F.; Desaphy, J.; Rognan, D. IChem: A Versatile Toolkit for Detecting, Comparing, and Predicting Protein–Ligand Interactions. ChemMedChem 2018, 13 (6), 507–510. https://doi.org/10.1002/cmdc.201700505
- Eguida, M.; Rognan, D. A Computer Vision Approach to Align and Compare Protein Cavities: Application to Fragment-Based Drug Design. J. Med. Chem. 2020. https://doi.org/10.1021/acs.jmedchem.0c00422