predict_Kinase

Machine learning methods that use structure-based and energy-based descriptors of protein-ligand interactions to predict the activity of novel compounds against a broad range of human kinases.

The predictive models are trained on a dataset of 104 human kinases that have available PDB structures and experimental activity data against 1202 small-molecule compounds from the PubChem BioAssay dataset 'Navigating the Kinome' (https://pubchem.ncbi.nlm.nih.gov/bioassay/493040#section=Top).

The project contains Jupyter notebooks for model training, separate scripts for model evaluation, saved files with pre-trained models, and a bash script that runs the pipeline preparing descriptors for machine learning.

Kinase-ligand complexes in our pipeline are obtained with the SMINA docking software, and the docking scores are used as part of the descriptor set.
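As a rough illustration of how docking scores could be collected as descriptors, the Python sketch below reads the score of each pose from SMINA output. It assumes the poses were written in SDF format, where SMINA stores the score in the minimizedAffinity data field. The function name read_smina_scores and the sample text are hypothetical and are not taken from this project's scripts.

```python
def read_smina_scores(sdf_text):
    """Return the minimizedAffinity value of every pose, in file order."""
    scores = []
    lines = sdf_text.splitlines()
    for i, line in enumerate(lines[:-1]):
        # An SDF data field header looks like ">  <minimizedAffinity>";
        # its value is on the following line.
        if line.startswith(">") and "<minimizedAffinity>" in line:
            scores.append(float(lines[i + 1].strip()))
    return scores


# Hypothetical two-pose output with the atom and bond blocks omitted.
SAMPLE_SDF = "\n".join([
    "pose_1",
    ">  <minimizedAffinity>",
    "-9.10000",
    "",
    "$$$$",
    "pose_2",
    ">  <minimizedAffinity>",
    "-8.40000",
    "",
    "$$$$",
])

pose_scores = read_smina_scores(SAMPLE_SDF)
best_score = min(pose_scores)  # more negative means a better predicted affinity
```

Taking the minimum over poses is one simple way to reduce several poses to a single descriptor value; the project may select poses differently.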

The descriptor preparation pipeline uses a number of tools, including:

SMINA for docking and pose scoring
ICM
PLIP
X-Score
ACPYPE
GB_NSR6
R

Most of these tools are freely available, except for ICM-Pro. It was used only in steps 1 and 3 for a few operations and could be replaced by any other suitable program. We used ICM in our pipeline to save protein-ligand complex structures as PDB files after docking for further analysis, and to calculate the protein-ligand interaction surface (ICM_area), the number of protein-ligand hydrogen bonds (ICM_hbonds), the number of atoms in each ligand (nof_Atoms), and the number of rotatable bonds in each ligand (nof_RotB).

The descriptor preparation pipeline can be run with bash run_pipeline.sh. Before running it, correct the paths to the software in the header of the script.
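Before launching the pipeline, it can help to confirm that every configured tool path points to an executable. The Python sketch below is one possible check. The function find_missing_tools and the example paths are hypothetical, and the variable names actually used in the header of run_pipeline.sh may differ.

```python
import shutil


def find_missing_tools(tool_paths):
    """Return the names of tools whose path or command cannot be executed.

    tool_paths maps a tool name to the path (or command name) configured for it.
    """
    missing = []
    for name, path in tool_paths.items():
        # shutil.which resolves command names via PATH and also accepts full paths.
        if shutil.which(path) is None:
            missing.append(name)
    return missing


# Hypothetical configuration; replace the values with the paths from the script header.
configured = {
    "SMINA": "/opt/smina/smina",
    "X-Score": "/opt/xscore/bin/xscore",
}
for tool in find_missing_tools(configured):
    print("Path not found or not executable for:", tool)
```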

The pipeline contains the following steps:

1. Calculate nof_Atoms and nof_RotB for ligands with ICM-Pro.
2. Calculate the solvation energy of free ligands:
   a. Prepare PQR files for ligands with Amber tools and the ACPYPE script, as input for the r6_born utility.
   b. Run the GB_NSR6 utility to calculate the solvation energy of free ligands.
3. Run docking with SMINA and process the docking results with ICM-Pro.
4. Re-score protein-ligand complexes with X-Score.
5. Analyze protein-ligand contacts with PLIP (a PyMOL-based tool).
6. Join the tables into the summary descriptor table 'test_compl_descr.tab' with R.
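The pipeline performs the final join in R. Purely as an illustration of the idea, the Python sketch below merges two tab-delimited tables on shared key columns. The key columns (receptor, ligand), the smina_score column, and the sample rows are assumptions made for this example; only ICM_hbonds is a descriptor name taken from the text above.

```python
import csv
import io


def read_tab(text):
    """Parse tab-delimited text with a header row into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def inner_join(left, right, keys):
    """Inner-join two lists of row dicts on the given key columns."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            merged = dict(row)
            # Add the non-key columns of the matching right-hand row.
            merged.update({c: v for c, v in match.items() if c not in keys})
            joined.append(merged)
    return joined


# Hypothetical per-step tables keyed by receptor and ligand identifiers.
docking = read_tab("receptor\tligand\tsmina_score\nR1\tL1\t-9.1\nR1\tL2\t-7.5\n")
contacts = read_tab("receptor\tligand\tICM_hbonds\nR1\tL1\t3\n")

summary = inner_join(docking, contacts, keys=("receptor", "ligand"))
```

An inner join keeps only the receptor-ligand pairs present in every table, so a pair that failed in one step drops out of the summary. Whether the R script behaves this way is not stated here.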

The kinase PDB files to be used for docking can be found in recept.tar.gz. The list of receptors is available in the file recept.tab. A number of receptor-associated descriptors are pre-calculated and available in tarSim_allPairs.tab.

The main script for training the activity prediction models is the Jupyter notebook train_main_model.ipynb. It covers descriptor processing and the construction and training of two activity prediction models: a DNN model built with Keras and a Random Forest model built with the scikit-learn library.

