/paccmann_kinase_binding_residues

Comparison of active site and full kinase sequences for drug-target affinity prediction and molecular generation. Full paper: https://pubs.acs.org/doi/10.1021/acs.jcim.1c00889

Primary LanguagePythonMIT LicenseMIT

Active sites outperform full proteins for modeling kinases

Python package License: MIT Code style: black DOI:10.1021/acs.jcim.1c00889 DOI:10.1021/acs.jcim.2c00840

Summary

This repository contains data & code for the JCIM paper: Active Site Sequence Representations of Human Kinases Outperform Full Sequence Representations for Affinity Prediction and Inhibitor Generation: 3D Effects in a 1D Model. We study the impact of different protein sequence representations for modeling human kinases. We find that using active site residues yields superior performance to using full protein sequences for predicting binding affinity. We also study the difference of active site vs. full sequence on de-novo design tasks. We generate kinase inhibitors directly from protein sequences with our previously developed hybrid-VAE (PaccMannRL) but find no major differences between both kinase representations.

News

Summary GIF

Description

This repository facilitates the reproduction of the experiments conducted in the JCIM paper Active Site Sequence Representations of Human Kinases Outperform Full Sequence Representations for Affinity Prediction and Inhibitor Generation: 3D Effects in a 1D Model. We provide scripts to:

  1. Train and evaluate the BimodalMCA for drug-protein affinity prediction
  2. Evaluate the bimodal KNN affinity predictor either in a CV setting or on a plain train/test script
  3. Optimize a SMILES- or SELFIES-based molecular generative model to produce molecules with high binding affinities for a protein of interest (affinity is predicted with the KNN model).

Data

The preprocessed BindingDB data (CV and test data for ligand split and kinase split, data used for pretraining and affinity optimization) can be accessed on this Box link. We also release the aligned active site sequences (29 residues) for all kinases. If you use the data, please cite our work.

Installation

The core functionality of this repo is provided in the pkbr package which can be installed (in editable mode) by pip install -e .. In order to execute the example scripts, we recommend setting up a conda environment:

conda env create -f conda.yml
conda activate pkbr

Afterwards you can execute the scripts to run the KNN, the BiMCA or the affinity optimization.

Code examples

All examples assume that you downloaded the data from this Box link and stored it in a folder data in the root of this repo.

Running the KNN affinity predictor in a cross validation

python3 scripts/knn_cv.py -d data/ligand_split -tr train.csv -te validation.csv \
-f 10 -lp data/ligands.smi -kp data/human_kinases_active_site.smi -r my_cv_results

Running KNN in single train/test split

python3 scripts/knn_test.py -t data/ligand_split/fold_0/validation.csv -test data/ligand_split/fold_0/test.csv \
-tlp data/ligands.smi -telp data/ligands.smi -kp data/human_kinases_active_site.smi -r my_results.csv

Training the BiMCA model

python3 scripts/bimca_train.py \
	data/ligand_split/fold_0/train.csv data/ligand_split/fold_0/validation.csv data/ligand_split/fold_0/test.csv \
	human-kinase-alignment data/human_kinases_active_site.smi data/ligands.smi data/smiles_vocab.json \
	models config/active_site.json -n my_as_model

Evaluating the BiMCA model

python3 scripts/bimca_test.py \
data/ligand_split/fold_0/validation.csv human-kinase-alignment data/human_kinases_sequence.smi data/ligands.smi \
path_to_your_trained_model

Affinity optimization with SMILES generator

To execute this part you need to utilize the pretrained SMILES/SELFIES VAE stored under data/models

python scripts/gp_generation_smiles_knn.py \
    data/models/smiles_vae data/affinity_optimization/bindingdb_all_kinase_active_site.csv \
    data/affinity_optimization/example_active_site.smi \
    smiles_active_site_generator/ -s 42 -t 0.85 -r 10 -n 50 -s 42 -i 40 -c 80

Affinity optimization with SELFIES generator

python scripts/gp_generation_selfies_knn.py \
    data/models/selfies_vae data/affinity_optimization/bindingdb_all_kinase_sequence.csv \
    data/affinity_optimization/example_sequence.smi \
    selfies_sequence_generator/ -s 42 -t 0.85 -r 10 -n 50 -s 42 -i 40 -c 80

Choosing active-site sequences

See our new letter in JCIM.

Definitions

How to exactly define an "active site" is a critical choice. While we originally relied on the definition by Sheridan et al. (2009) we have compared now to the active site definition by Martin et al. (2012) and a Combined definition that uses a total of 35 residues from either definitions. This improves performance significantly, especially for allosteric binders.

Augmentations

We also devised novel protein sequence augmentation schemes by flipping/swapping contiguous active-site subsequences that lie discontiguously in the full protein sequence.

To train a model with the new active site definition and the new augmentation, follow the installation setup and download the data from Box.

Afterwards run:

python3 scripts/bimca_train.py \
	data/ligand_split/fold_0/train.csv data/ligand_split/fold_0/validation.csv data/ligand_split/fold_0/test.csv \
	human-kinase-alignment data/human_kinases_active_site_combined.smi data/ligands.smi data/smiles_vocab.json \
	models config/active_site_augment.json -n combined_augment

You can modify config/active_site_augment.json to play with the hyperparameters and the probability of the augmentations.

Citation

If you use this repo, our data, the proposed active site definitions or the sequence augmentation strategies in your projects, please cite the following:

@article{born2022active,
	author = {Born, Jannis and Huynh, Tien and Stroobants, Astrid and Cornell, Wendy D. and Manica, Matteo},
	title = {Active Site Sequence Representations of Human Kinases Outperform Full Sequence Representations for Affinity Prediction and Inhibitor Generation: 3D Effects in a 1D Model},
	journal = {Journal of Chemical Information and Modeling},
	volume = {62},
	number = {2},
	pages = {240-257},
	year = {2022},
	doi = {10.1021/acs.jcim.1c00889},
	note ={PMID: 34905358},
	URL = {https://doi.org/10.1021/acs.jcim.1c00889}
}

@article{born2022on,
        author = {Born, Jannis and Shoshan, Yoel and Huynh, Tien and Cornell, Wendy D. and Martin, Eric J. and Manica, Matteo},
        title = {On the Choice of Active Site Sequences for Kinase-Ligand Affinity Prediction},
        journal = {Journal of Chemical Information and Modeling},
        volume = {62},
        number = {18},
        pages = {4295-4299},
        year = {2022},
        doi = {10.1021/acs.jcim.2c00840},
        URL = {https://doi.org/10.1021/acs.jcim.2c00840}
}