Human-in-the-loop De novo Drug Design by Preference Learning (HIL-DD)
About
This directory contains the code and resources of the following paper:
"Human-in-the-loop De novo Drug Design by Preference Learning". Under review.
- HIL-DD is a new Human-In-the-Loop Drug Design framework that enables human experts and AI to co-design molecules in 3D space conditioned on a protein pocket.
- The backbone model is a surprisingly simple generative model called rectified flow (RF), which is based on ordinary differential equations (ODEs) [1-2]. By combining RF with equivariant graph neural networks (EGNNs) [3], we create an equivariant rectified flow model (ERFM).
- Our HIL-DD framework is built upon ERFM. It takes molecules generated by ERFM conditioned on a protein pocket as input and incorporates human experts' preferences to generate new molecules that reflect those preferences.
- Our experimental results are based on the CrossDocked dataset [4], which is available here.
- If you have any issues using this software, please do not hesitate to contact Youming Zhao (youming0.zhao@gmail.com). We will try our best to assist you. Any feedback is highly appreciated.
Overview of the Model
We introduce HIL-DD to bridge the gap between human experts and AI.
Step 1. Construct an equivariant rectified flow model (ERFM) and train it on the CrossDocked dataset
In this step, we combine EGNNs and RF to create the ERFM. The ERFM is then trained on the CrossDocked dataset using protein pockets as a condition.
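For readers new to rectified flow, the sketch below illustrates the general RF recipe from [1-2] on plain Python lists: interpolate linearly between noise and data, regress a velocity field onto the straight-line direction, and sample by integrating the ODE with Euler steps. It is not the ERFM implementation. All function names are hypothetical, and the velocity here is a closed-form toy field rather than an EGNN conditioned on a pocket.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for a velocity model in RF: the constant direction x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity, x0, num_steps=100):
    """Integrate dx/dt = velocity(x, t) from t = 0 towards t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # the largest t used is 1 - dt, so 1 - t never reaches zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check (assumption: every "data point" equals `center`). In that case the
# ideal RF velocity field is (center - x) / (1 - t), so sampling should carry
# any starting noise to `center`.
center = [1.0, -2.0, 0.5]

def ideal_velocity(x, t):
    return [(c - xi) / (1.0 - t) for c, xi in zip(center, x)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in center]
sample = euler_sample(ideal_velocity, noise, num_steps=50)
```

In a trained model, `ideal_velocity` would be replaced by a learned network, and the regression loss would compare its output at `interpolate(x0, x1, t)` with `target_velocity(x0, x1)`.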
Step 2. Generate samples
We utilize a well-trained ERFM to generate molecules conditioned on a protein pocket of interest.
Step 3. Propose promising molecules as positive samples and unpromising molecules as negative samples
For a given preference, say binding affinity, we select from the generated samples the molecules with high binding affinity (measured by Vina score in our work, where a lower score indicates stronger predicted binding) as promising samples, and the molecules with low binding affinity as unpromising samples.
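As a rough illustration of this step, the sketch below splits molecules into positive and negative proposals by Vina score. The function name and the example scores are made up for illustration. Assigning -9 as the cutoff for promising molecules and -7 as the cutoff for unpromising ones is an assumption about how the default thresholds in select_proposals.py are used; check that file for the actual logic.

```python
def split_by_vina(scores, good_max=-9.0, bad_min=-7.0):
    """Return (positive_ids, negative_ids) from a {mol_id: vina_score} mapping.

    Lower (more negative) Vina scores mean stronger predicted binding, so
    scores <= good_max count as promising and scores >= bad_min as unpromising.
    Molecules between the two thresholds are left unlabeled.
    """
    positives = sorted((m for m, s in scores.items() if s <= good_max),
                       key=scores.get)                # best binders first
    negatives = sorted((m for m, s in scores.items() if s >= bad_min),
                       key=scores.get, reverse=True)  # worst binders first
    return positives, negatives

# Illustrative scores only, not results from the paper.
scores = {"mol_a": -9.8, "mol_b": -8.1, "mol_c": -6.2, "mol_d": -10.4, "mol_e": -5.0}
positives, negatives = split_by_vina(scores)
```

Leaving a gap between the two thresholds keeps borderline molecules out of both sets, which gives the finetuning step a cleaner preference signal.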
Step 4. Employ our HIL-DD algorithm to finetune ERFM
With the human annotations obtained from the previous step, we finetune the well-trained ERFM using the HIL-DD algorithm.
For more detailed information, please refer to our paper.
Generated molecules of ERFM and HIL-DD
In this section, we present molecules generated by HIL-DD for six preferences: Vina score, bond angle, benzene ring, bond length, dihedral angle deviation, and large-ring avoidance. For comparison, we also provide the molecules that ERFM generates from the same noise used by HIL-DD. These examples demonstrate that HIL-DD captures human preferences well.
Sub-directories
- [configs] contains configuration files for training, sampling, and finetuning.
- [statistics] includes statistics about the training data, such as the bond-length, bond-angle, and dihedral-angle distributions.
- [datasets] contains files for preprocessing data.
- [models] contains the architectures of ERFM and the classifier.
- [toy-experiment] contains code for conducting a toy experiment to validate our algorithm.
- [utils] contains various helper functions.
Toy experiment
Before you check out our ERFM and HIL-DD, you can try to run the toy experiment. You will see the beauty of preference learning in a few minutes.
Dependencies
We recommend using Anaconda to create an environment for installing all dependencies. If you have Anaconda installed, please run the following command to install all packages. Normally, this can be done within a few minutes:
conda create --name HIL-DD --file configs/spec-file.txt
The main dependencies are as follows:
- Python==3.10
- PyTorch==1.12.1
- PyTorch Geometric==2.1.0
- NumPy==1.23.3
- OpenBabel==3.1.1
- RDKit==2022.03.5
- QVina==2.1.0
- SciPy==1.9.1
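After creating the environment, a quick check like the one below can confirm that the pinned versions were actually installed. This is a minimal sketch, not part of the repository. The distribution names (`torch`, `torch-geometric`, `numpy`, `scipy`) are assumptions about how the packages register themselves, and conda-only packages such as OpenBabel, RDKit, or QVina may not be visible to `importlib.metadata` at all, so they are left out.

```python
from importlib import metadata

# Pins copied from the list above; the distribution names are assumptions.
REQUIRED = {
    "torch": "1.12.1",
    "torch-geometric": "2.1.0",
    "numpy": "1.23.3",
    "scipy": "1.9.1",
}

def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

def find_mismatches(required, lookup=installed_version):
    """Map each package that is missing or off-pin to (wanted, found)."""
    mismatches = {}
    for name, wanted in required.items():
        found = lookup(name)
        # Ignore local build suffixes such as "+cu113" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches

if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(REQUIRED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means every listed package matches its pin.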
Vina Docking Score Calculation
To calculate Vina docking scores, you need to download the full protein pocket files from here and place them in the configs folder. Then, unzip the files.
Prepare Receptor Files
If all your experiments are based on the CrossDocked dataset, please skip the following two steps.
If you want to compute the binding affinity for the generated molecules conditioned on your own pocket, it is recommended to create a separate environment to install MGLTools. This is because MGLTools and OpenBabel may not be compatible.
- Put the untailored PDB file under the examples/ folder and run the following command:
python utils/prepare_receptor4.py -r examples/xxxx_full.pdb -o examples/xxxx_full.pdbqt
- Put the tailored PDB file under examples/.
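If you have several receptors to convert, the command above can be generated per file. The sketch below is a hypothetical helper, not part of the repository. It assumes that untailored files follow the `*_full.pdb` naming used in the example and that `python` resolves to the interpreter of the MGLTools environment.

```python
import subprocess
from pathlib import Path

def receptor_command(pdb_path, script="utils/prepare_receptor4.py"):
    """Build the prepare_receptor4.py call that writes a .pdbqt next to the input."""
    pdb_path = Path(pdb_path)
    out_path = pdb_path.with_suffix(".pdbqt")
    return ["python", script, "-r", pdb_path.as_posix(), "-o", out_path.as_posix()]

def prepare_all(folder="examples", run=subprocess.run):
    """Convert every *_full.pdb file in `folder`; returns the commands that were run."""
    commands = [receptor_command(p) for p in sorted(Path(folder).glob("*_full.pdb"))]
    for cmd in commands:
        run(cmd, check=True)  # stop at the first failing conversion
    return commands

# Example: the command for a single receptor file.
example = receptor_command("examples/2V3R_full.pdb")
```

Passing `run` as a parameter makes it easy to print the commands first (for example `run=lambda cmd, check: print(cmd)`) before executing anything.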
Data
We trained/tested ERFM and HIL-DD using the same datasets as SBDD, Pocket2Mol, and TargetDiff.
If you only want to sample molecules for the pockets of the CrossDocked test set, we have stored those pockets in configs/test_data_list.pt, so you can skip the following steps.
- Download the dataset archive crossdocked_pocket10.tar.gz and the split file split_by_name.pt from this link and place them under data/.
- Extract the TAR archive using the command:
tar -xzvf crossdocked_pocket10.tar.gz
Please note that it may take approximately 2 hours to preprocess the data when training ERFM or HIL-DD for the first time.
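Before starting a long preprocessing run, it can help to confirm that both files are in place. The following sketch checks for them and, if they are present, unpacks the archive with Python's `tarfile` module instead of the `tar` command. The helper names are hypothetical, and extracting inside data/ is an assumption about where the training scripts expect the unpacked folder.

```python
import tarfile
from pathlib import Path

EXPECTED = ("crossdocked_pocket10.tar.gz", "split_by_name.pt")

def missing_files(data_dir="data", expected=EXPECTED):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]

def extract_archive(data_dir="data", archive="crossdocked_pocket10.tar.gz"):
    """Unpack the archive inside data_dir, like `tar -xzf` run from that folder."""
    data_dir = Path(data_dir)
    with tarfile.open(data_dir / archive, "r:gz") as tar:
        tar.extractall(data_dir)

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        extract_archive()
```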
Prepare proposals for HIL-DD
To prepare proposals for HIL-DD, please follow the steps below:
- Choose a protein pocket of interest either from the test set or from another dataset. If the protein pocket of interest is a member of the CrossDocked test set, refer to this .csv file for the corresponding PDB ID.
- To sample molecules from the chosen protein pocket, use the following command:
python sampling.py --device cuda --config configs/sampling.yml --pocket_id 4 --num_samples 1000
Make sure to replace the --pocket_id value with the index of the desired pocket.
Run this command 13 times to generate 13 result files. These result files will be used to select good and bad samples.
Note that if you don't mind the samples overlapping among the 12 preference injections, you can run the command only 3 times.
- Move all the result files from logs_sampling/datetime/sample-results/datetime.pt to a new folder named tmp/samples_pocket4.
- Calculate the metrics for the samples using the following command:
python cal_metric_one_pocket.py tmp/samples_pocket4
- Select the good and bad molecules using the command:
python select_proposals.py tmp/samples_pocket4 tmp/samples_pocket4_proposals
In the select_proposals.py file, you can specify the lower and upper thresholds for various preferences such as Vina score, bond angle, bond length, benzene ring, large ring, and dihedral angle deviation. By default, the thresholds for Vina score are -7 and -9. For more details, please refer to the last lines of the select_proposals.py file.
The minimum number of positive and negative samples is determined by config.pref.num_positive_samples x config.pref.proposal_factor and config.pref.num_negative_samples x config.pref.proposal_factor, respectively.
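The steps above can be chained in a small driver script. The sketch below is one possible way to do it and is not part of the repository. The helper names are hypothetical, and it assumes each sampling run writes its result to logs_sampling/<datetime>/sample-results/<datetime>.pt as described above. Note that `collect_results` moves every matching file, including results left over from earlier runs.

```python
import shutil
import subprocess
from pathlib import Path

def sampling_command(pocket_id, num_samples=1000):
    """Command for one sampling run on a single pocket."""
    return ["python", "sampling.py", "--device", "cuda",
            "--config", "configs/sampling.yml",
            "--pocket_id", str(pocket_id), "--num_samples", str(num_samples)]

def metric_and_select_commands(sample_dir):
    """Commands that compute metrics, then select good and bad molecules."""
    return [["python", "cal_metric_one_pocket.py", sample_dir],
            ["python", "select_proposals.py", sample_dir, f"{sample_dir}_proposals"]]

def collect_results(log_root, dest):
    """Move every <log_root>/*/sample-results/*.pt file into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for pt_file in sorted(Path(log_root).glob("*/sample-results/*.pt")):
        target = dest / pt_file.name
        shutil.move(str(pt_file), str(target))
        moved.append(target)
    return moved

def prepare_proposals(pocket_id, num_runs=13, run=subprocess.run):
    """Sample num_runs times, gather the result files, then score and select."""
    sample_dir = f"tmp/samples_pocket{pocket_id}"
    for _ in range(num_runs):
        run(sampling_command(pocket_id), check=True)
    collect_results("logs_sampling", sample_dir)
    for cmd in metric_and_select_commands(sample_dir):
        run(cmd, check=True)
```

Calling `prepare_proposals(4)` would reproduce the commands shown above for pocket 4. Pass `num_runs=3` if overlapping samples across preference injections are acceptable.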
Code Usage
To train ERFM, use the following command:
python train_ERFM.py --device cuda --config configs/config_ERFM.yml
To sample with a pretrained ERFM for all 100 pockets in the CrossDocked test set, run the following command:
python sampling.py --device cuda --config configs/sampling.yml
To sample with a protein pocket that is not in the CrossDocked test set, make sure to place your PDB file under the examples/ directory. Then, execute the following command:
python sampling4pocket.py --device cuda --config configs/sampling.yml --pdb_path examples/2V3R.pdb
If you need to calculate the binding affinity, ensure that you have the complete protein pocket file in the examples/ directory. Then, run the command as shown below:
python sampling4pocket.py --device cuda --config configs/sampling.yml --pdb_path examples/2V3R.pdb --receptor_path examples/2V3R_full.pdbqt
To finetune a pretrained ERFM, use the following command:
python HIL_DD_pref.py --device cuda --config configs/config_pref.yml
License
HIL-DD is licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
Reference
[1]. Liu, Xingchao, et al. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." ICLR (2023).
[2]. Liu, Qiang. "Rectified Flow: A Marginal Preserving Approach to Optimal Transport." arXiv preprint arXiv:2209.14577 (2022).
[3]. Satorras, Victor Garcia, et al. "E(n) Equivariant Graph Neural Networks." ICML (2021).
[4]. Francoeur, Paul G., et al. "Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design." Journal of Chemical Information and Modeling (2020).