
Primary language: Jupyter Notebook. License: MIT License.

Structure-based Drug Design Benchmark: Do 3D Methods Really Dominate?

This repository hosts an open-source benchmark for Structure-based Drug Design, to facilitate the transparent and reproducible evaluation of algorithmic advances in molecular optimization. This repository supports 16 Structure-based Drug Design algorithms on 7 tasks.


There are two environments: Test Env and TDC Env. Test Env is used to run 3DSBDD, Pocket2mol, PocketFlow, ResGen, and Autogrow4. TDC Env is used to run the remaining models and to evaluate the molecules generated by all models.

conda env create -f environment_TestEnv.yml
conda activate TestEnv2

16 Methods

Based on their ML methodology, the methods are categorized as follows:

  • virtual screening
    • screening randomly searches the ZINC database.
  • GA (genetic algorithm)
    • graph_ga based on molecular graphs
    • smiles_ga based on SMILES
    • Autogrow4 based on SMILES
  • VAE (variational auto-encoder)
    • smiles_vae based on SMILES
    • selfies_vae based on SELFIES
  • RL (reinforcement learning)
    • reinvent
    • moldqn
  • HC (hill climbing)
    • smiles_lstm_hc is SMILES-level HC.
    • mimosa is graph-level HC
  • gradient (gradient ascent)
    • dst is based on molecular graphs.
    • pasithea is based on SELFIES.
  • Auto-regressive
    • 3DSBDD
    • Pocket2mol
    • PocketFlow
    • ResGen

Time is the approximate average wall-clock time for a single run in our benchmark and does not include the time for pretraining or data preprocessing. We have already preprocessed the data and pretrained the models; both are available in the repository.

Model          Dimension  Generated number  Requires GPU
3DSBDD         3D         771               yes
AutoGrow4      2D         1233              yes
Pocket2mol     3D         928               yes
PocketFlow     3D         1000              yes
ResGen         3D         631               yes
DST            2D         1001              no
Graph GA       2D         643               no
MIMOSA         2D         1001              yes
MolDQN         2D         501               yes
Pasithea       1D         914               yes
REINVENT       1D         100               yes
SCREENING      -          1000              no
SMILES-GA      1D         584               no
SMILES-VAE-BO  1D         200               yes

PDB information

All the PDB files can be downloaded from the RCSB Protein Data Bank. The binding sites are as follows:

PDB   Center (x, y, z)                          Bounding box size
1iep  15.6138918, 53.38013513, 15.454837        15
3eml  -9.06363, -7.1446, 55.86259999            15
3ny8  2.2488, 4.68495, 51.39820000000001        15 (23 for Pocket2mol)
4rlu  -0.73599, 22.75547, -31.23689             15
4unn  5.684346153, 18.1917, -7.3715             15
5mo4  -44.901, 20.490354, 8.48335               15
7l11  -21.81481, -4.21606, -27.98378            15 (23 for Pocket2mol)
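
For scripting, the table above can be captured as a small lookup structure. The following is a minimal sketch, not repository code: the names BINDING_SITES, POCKET2MOL_BOX_OVERRIDES, box_size, and box_corners are hypothetical, and it assumes the box is an axis-aligned cube centered on the listed coordinates with an edge length equal to the box size.

```python
# Binding-site centers and box sizes copied from the table above.
# All names here are illustrative and do not come from the repository.
BINDING_SITES = {
    "1iep": ((15.6138918, 53.38013513, 15.454837), 15),
    "3eml": ((-9.06363, -7.1446, 55.86259999), 15),
    "3ny8": ((2.2488, 4.68495, 51.39820000000001), 15),
    "4rlu": ((-0.73599, 22.75547, -31.23689), 15),
    "4unn": ((5.684346153, 18.1917, -7.3715), 15),
    "5mo4": ((-44.901, 20.490354, 8.48335), 15),
    "7l11": ((-21.81481, -4.21606, -27.98378), 15),
}

# Per the table, Pocket2mol uses a larger box for these two targets.
POCKET2MOL_BOX_OVERRIDES = {"3ny8": 23, "7l11": 23}


def box_size(pdb_id, model=None):
    """Return the box size for a target, applying the Pocket2mol exception."""
    if model == "Pocket2mol" and pdb_id in POCKET2MOL_BOX_OVERRIDES:
        return POCKET2MOL_BOX_OVERRIDES[pdb_id]
    return BINDING_SITES[pdb_id][1]


def box_corners(pdb_id, model=None):
    """Return (min_corner, max_corner), ASSUMING a cube centered on the site."""
    center, _ = BINDING_SITES[pdb_id]
    half = box_size(pdb_id, model) / 2
    min_corner = tuple(c - half for c in center)
    max_corner = tuple(c + half for c in center)
    return min_corner, max_corner
```

Calling box_size("3ny8", "Pocket2mol") returns the larger box, while any other model name falls back to the default size listed in the table.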

Sampling and evaluating

For 3DSBDD and Pocket2mol, we use this command to generate:

python sample_for_pdb.py --pdb_path [your pdb] --center=[centers] --bbox_size [box size] --outdir [your outdir]

You also need to set num_samples in sample_for_pdb.yml.
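
When sampling for several targets, it can help to assemble the command programmatically. The sketch below only builds the argument list and prints it; nothing is executed. The helper name build_sample_command and the example paths are hypothetical, and passing the center as comma-separated x,y,z values is an assumption about what [centers] expects.

```python
import shlex


def build_sample_command(pdb_path, center, bbox_size, outdir):
    """Assemble the sample_for_pdb.py call as an argument list (not executed)."""
    # ASSUMPTION: the center is passed as comma-separated x,y,z values.
    center_arg = ",".join(str(v) for v in center)
    return [
        "python", "sample_for_pdb.py",
        "--pdb_path", pdb_path,
        # The --center=VALUE form mirrors the command shown above.
        f"--center={center_arg}",
        "--bbox_size", str(bbox_size),
        "--outdir", outdir,
    ]


# Hypothetical paths; the center and box size come from the 3eml row of the table.
cmd = build_sample_command(
    "pdb/3eml.pdb", (-9.06363, -7.1446, 55.86259999), 15, "outputs/3eml"
)
print(shlex.join(cmd))
```

To actually launch the run, the list could be passed to subprocess.run(cmd, check=True) from the directory that contains sample_for_pdb.py.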

For PocketFlow, we use this command to generate:

python main_generate.py -pkt [your pdb] --ckpt ckpt/ZINC-pretrained-255000.pt -n 1000 -d cuda:0 --root_path [your outdir] --name [pdb name] -at 1.0 -bt 1.0 --max_atom_num 35 -ft 0.5 -cm True --with_print True

For ResGen, we first convert the PDB file to an SDF file and then use this command to generate:

python gen.py --pdb_file [your pdb] --sdf_file [correspond sdf] --outdir [your outdir]

For Autogrow4, we recommend following their tutorial before running the generation command:

python RunAutogrow.py \
    --filename_of_receptor [your pdb] \
    --center_x [center x] --center_y  [center y] --center_z [center z] \
    --size_x [box size] --size_y [box size] --size_z [box size] \
    --source_compound_file /autogrow4/autogrow/source_compounds/naphthalene_smiles.smi \
    --root_output_folder /PATH_TO/output_directory/ \
    --number_of_mutants_first_generation 50 \
    --number_of_crossovers_first_generation 50 \
    --number_of_mutants 50 \
    --number_of_crossovers 50 \
    --top_mols_to_seed_next_generation 50 \
    --number_elitism_advance_from_previous_gen 50 \
    --number_elitism_advance_from_previous_gen_first_generation 10 \
    --diversity_mols_to_seed_first_generation 10 \
    --diversity_seed_depreciation_per_gen 10 \
    --num_generations 5 \
    --mgltools_directory /PATH_TO/mgltools_x86_64Linux2_1.5.6/ \
    --number_of_processors -1 \
    --scoring_choice VINA \
    --LipinskiLenientFilter \
    --start_a_new_run \
    --rxn_library ClickChem \
    --selector_choice Rank_Selector \
    --dock_choice VinaDocking \
    --max_variants_per_compound 5 \
    --redock_elite_from_previous_gen False \
    --generate_plot True \
    --reduce_files_sizes True \
    --use_docked_source_compounds True \
    >  /PATH_TO/OUTPUT/text_file.txt 2>  /PATH_TO/OUTPUT/text_errormessage_file.txt
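
The receptor and box flags in the command above are the only parts that change between targets. As a minimal sketch, the hypothetical helper autogrow_box_args below expands one binding-site entry into those per-axis flags; it assumes the same box size is used along all three axes, as in the command above.

```python
def autogrow_box_args(receptor, center, box_size):
    """Expand a binding site into AutoGrow4 receptor and box flags.

    Illustrative helper, not part of AutoGrow4 or this repository.
    ASSUMPTION: the box is cubic, so one size is reused for x, y, and z.
    """
    x, y, z = center
    size = str(box_size)
    return [
        "--filename_of_receptor", receptor,
        "--center_x", str(x),
        "--center_y", str(y),
        "--center_z", str(z),
        "--size_x", size,
        "--size_y", size,
        "--size_z", size,
    ]


# Placeholder receptor path; center and size come from the 4unn row of the PDB table.
args = autogrow_box_args("/PATH_TO/4unn.pdb", (5.684346153, 18.1917, -7.3715), 15)
print(" ".join(args))
```

The resulting fragment can be placed after python RunAutogrow.py and followed by the remaining flags, which stay the same for every target.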

The models above only generate molecules. To evaluate these molecules with docking and heuristic oracles, use the following command:

python evaluation.py --smiles_path [your path] --pdb [your pdb] --model [model name]

For the remaining models, which are implemented under PMO, we use the following commands to generate molecules. Note that they should be run in the TDC environment:

oracle_array=('1iep_docking' '3eml_docking' '3ny8_docking' '4rlu_docking' '4unn_docking' '5mo4_docking' '7l11_docking')

for oracle in "${oracle_array[@]}"; do
    python -u run.py [model name] --task production --n_runs 1 --max_oracle_calls 1000 --oracles "${oracle}"
done
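
The same loop can be driven from Python, which is convenient when several models are run in sequence. This is a minimal sketch: oracle_names and build_run_command are hypothetical helpers, and graph_ga is used only as an example model name from the method list. The commands are built but not executed.

```python
PDB_IDS = ["1iep", "3eml", "3ny8", "4rlu", "4unn", "5mo4", "7l11"]


def oracle_names(pdb_ids):
    """Map each PDB ID to its docking oracle name, e.g. '1iep_docking'."""
    return [f"{pdb_id}_docking" for pdb_id in pdb_ids]


def build_run_command(model_name, oracle, max_oracle_calls=1000):
    """Assemble one run.py call with the same flags as the shell loop above."""
    return [
        "python", "-u", "run.py", model_name,
        "--task", "production",
        "--n_runs", "1",
        "--max_oracle_calls", str(max_oracle_calls),
        "--oracles", oracle,
    ]


commands = [build_run_command("graph_ga", o) for o in oracle_names(PDB_IDS)]
for command in commands:
    # Replace print with subprocess.run(command, check=True) to execute.
    print(" ".join(command))
```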

After generation, you can use mol_opt_process to convert the generated YAML file to a CSV file and evaluate the heuristic oracles.

To compute statistics of the docking or property scores, you can use the following command:

python results_compare.py --eval_folder_path [your generated result] --pdb_list [your pdb list] --file_type [docking or property] --output_folder [your outdir]
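
For a quick sanity check on a list of scores, summary statistics can also be computed directly. The sketch below is not the implementation of results_compare.py; the function summarize_scores and its output keys are hypothetical. It assumes that lower docking scores are better, as is the convention for Vina-style binding affinities, and the lower_is_better flag can be switched off for property scores where higher is better.

```python
import statistics


def summarize_scores(scores, top_k=10, lower_is_better=True):
    """Return count, mean, standard deviation, best score, and top-k mean.

    Illustrative only. ASSUMPTION: lower docking scores indicate better binding.
    Expects a non-empty list of numbers.
    """
    ordered = sorted(scores, reverse=not lower_is_better)
    top = ordered[:top_k]
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two values.
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "best": ordered[0],
        f"top{top_k}_mean": statistics.mean(top),
    }


# Placeholder values for illustration only; these are NOT benchmark results.
example_scores = [-7.0, -8.0, -9.0, -6.0]
summary = summarize_scores(example_scores, top_k=2)
print(summary)
```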