mol_opt: A Benchmark for Practical Molecular Optimization
This repository hosts an open-source benchmark for Practical Molecular Optimization (PMO), to facilitate the transparent and reproducible evaluation of algorithmic advances in molecular optimization. This repository supports 25 molecular design algorithms on 23 tasks with a particular focus on sample efficiency (oracle calls). The preprint version of the paper is available at https://arxiv.org/pdf/2206.12411.pdf
Installation
conda create -n molopt python=3.7
conda activate molopt
pip install torch
pip install PyTDC
pip install PyYAML
conda install -c rdkit rdkit
We recommend using PyTorch 1.10.2 and PyTDC 0.3.6.
Then activate the conda environment with the following command.
conda activate molopt
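To confirm that the packages installed above are importable in the activated environment, a minimal sketch such as the following may help. The import names (`tdc` for PyTDC, `yaml` for PyYAML) are the usual ones for these packages. Reading `__version__` is an assumption: not every package exposes that attribute, in which case the sketch reports None.

```python
import importlib

# Import names for the packages installed above
# (PyTDC imports as `tdc`, PyYAML imports as `yaml`).
MODULES = ["torch", "tdc", "yaml", "rdkit"]


def module_version(module_name):
    """Return (importable, version); version is None when not exposed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False, None
    return True, getattr(module, "__version__", None)


if __name__ == "__main__":
    for name in MODULES:
        importable, version = module_version(name)
        print(f"{name}: importable={importable}, version={version}")
```

Compare the printed versions with the recommended PyTorch 1.10.2 and PyTDC 0.3.6.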
29 Methods
Based on their ML methodology, the methods are categorized as follows:
- virtual screening
  - screening randomly searches the ZINC database.
  - molpal uses a molecular property predictor to prioritize high-scoring molecules.
- GA (genetic algorithm)
  - graph_ga is based on the molecular graph.
  - smiles_ga is based on SMILES.
  - selfies_ga is based on SELFIES.
  - stoned is based on SELFIES.
  - synnet is based on synthesis.
- VAE (variational auto-encoder)
  - smiles_vae is based on SMILES.
  - selfies_vae is based on SELFIES.
  - jt_vae is based on the junction tree (fragments as building blocks).
  - dog_ae is based on synthesis.
- BO (Bayesian optimization)
  - gpbo
- RL (reinforcement learning)
  - reinvent
  - reinvent_selfies
  - graphinvent
  - moldqn
  - smiles_aug_mem
  - smiles_bar
- HC (hill climbing)
  - smiles_lstm_hc is SMILES-level HC.
  - smiles_ahc is SMILES-level augmented HC.
  - selfies_lstm_hc is SELFIES-level HC.
  - mimosa is graph-level HC.
  - dog_gen is synthesis-based HC.
- gradient (gradient ascent)
  - dst is based on the molecular graph.
  - pasithea is based on SELFIES.
- SBM (score-based modeling)
  - gflownet
  - gflownet_al
  - mars
time is the approximate average wall-clock time for a single run in our benchmark. It does not include the time for pretraining or data preprocessing.
We have already preprocessed the data and pretrained the models. Both are available in the repository.
method | assembly | additional package | time | requires_gpu |
---|---|---|---|---|
screening | - | - | 2 min | no |
molpal | - | ray, tensorflow, ConfigArgParse, pytorch-lightning | 1 hour | no |
graph_ga | fragment | joblib | 3 min | no |
smiles_ga | SMILES | joblib, nltk | 2 min | no |
stoned | SELFIES | - | 3 min | no |
selfies_ga | SELFIES | selfies | 20 min | no |
graph_mcts | atom | - | 2 min | no |
smiles_lstm_hc | SMILES | guacamol | 4 min | no |
smiles_ahc | SMILES | - | 4 min | no |
selfies_lstm_hc | SELFIES | guacamol, selfies | 4 min | yes |
smiles_vae | SMILES | botorch | 20 min | yes |
selfies_vae | SELFIES | botorch, selfies | 20 min | yes |
jt_vae | fragment | botorch | 20 min | yes |
gpbo | fragment | botorch, networkx | 15 min | no |
reinvent | SMILES | pexpect, bokeh | 2 min | yes |
reinvent_selfies | SELFIES | selfies, pexpect, bokeh | 3 min | yes |
smiles_aug_mem | SMILES | reinvent-models==0.0.15rc1 | 2 min | yes |
smiles_bar | SMILES | reinvent-models==0.0.15rc1 | 2 min | yes |
reinvent_selfies | SELFIES | selfies | 3 min | yes |
moldqn | atom | networkx, requests | 60 min | yes |
mimosa | fragment | - | 10 min | yes |
mars | fragment | chemprop, networkx, dgl | 20 min | yes |
dog_gen | synthesis | extra conda | 120 min | yes |
dog_ae | synthesis | extra conda | 50 min | yes |
synnet | synthesis | dgl, pytorch_lightning, networkx, matplotlib | 2-5 hours | yes |
pasithea | SELFIES | selfies, matplotlib | 50 min | yes |
dst | fragment | - | 120 min | no |
gflownet | fragment | torch_{geometric,sparse,cluster}, pdb | 30 min | yes |
gflownet_al | fragment | torch_{geometric,sparse,cluster}, pdb | 30 min | yes |
Run with one-line code
There are three types of runs defined in our code base:
- simple: A single run per oracle for testing purposes. This is the default.
- production: Multiple independent runs with various random seeds for each oracle.
- tune: A hyperparameter tuning run over the search space defined in main/MODEL_NAME/hparam_tune.yaml for each oracle.
## specify multiple random seeds
python run.py MODEL_NAME --seed 0 1 2
## run 5 runs with different random seeds with specific oracle
python run.py MODEL_NAME --task production --n_runs 5 --oracles qed
## run a hyper-parameter tuning starting from smiles in a smi_file, 30 runs in total
python run.py MODEL_NAME --task tune --n_runs 30 --smi_file XX --other_args XX
The available MODEL_NAME values are listed in the table above.
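The flags used in the commands above can be modeled with a small argparse parser. This is a hypothetical sketch inferred only from those commands, not the repository's actual run.py. The defaults (simple task, one run, seed 0, oracle qed) and the rule that production runs use seeds 0 to n_runs - 1 are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("method", help="MODEL_NAME, e.g. graph_ga")
    parser.add_argument("--task", choices=["simple", "production", "tune"],
                        default="simple")
    parser.add_argument("--n_runs", type=int, default=1)
    parser.add_argument("--seed", type=int, nargs="+", default=[0])
    parser.add_argument("--oracles", nargs="+", default=["qed"])
    parser.add_argument("--smi_file", default=None)
    return parser


def plan_runs(args):
    """Return the (oracle, seed) pairs one simple or production call covers."""
    if args.task == "tune":
        # Tuning is driven by the search space, not by a seed list.
        raise ValueError("plan_runs only covers simple and production tasks")
    if args.task == "production":
        # Assumption made for this sketch: seeds are 0 .. n_runs - 1.
        seeds = list(range(args.n_runs))
    else:
        seeds = args.seed
    return [(oracle, seed) for oracle in args.oracles for seed in seeds]


if __name__ == "__main__":
    demo = build_parser().parse_args(["graph_ga", "--seed", "0", "1", "2"])
    print(plan_runs(demo))
```

Under these assumptions, each oracle is paired with every seed, so the number of optimization runs grows as the product of the two lists.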
Multi-Objective Optimization
Multi-objective optimization is implemented in the multiobjective branch. Use "+" to join multiple properties, as in the command below.
python run.py MODEL_NAME --oracles qed+jnk3
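One way to read an oracle specification such as qed+jnk3 is to split it on "+" and combine the per-property scores into a single value. The sketch below assumes an unweighted mean as the combination rule, which may differ from what the multiobjective branch actually does. It uses toy scoring functions in place of the real oracles, and all function names are hypothetical.

```python
from typing import Callable, Dict, List


def parse_oracle_spec(spec: str) -> List[str]:
    """Split a spec such as "qed+jnk3" into its property names."""
    names = [name.strip() for name in spec.split("+")]
    if not all(names):
        raise ValueError(f"empty property name in spec: {spec!r}")
    return names


def make_combined_oracle(
    spec: str, registry: Dict[str, Callable[[str], float]]
) -> Callable[[str], float]:
    """Build one scoring function that averages the named properties.

    Averaging is an assumption made for this sketch.
    """
    scorers = [registry[name] for name in parse_oracle_spec(spec)]

    def combined(smiles: str) -> float:
        return sum(score(smiles) for score in scorers) / len(scorers)

    return combined


# Toy stand-ins, NOT the real qed or jnk3 oracles.
toy_registry = {
    "qed": lambda smiles: min(len(smiles), 20) / 20,
    "jnk3": lambda smiles: smiles.count("N") / max(len(smiles), 1),
}

oracle = make_combined_oracle("qed+jnk3", toy_registry)
print(oracle("CCN"))
```

A weighted sum or a product would fit the same interface. Only the body of `combined` would change.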
Hyperparameters
We separate hyperparameters into task-level control, defined through argparse, and algorithm-level control, defined in hparam_default.yaml. There is no clear boundary between the two, but we recommend keeping all hyperparameters in the self._optimize function as task-level.
- running hyperparameter: parser argument.
- default model hyperparameter: hparam_default.yaml
- tuning model hyperparameter: hparam_tune.yaml
For algorithm-level hyperparameters, we adopt the straightforward YAML file format. Define a default set of hyperparameters in main/MODEL_NAME/hparam_default.yaml:
population_size: 50
offspring_size: 100
mutation_rate: 0.02
patience: 5
max_generations: 1000
Then define the search space for hyperparameter tuning in main/MODEL_NAME/hparam_tune.yaml:
name: graph_ga
method: random
metric:
  goal: maximize
  name: avg_top100
parameters:
  population_size:
    values: [20, 40, 50, 60, 80, 100, 150, 200]
  offspring_size:
    values: [50, 100, 200, 300]
  mutation_rate:
    distribution: uniform
    min: 0
    max: 0.1
  patience:
    value: 5
  max_generations:
    value: 1000
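To illustrate what a random search over this space involves, the sketch below draws one configuration per trial. It handles the three parameter forms used above: a list of values, a uniform distribution with min and max, and a single fixed value. The sampling function is a hypothetical stand-in written for illustration, not the tuner the repository uses.

```python
import random

# The `parameters` block above, written as a Python dict.
SEARCH_SPACE = {
    "population_size": {"values": [20, 40, 50, 60, 80, 100, 150, 200]},
    "offspring_size": {"values": [50, 100, 200, 300]},
    "mutation_rate": {"distribution": "uniform", "min": 0, "max": 0.1},
    "patience": {"value": 5},
    "max_generations": {"value": 1000},
}


def sample_config(space, rng):
    """Draw one hyperparameter configuration from the search space."""
    config = {}
    for name, spec in space.items():
        if "value" in spec:        # fixed parameter
            config[name] = spec["value"]
        elif "values" in spec:     # categorical choice
            config[name] = rng.choice(spec["values"])
        elif spec.get("distribution") == "uniform":
            config[name] = rng.uniform(spec["min"], spec["max"])
        else:
            raise ValueError(f"unsupported spec for {name}: {spec}")
    return config


rng = random.Random(0)  # seeded so the trials are reproducible
trials = [sample_config(SEARCH_SPACE, rng) for _ in range(30)]
print(trials[0])
```

Each trial would then be scored by the metric named in the file (avg_top100, to be maximized), and the best-scoring configuration kept.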
Contribute
Our repository is an open-source initiative. To contribute a better set of parameters or include your model in our benchmark, check our Contribution Guidelines!