Source code for the enhanced GEGL framework based on Guiding Deep Molecular Optimization with Genetic Exploration. GEGL is a powerful generative framework combining reinforcement learning, genetic algorithms, and deep learning. This repo is heavily inspired by the original GEGL source code. This package contains the following (non-exhaustive) enhancements to the original work:
- Possibility to use a Transformer instead of an LSTM model for the neural apprentice.
- Possibility to use a genetic expert based on the SELFIES chemical-path crossover proposed in STONED (see the recombination sketch after this list).
- Addition of explainer models (GCN, D-MPNN) that generate attributions for generated molecules using the CAM method.
- Fragment-based genetic expert which builds a fragment library from the explainer model and recombines the fragments in SELFIES space to propose new molecules.
- Support for multiple genetic experts sharing an allotted sampling budget. The query size for each expert can be either fixed or dynamically recomputed at each optimization step (see the allocation sketch below).
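To illustrate why SELFIES is a convenient space for genetic operators, here is a minimal, self-contained recombination example. Note that this is a simple one-point token crossover for illustration only, not the full chemical-path crossover from STONED; the `selfies` package is the only dependency, and the two input SMILES are arbitrary kekulized examples:

```python
# Minimal SELFIES-space recombination sketch (NOT the STONED chemical-path
# crossover): any concatenation of SELFIES tokens decodes to a valid molecule,
# which is what makes genetic operators in this space attractive.
import random
import selfies as sf

def selfies_crossover(smiles_a: str, smiles_b: str) -> str:
    tokens_a = list(sf.split_selfies(sf.encoder(smiles_a)))
    tokens_b = list(sf.split_selfies(sf.encoder(smiles_b)))
    cut_a = random.randint(1, len(tokens_a) - 1)  # one-point cut in parent A
    cut_b = random.randint(1, len(tokens_b) - 1)  # one-point cut in parent B
    child = "".join(tokens_a[:cut_a] + tokens_b[cut_b:])
    return sf.decoder(child)  # SELFIES guarantees a decodable molecule

# Kekulized SMILES (recent selfies versions reject aromatic lowercase symbols)
print(selfies_crossover("C1=CC=CC=C1", "CC(=O)OC1=CC=CC=C1C(=O)O"))
```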
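For the shared sampling budget, one way the dynamic query-size recomputation could work is to reallocate proportionally to each expert's recent performance. The rule below is a hypothetical sketch, not the repo's actual strategy; the function name and the proportional-with-floor heuristic are assumptions:

```python
# Hypothetical budget-splitting sketch: each expert's query size is recomputed
# from the mean score its offspring achieved at the previous step. Assumes at
# least one expert has a positive score.
from typing import Dict

def allocate_query_sizes(total_budget: int,
                         recent_scores: Dict[str, float]) -> Dict[str, int]:
    floor = max(1, total_budget // (10 * len(recent_scores)))  # keep all experts alive
    weight_sum = sum(recent_scores.values())
    sizes = {
        name: max(floor, int(total_budget * score / weight_sum))
        for name, score in recent_scores.items()
    }
    # Trim any rounding overshoot from the largest allocation
    overshoot = sum(sizes.values()) - total_budget
    if overshoot > 0:
        sizes[max(sizes, key=sizes.get)] -= overshoot
    return sizes

print(allocate_query_sizes(8192, {"SELFIES": 0.62, "SMILES": 0.55, "ATTR": 0.71}))
```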
The library can be installed via the following commands:
```bash
make create_env
conda activate egegl
make install_lib
```
Datasets (ZINC/Guacamol) and pretrained models (LSTM on ZINC/Guacamol) can be easily obtained via:
```bash
make dl_zinc
make dl_guacamol
make dl_zinc_pretrain
make dl_guacamol_pretrain
```
Alternatively, models can be pretrained on the downloaded datasets using the pretraining script:
```bash
CUDA_VISIBLE_DEVICES=<gpu_id> python scripts/run_pretrain.py --dataset zinc --use_cuda --model_type LSTM
```
Please see scripts/run_pretrain.py for further available options.
Optimizations can be run either via the scripts/run_optimization.py script or through customized code inspired by it (a toy sketch of the underlying loop follows the argument list below). If you would like to use the premade script, the following arguments can currently be passed:
- `save_root` (str): Where to save the final neural apprentice and the explainer model if one is used. Defaults to `./results/`.
- `benchmark_id` (int): The ID of the benchmark task. See `egegl/scoring/benchmarks.py` for the list of available benchmarks. Defaults to `28`, which corresponds to the pLogP (penalized logP) task.
- `dataset` (str): Dataset to consider for the task. Must be either `zinc` or `guacamol`. Defaults to `zinc`.
- `dataset_path` (str): Path to the dataset. Defaults to `./data/datasets/zinc/all.txt`.
- `model_type` (str): Model type for the neural apprentice. Must be either `LSTM` or `Transformer`. Defaults to `LSTM`.
- `explainer_type` (str): Model type for the explainer model. Must be either `GCN` or `DMPNN`. Defaults to `GCN`.
- `max_smiles_length` (int): Maximum length for the generated SMILES. Defaults to `100`.
- `apprentice_load_dir` (str): Path to the pretrained neural apprentice. Defaults to `./data/pretrained_models/original_benchmarks/zinc`.
- `explainer_load_dir` (str): Path to the pretrained explainer model. Defaults to `None`.
- `genetic_experts` (List[str]): List of genetic experts to be used. Defaults to `["SELFIES", "SMILES", "ATTR"]`.
- `logger_type` (str): Logger type for the run. Can be either `Neptune` or `CommandLine`. If you would like to use the Neptune logger, please ensure that you have a Neptune account and have set the credentials correctly in the script first.
- `project_qualified_name` (str): For the Neptune logger only, sets the project name for Neptune logging.
- `learning_rate` (float): Learning rate for the apprentice and explainer model. Defaults to `1e-3`.
- `mutation_initial_rate` (float): Initial mutation rate for the genetic experts. Defaults to `1e-2`.
- `num_steps` (int): Number of optimization steps. Defaults to `200`.
- `num_keep` (int): Number of molecules to keep in each priority queue. Defaults to `1024`.
- `max_sampling_batch_size` (int): Maximum batch size during sampling of the neural apprentice. Defaults to `1024`.
- `apprentice_sampling_batch_size` (int): Number of molecules sampled by the apprentice at each round. Defaults to `8192`.
- `expert_sampling_batch_size` (int): Number of molecules sampled by the experts at each round. Defaults to `8192`.
- `apprentice_training_batch_size` (int): Batch size of the data during imitation training of the apprentice. Defaults to `256`.
- `apprentice_training_steps` (int): Number of training steps of the apprentice during imitation training. Defaults to `8`.
- `num_jobs` (int): Number of parallel jobs during expert sampling.
- `record_filtered` (bool): Activates post-hoc filtering of the molecules as described in the original paper. Defaults to `False`.
- `use_cuda` (bool): Activates optimization on the specified CUDA device.
- `use_frozen_explainer` (bool): Whether to freeze the explainer during optimization. Defaults to `False`.
- `seed` (int): Sets the random seed for certain libraries. Defaults to `404`.
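For customized runs, the following toy, self-contained sketch shows the loop structure that the optimization script implements: the apprentice samples candidates, the best are kept in max-reward priority queues, genetic experts mutate high scorers, and the apprentice is then retrained by imitation. Every name and component below is a dummy stand-in (strings instead of molecules, a trivial reward), not the repo's actual API:

```python
# Toy sketch of the GEGL loop structure; all components are illustrative dummies.
import heapq
import random

ALPHABET = "CNOF"

def score(mol: str) -> float:           # dummy reward: fraction of carbons
    return mol.count("C") / len(mol)

def apprentice_sample(n: int) -> list:  # stands in for neural sampling
    return ["".join(random.choices(ALPHABET, k=10)) for _ in range(n)]

def expert_mutate(mol: str) -> str:     # stands in for a genetic expert
    i = random.randrange(len(mol))
    return mol[:i] + random.choice(ALPHABET) + mol[i + 1:]

def push(queue: list, mols: list, num_keep: int) -> None:
    for mol in mols:
        heapq.heappush(queue, (score(mol), mol))
        if len(queue) > num_keep:
            heapq.heappop(queue)        # drop the lowest-scoring entry

apprentice_queue, expert_queue, num_keep = [], [], 64
for step in range(20):
    # 1) The apprentice proposes molecules; the best are kept.
    push(apprentice_queue, apprentice_sample(128), num_keep)
    # 2) Experts mutate high-scoring molecules from the apprentice queue.
    parents = [mol for _, mol in heapq.nlargest(32, apprentice_queue)]
    push(expert_queue, [expert_mutate(p) for p in parents], num_keep)
    # 3) The apprentice would now be trained by imitation on the union of
    #    both queues; the dummy apprentice here has nothing to train.
    best = max(apprentice_queue + expert_queue)
    print(f"step {step:2d} best score {best[0]:.2f}")
```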
An example command to launch optimization can be:
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_optimization.py \
    --model_type LSTM --benchmark_id 28 --use_cuda \
    --dataset zinc \
    --apprentice_load_dir ./data/pretrained_models/original_benchmarks/zinc \
    --max_smiles_length 81 --num_jobs 8 --genetic_expert SMILES
```
Code was tested on
- Python >= 3.6.13
- torch == 1.8.1
- cuda == 10.1
- rdkit == 2020.09
Copyright (c) 2021 Elix, Inc.
This source code may not be used commercially, but can be used freely otherwise. Please refer to the included LICENSE.txt file for more details.