# Active Learning in Drug Discovery
## Installation
We recommend creating a conda environment to manage the dependencies (this assumes an existing Anaconda installation). First, clone this repository:
```bash
git clone https://github.com/gitter-lab/active-learning-drug-discovery.git
cd active-learning-drug-discovery
```
Set up the `active_learning_dd` conda environment using the `conda_env.yml` file:

```bash
conda env create -f conda_env.yml
conda activate active_learning_dd
```
If you do not want GPU support, replace `conda_env.yml` with `conda_cpu_env.yml`.
Finally, install `active_learning_dd` with `pip`:

```bash
pip install -e .
```
Now check that the installation is working correctly by running the sample data test:
```bash
cd chtc_runners
python sample_data_runner.py \
    --pipeline_params_json_file=../param_configs/sample_data_config.json \
    --hyperparams_json_file=../param_configs/experiment_PstP_hyperparams/sampled_hyparams/ClusterBasedWCSelector_609.json \
    --iter_max=5 \
    --no-precompute_dissimilarity_matrix \
    --initial_dataset_file=../datasets/sample_data/training_data/iter_0.csv.gz
```
You should see the following final line of output:

```
Finished testing sample dataset. Verified that hashed selection matches stored hash.
```
## datasets
The datasets used in this study are the PriA-SSB target, 107 PubChem BioAssay targets, and the PstP target.
The datasets will be uploaded to Zenodo in the near future.
The repository also contains a small dataset for testing: `datasets/sample_data/`.
## active_learning_dd
The `active_learning_dd` subdirectory contains the main codebase for the iterative batched screening components.
Consult the README in that subdirectory for details.
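As background, iterative batched screening follows the standard retrospective active-learning cycle: train a model on the labeled compounds, select the next batch from the unlabeled pool, look up that batch's labels, and repeat. A minimal sketch of that cycle is below; the function names and signatures are hypothetical placeholders, not this package's API.

```python
def run_screen(pool, train_set, select_batch, train_model, assay,
               iter_max=5, batch_size=96):
    """Retrospective active-learning loop (illustrative sketch only).

    Each iteration: fit a model on the labeled set, pick a batch from
    the unlabeled pool, 'screen' it by looking up stored labels, and
    fold the newly labeled compounds back into the training set.
    """
    for _ in range(iter_max):
        model = train_model(train_set)
        batch = select_batch(model, pool, batch_size)
        labels = assay(batch)  # retrospective label lookup
        train_set = train_set + list(zip(batch, labels))
        pool = [c for c in pool if c not in batch]
    return train_set

# Toy usage: a 10-compound pool, batches of 2, 3 iterations.
screened = run_screen(
    pool=list(range(10)),
    train_set=[],
    select_batch=lambda model, pool, k: pool[:k],  # placeholder strategy
    train_model=lambda labeled: None,              # placeholder model
    assay=lambda batch: [c % 2 for c in batch],    # fake binary labels
    iter_max=3, batch_size=2,
)
```

In a real run, `select_batch` would be one of the strategies listed under Implemented Iterative Strategies, and `assay` a lookup into a pre-screened dataset.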
## param_configs
This subdirectory contains JSON config files for the strategies and experiments used in the thesis document. Consult the README in that subdirectory for details.
## analysis_notebooks
This subdirectory contains Jupyter notebooks that preprocess the datasets, debug methods, analyze the results, and produce result images.
## runner scripts
`chtc_runners/` contains runner scripts for the experiments in the thesis document.
`chtc_runners/simulation_runner.py` can be used as a starting template for your own runner script.
`chtc_runners/simulation_utils.py` contains helper functions for pre- and post-processing iteration selections for retrospective experiments.
Consult the README in that subdirectory for details.
## Implemented Iterative Strategies
The following strategies are currently implemented in `active_learning_dd/next_batch_selector/` (see the thesis document and the hyperparameter examples in `param_configs/`):
- ClusterBasedWeightSelector (CBWS): assigns exploitation-exploration weights to every cluster, splits the budget between exploitation and exploration, then selects compounds from the most exploitable clusters, followed by the most explorable clusters.
- ClusterBasedRandom: randomly samples clusters, then randomly samples compounds from within them.
- InstanceBasedRandom: randomly samples compounds from the pool.
- ClusterBasedDissimilar: samples clusters dissimilarly according to a dissimilarity measure, which is fingerprint-based by default.
- InstanceBasedDissimilar: samples compounds dissimilarly from the pool.
- MABSelector: an Upper-Confidence-Bound (UCB) style approach from Multi-Armed Bandits (MAB). Assigns every cluster an upper-bound estimate of its reward that combines an exploitation term and an exploration term, then samples the clusters with the highest estimates.
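To make the UCB idea behind MABSelector concrete, here is a generic UCB1-style scoring sketch. This is not the repository's implementation; the function name, the inputs, and the exploration constant `c` are all illustrative.

```python
import math

def ucb_scores(hit_counts, trial_counts, total_trials, c=1.0):
    """Generic UCB1-style score per cluster (illustrative sketch).

    The score is the empirical hit rate (exploitation) plus a confidence
    bonus that grows for rarely sampled clusters (exploration).
    """
    scores = []
    for hits, trials in zip(hit_counts, trial_counts):
        if trials == 0:
            scores.append(float("inf"))  # unsampled clusters are tried first
        else:
            exploit = hits / trials                                   # hit rate so far
            explore = c * math.sqrt(2 * math.log(total_trials) / trials)
            scores.append(exploit + explore)
    return scores

# Example: cluster 2 has a higher hit rate and far fewer trials than
# cluster 0, so both terms rank it above cluster 0; cluster 1 has never
# been sampled and so is prioritized first.
scores = ucb_scores(hit_counts=[5, 0, 1], trial_counts=[20, 0, 2], total_trials=22)
```

A batch selector built on these scores would repeatedly pick compounds from the highest-scoring clusters, updating counts as labels arrive.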