# Active Learning in Drug Discovery
## Installation
We recommend creating a conda environment to manage the dependencies (this assumes an existing Anaconda installation). First, clone this repository:
```bash
git clone https://github.com/gitter-lab/active-learning-drug-discovery.git
cd active-learning-drug-discovery
```
Set up the `active_learning_dd` conda environment using the `conda_env.yml` file:

```bash
conda env create -f conda_env.yml
conda activate active_learning_dd
```
If you do not want GPU support, replace `conda_env.yml` with `conda_cpu_env.yml`.
Finally, install `active_learning_dd` with `pip`:

```bash
pip install -e .
```
Now check that the installation is working correctly by running the sample data test:
```bash
cd chtc_runners
python sample_data_runner.py \
    --pipeline_params_json_file=../param_configs/sample_data_config.json \
    --hyperparams_json_file=../param_configs/experiment_PstP_hyperparams/sampled_hyparams/ClusterBasedWCSelector_609.json \
    --iter_max=5 \
    --no-precompute_dissimilarity_matrix \
    --initial_dataset_file=../datasets/sample_data/training_data/iter_0.csv.gz
```
You should see the following final line of output:

```
Finished testing sample dataset. Verified that hashed selection matches stored hash.
```
## datasets
The datasets used in this study are the PriA-SSB target, 107 PubChem BioAssay targets, and the PstP target.
The datasets will be uploaded to Zenodo in the near future.
The repository also contains a small dataset for testing: `datasets/sample_data/`.
## active_learning_dd
The `active_learning_dd` subdirectory contains the main codebase for the iterative batched screening components.
Consult the README in that subdirectory for details.
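As background, iterative batched screening follows the standard retrospective active-learning cycle: train a model on the labeled compounds, select the next batch from the unlabeled pool, look up that batch's labels, and repeat. A minimal sketch of that cycle is below; the function names and signatures are hypothetical placeholders, not this package's API.

```python
def run_screen(pool, train_set, select_batch, train_model, assay,
               iter_max=5, batch_size=96):
    """Retrospective active-learning loop (illustrative sketch only).

    Each iteration: fit a model on the labeled set, pick a batch from
    the unlabeled pool, 'screen' it by looking up stored labels, and
    fold the newly labeled compounds back into the training set.
    """
    for _ in range(iter_max):
        model = train_model(train_set)
        batch = select_batch(model, pool, batch_size)
        labels = assay(batch)  # retrospective label lookup
        train_set = train_set + list(zip(batch, labels))
        pool = [c for c in pool if c not in batch]
    return train_set

# Toy usage: a 10-compound pool, batches of 2, 3 iterations.
screened = run_screen(
    pool=list(range(10)),
    train_set=[],
    select_batch=lambda model, pool, k: pool[:k],  # placeholder strategy
    train_model=lambda labeled: None,              # placeholder model
    assay=lambda batch: [c % 2 for c in batch],    # fake binary labels
    iter_max=3, batch_size=2,
)
```

In a real run, `select_batch` would be one of the strategies listed under Implemented Iterative Strategies, and `assay` a lookup into a pre-screened dataset.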
## param_configs
This subdirectory contains JSON config files for the strategies and experiments used in the thesis document. Consult the README in that subdirectory for details.
## analysis_notebooks
This subdirectory contains Jupyter notebooks that preprocess the datasets, debug methods, analyze the results, and produce result images.
## runner scripts
`chtc_runners/` contains runner scripts for the experiments in the thesis document.
`chtc_runners/simulation_runner.py` can be used as a starting template for your own runner script.
`chtc_runners/simulation_utils.py` contains helper functions for pre- and post-processing iteration selections for retrospective experiments.
Consult the README in that subdirectory for details.
## Implemented Iterative Strategies
The following strategies are currently implemented in `active_learning_dd/next_batch_selector/` (see the thesis document and the hyperparameter examples in `param_configs/`):
- ClusterBasedWeightSelector (CBWS): assigns exploitation-exploration weights to every cluster, splits the budget between exploitation and exploration, then selects compounds from the most exploitable clusters, followed by the most explorable clusters.
- ClusterBasedRandom: randomly samples clusters, then randomly samples compounds from within them.
- InstanceBasedRandom: randomly samples compounds from the pool.
- ClusterBasedDissimilar: samples clusters dissimilarly according to a dissimilarity measure, which is fingerprint-based by default.
- InstanceBasedDissimilar: samples compounds dissimilarly from the pool.
- MABSelector: an Upper-Confidence-Bound (UCB) style approach from Multi-Armed Bandits (MAB). Assigns every cluster an upper-bound estimate of its reward that combines an exploitation term and an exploration term, then samples the clusters with the highest estimates.
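To make the UCB idea behind MABSelector concrete, here is a generic UCB1-style scoring sketch. This is not the repository's implementation; the function name, the inputs, and the exploration constant `c` are all illustrative.

```python
import math

def ucb_scores(hit_counts, trial_counts, total_trials, c=1.0):
    """Generic UCB1-style score per cluster (illustrative sketch).

    The score is the empirical hit rate (exploitation) plus a confidence
    bonus that grows for rarely sampled clusters (exploration).
    """
    scores = []
    for hits, trials in zip(hit_counts, trial_counts):
        if trials == 0:
            scores.append(float("inf"))  # unsampled clusters are tried first
        else:
            exploit = hits / trials                                   # hit rate so far
            explore = c * math.sqrt(2 * math.log(total_trials) / trials)
            scores.append(exploit + explore)
    return scores

# Example: cluster 2 has a higher hit rate and far fewer trials than
# cluster 0, so both terms rank it above cluster 0; cluster 1 has never
# been sampled and so is prioritized first.
scores = ucb_scores(hit_counts=[5, 0, 1], trial_counts=[20, 0, 2], total_trials=22)
```

A batch selector built on these scores would repeatedly pick compounds from the highest-scoring clusters, updating counts as labels arrive.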