MMP_project

Code for the paper "Implications of Additivity and Non-additivity for Machine Learning and Deep Learning Models in Drug Design".

This repository contains code to run hyper-parameter optimization for Random Forest (RF), Support Vector Regression (SVR), XGBoost, and Partial Least Squares (PLS) models. The data is not included in this repository.
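As a rough illustration of what one such optimization job does, here is a minimal sketch using plain Optuna and scikit-learn for the RF case. The actual pipeline uses QPTUNA (see Dependencies below); the file path, target column, and search ranges here are hypothetical.

```python
# Minimal sketch of one hyper-parameter optimization job, assuming plain
# optuna + scikit-learn. The real pipeline uses QPTUNA; file names, the
# target column, and search ranges are hypothetical.
import optuna
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

train = pd.read_csv("data/train.csv")  # hypothetical file name
X, y = train.drop(columns=["activity"]), train["activity"]

def objective(trial):
    # Illustrative RF search space.
    model = RandomForestRegressor(
        n_estimators=trial.suggest_int("n_estimators", 100, 1000),
        max_depth=trial.suggest_int("max_depth", 2, 32),
        max_features=trial.suggest_float("max_features", 0.1, 1.0),
        n_jobs=-1,
    )
    # Mean cross-validated R^2 is the value Optuna maximizes.
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```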

Directories

  • root - Python and shell scripts for running hyper-parameter optimization.
  • notebooks - Jupyter Notebooks for splitting data and computing test scores.
  • data-initial - Initial data, not included in this repo.
  • data - Main data: a random split of the initial data into training and test sets, not included in this repo.
  • downsampled-10-percent - Down-sampled 10% of main data, not included in this repo.
  • optuna-storage - Auxiliary storage for the optuna library to track hyper-parameter optimization progress (see the sketch after this list), not included in this repo.
  • best-models - Models with best hyper-parameters, not included in this repo.
  • pred_values - Predicted vs expected values for models with best hyper-parameters.
  • fill-gaps-configs - Build configurations for the best-found hyper-parameters for "filling gaps" (see paper).
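A persistent storage backend lets a re-submitted job resume a study instead of starting from scratch. Below is a minimal sketch of how the optuna-storage directory can serve as a SQLite backend, assuming plain Optuna; the study and database names are hypothetical.

```python
# Minimal sketch: use optuna-storage as a persistent SQLite backend so an
# interrupted study can be resumed. Study and file names are hypothetical.
import os
import optuna

os.makedirs("optuna-storage", exist_ok=True)

study = optuna.create_study(
    study_name="rf_full_data",  # hypothetical study name
    storage="sqlite:///optuna-storage/rf_full_data.db",
    direction="maximize",
    load_if_exists=True,  # resume recorded progress if the study exists
)
```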

Workflow

  1. First, split the initial data into training and test datasets using a Jupyter notebook (sketched after this list).
  2. Then run all 32 optimization jobs using the script submit_all_to_slurm_on_full_data.sh.
  3. If any of the jobs fails:
    • Prepare the down-sampled data using a Jupyter notebook (also sketched after this list).
    • Re-submit the failed optimization jobs using the down-sampled data.
    • Prepare "fill-gaps" build configurations for the best-found hyper-parameters using a Jupyter notebook.
    • Submit the "fill-gaps" build jobs.
  4. Finally, prepare the summary table using a Jupyter notebook.
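Steps 1 and 3 above involve only standard data manipulation; the following is a minimal sketch assuming pandas and scikit-learn. Directory names follow the repository layout, but the file names, split ratio, and random seed are hypothetical.

```python
# Minimal sketch of workflow step 1 (random train/test split) and the
# down-sampling in step 3. File names, the 80/20 ratio, and seeds are
# hypothetical; directories match the repository layout.
import os
import pandas as pd
from sklearn.model_selection import train_test_split

for d in ("data", "downsampled-10-percent"):
    os.makedirs(d, exist_ok=True)

initial = pd.read_csv("data-initial/initial.csv")  # hypothetical file name

# Step 1: random split of the initial data into train and test sets.
train, test = train_test_split(initial, test_size=0.2, random_state=42)
train.to_csv("data/train.csv", index=False)
test.to_csv("data/test.csv", index=False)

# Step 3: down-sample 10% of the training data for re-submitting jobs
# that failed on the full data.
train_10 = train.sample(frac=0.10, random_state=42)
train_10.to_csv("downsampled-10-percent/train.csv", index=False)
```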

Dependencies

This code uses QPTUNA (built on top of the Optuna library) to set up hyper-parameter optimization.

Optimization jobs are submitted via SLURM, but the underlying scripts can also be run directly without SLURM.
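For example, the same job can be wrapped in sbatch when SLURM is available and run directly otherwise. A minimal sketch, assuming a hypothetical run_optimization.py entry point:

```python
# Minimal sketch: launch one optimization job with or without SLURM.
# run_optimization.py and its flag are hypothetical stand-ins for the
# repository's actual Python scripts.
import shutil
import subprocess

cmd = ["python", "run_optimization.py", "--algorithm", "RF"]

if shutil.which("sbatch"):
    # SLURM available: wrap the command in a batch submission.
    subprocess.run(["sbatch", "--wrap", " ".join(cmd)], check=True)
else:
    # No SLURM: run the job directly.
    subprocess.run(cmd, check=True)
```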

License

Apache 2.0.
