Virtual screening on PriA-SSB and RMI-FANCM with the LifeChem library

Citation

If you use this software or the new high-throughput screening data, please cite:

Shengchao Liu⁺, Moayad Alnammi⁺, Spencer S. Ericksen, Andrew F. Voter, Gene E. Ananiev, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Practical model selection for prospective virtual screening. Journal of Chemical Information and Modeling 2018.

⁺ denotes co-first authors.

Installation

We recommend creating a conda environment to manage the dependencies. First install Anaconda if it is not already installed. Then, clone this pria_lifechem repository:

git clone https://github.com/gitter-lab/pria_lifechem.git
cd pria_lifechem

Create and activate a conda environment named pria using the conda_env.yml file:

conda env create -f conda_env.yml
source activate pria

Finally, install pria_lifechem with pip:

pip install -e .

To use the package again later, run source activate pria to re-activate the conda environment (with conda 4.4 and later, conda activate pria also works). The package is currently supported only on Linux. The provided conda environment does not include a Theano GPU backend; to use Theano with a GPU, see the Theano guide.

The IRV models were trained using a customized fork of DeepChem. See the separate installation instructions in that repository.

Note: The random forest results in the paper were obtained using Python 3.4 and scikit-learn 0.18.1. The random forest code is still compatible with conda_env.yml, but the results may differ because the package versions differ.

dataset

The dataset subdirectory contains a description of the expected file format and an example dataset that has been split into five folds.
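As a rough illustration of how a five-fold split can be used, the sketch below holds out each fold once and groups the remaining four for training. The function name and the fold file names are hypothetical, and this rotation is a generic scheme rather than the exact procedure used in the paper; see the dataset subdirectory for the real file names and format.

```python
def fold_rotations(fold_paths):
    """Yield (train_paths, held_out_path) pairs, holding out each fold once."""
    for i, held_out in enumerate(fold_paths):
        train = fold_paths[:i] + fold_paths[i + 1:]
        yield train, held_out


# Hypothetical file names, used only for illustration.
folds = ["fold_{}.csv".format(i) for i in range(5)]
rotations = list(fold_rotations(folds))
for train, held_out in rotations:
    print("train:", train, "held out:", held_out)
```

Keeping the split at the level of file paths means the same rotation can be reused regardless of how the individual fold files are loaded.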

The complete high-throughput screening data are available on PubChem (AID:1272365 and AID:1159607). Pre-processed, merged versions of the data are available on Zenodo (doi:10.5281/zenodo.1411506). The Zenodo files are:

pria_rmi_cv.tar.gz: The LifeChem compounds used for cross-validation with PriA-SSB and RMI-FANCM, split into five folds.
pria_rmi_pcba_cv.tar.gz: The same compounds merged with 128 tasks from PubChem, split into five folds.
pria_prospective.csv.gz: The separate LifeChem compounds used for prospective testing with PriA-SSB.
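After downloading a file from Zenodo, a quick sanity check is to list its columns and count its rows. The following is a minimal sketch using only the Python 3 standard library. It assumes the compressed CSV has a header row and makes no assumption about the column names; the helper name summarize_csv_gz is hypothetical and is not part of pria_lifechem.

```python
import csv
import gzip


def summarize_csv_gz(path):
    """Return (column_names, row_count) for a gzip-compressed CSV file.

    Assumes the first line of the file is a header row.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, calling summarize_csv_gz("pria_prospective.csv.gz") from the directory containing the downloaded file returns the list of column names and the number of data rows.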

pria_lifechem

The pria_lifechem subdirectory contains:

scripts to prepare and load datasets
a script to evaluate trained models
a models subdirectory with code and instructions for training models
an analysis subdirectory to reproduce figures from the manuscript

json

The json subdirectory contains JSON configuration files with the model hyperparameters.
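A configuration file of this kind can be inspected with the standard json module. The sketch below assumes that the top level of each file is a JSON object; the helper name load_hyperparameters is hypothetical, and the actual keys depend on the model, so none are assumed here.

```python
import json


def load_hyperparameters(path):
    """Load a JSON config file and return its contents as a dict."""
    with open(path) as handle:
        config = json.load(handle)
    # Assumption: the config is a JSON object at the top level.
    if not isinstance(config, dict):
        raise ValueError("expected a JSON object at the top level of " + path)
    return config
```

Printing sorted(config) on the returned dict is a quick way to see which hyperparameters a given file defines.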

output

The output subdirectory contains scripts for post-processing the output files.