DeepChem
DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, and quantum chemistry. DeepChem is a package developed by the Pande group at Stanford and originally created by Bharath Ramsundar.
Table of contents:
- Requirements
- Installation from Source
- FAQ
- Getting Started
- Contributing to DeepChem
- DeepChem Publications
- Examples
- About Us
Requirements
Installation
Installation from source is the only currently supported format. deepchem
currently supports both Python 2.7 and Python 3.5, but is not supported on any OS'es except 64 bit linux. Please make sure you follow the directions below precisely. While you may already have system versions of some of these packages, there is no guarantee that deepchem
will work with alternate versions than those specified below.
Full Anaconda distribution
-
Download the 64-bit Python 2.7 or Python 3.5 versions of Anaconda for linux here.
Follow the installation instructions
-
openbabel
conda install -c omnia openbabel=2.4.0
-
rdkit
conda install -c rdkit rdkit
-
joblib
conda install joblib
-
six
pip install six
-
mdtraj
conda install -c omnia mdtraj
-
tensorflow
: Installingtensorflow
on older versions of Linux (which have glibc < 2.17) can be very challenging. For these older Linux versions, contact your local sysadmin to work out a custom installation. If your version of Linux is recent, then the following command will work:pip install tensorflow-gpu==0.12.1
-
deepchem
: Clone thedeepchem
github repo:git clone https://github.com/deepchem/deepchem.git
cd
into thedeepchem
directory and executepython setup.py install
-
To run test suite, install
nosetests
:pip install nose
Make sure that the correct version of
nosetests
is active by runningwhich nosetests
You might need to uninstall a system install of
nosetests
if there is a conflict. -
If installation has been successful, all tests in test suite should pass:
nosetests -v deepchem --nologcapture
Note that the full test-suite uses up a fair amount of memory. Try running tests for one submodule at a time if memory proves an issue.
Using a conda environment
Alternatively, you can install deepchem in a new conda environment using the conda commands in scripts/install_deepchem_conda.sh
bash scripts/install_deepchem_conda.sh deepchem
pip install tensorflow-gpu==0.12.1 # If you want GPU support
git clone https://github.com/deepchem/deepchem.git # Clone deepchem source code from GitHub
cd deepchem
python setup.py install # Manual install
nosetests -v deepchem --nologcapture # Run tests
This creates a new conda environment deepchem
and installs in it the dependencies that
are needed. To access it, use the source activate deepchem
command.
Check this link for more information about
the benefits and usage of conda environments. Warning: Segmentation faults can still happen
via this installation procedure.
FAQ
-
Question: I'm seeing some failures in my test suite having to do with MKL
Intel MKL FATAL ERROR: Cannot load libmkl_avx.so or libmkl_def.so.
Answer: This is a general issue with the newest version of
scikit-learn
enabling MKL by default. This doesn't play well with many linux systems. See BVLC/caffe#3884 for discussions. The following seems to fix the issueconda install nomkl numpy scipy scikit-learn numexpr conda remove mkl mkl-service
-
Question: The test suite is core-dumping for me. What's up?
[rbharath]$ nosetests -v deepchem --nologcapture Illegal instruction (core dumped)
Answer: This is often due to
openbabel
issues on older linux systems. Openipython
and run the followingIn [1]: import openbabel as ob
If you see a core-dump, then it's a sign there's an issue with your
openbabel
install. Try reinstallingopenbabel
from source for your machine.
Getting Started
The first step to getting started is looking at the examples in the examples/
directory. Try running some of these examples on your system and verify that the models train successfully. Afterwards, to apply deepchem
to a new problem, try starting from one of the existing examples and modifying it step by step to work with your new use-case.
Input Formats
Accepted input formats for deepchem include csv, pkl.gz, and sdf files. For example, with a csv input, in order to build models, we expect the following columns to have entries for each row in the csv file.
- A column containing SMILES strings [1].
- A column containing an experimental measurement.
- (Optional) A column containing a unique compound identifier.
Here's an example of a potential input file.
Compound ID | measured log solubility in mols per litre | smiles |
---|---|---|
benzothiazole | -1.5 | c2ccc1scnc1c2 |
Here the "smiles" column contains the SMILES string, the "measured log solubility in mols per litre" contains the experimental measurement and "Compound ID" contains the unique compound identifier.
[2] Anderson, Eric, Gilman D. Veith, and David Weininger. "SMILES, a line notation and computerized interpreter for chemical structures." US Environmental Protection Agency, Environmental Research Laboratory, 1987.
Data Featurization
Most machine learning algorithms require that input data form vectors.
However, input data for drug-discovery datasets routinely come in the
format of lists of molecules and associated experimental readouts. To
transform lists of molecules into vectors, we need to subclasses of DeepChem
loader class dc.data.DataLoader
such as dc.data.CSVLoader
or
dc.data.SDFLoader
. Users can subclass dc.data.DataLoader
to
load arbitrary file formats. All loaders must be
passed a dc.feat.Featurizer
object. DeepChem provides a number of
different subclasses of dc.feat.Featurizer
for convenience.
Performances
- Classification
Index splitting
Dataset | Model | Train score/ROC-AUC | Valid score/ROC-AUC |
---|---|---|---|
tox21 | logistic regression | 0.903 | 0.705 |
Random Forest | 0.999 | 0.733 | |
IRV | 0.811 | 0.767 | |
Multitask network | 0.856 | 0.763 | |
robust MT-NN | 0.857 | 0.767 | |
graph convolution | 0.872 | 0.798 | |
muv | logistic regression | 0.963 | 0.766 |
Multitask network | 0.904 | 0.764 | |
robust MT-NN | 0.934 | 0.781 | |
graph convolution | 0.840 | 0.823 | |
pcba | logistic regression | 0.809 | 0.776 |
Multitask network | 0.826 | 0.802 | |
robust MT-NN | 0.809 | 0.783 | |
graph convolution | 0.876 | 0.852 | |
sider | logistic regression | 0.933 | 0.620 |
Random Forest | 0.999 | 0.670 | |
IRV | 0.649 | 0.642 | |
Multitask network | 0.775 | 0.634 | |
robust MT-NN | 0.803 | 0.632 | |
graph convolution | 0.708 | 0.594 | |
toxcast | logistic regression | 0.721 | 0.575 |
Multitask network | 0.830 | 0.678 | |
robust MT-NN | 0.825 | 0.680 | |
graph convolution | 0.821 | 0.720 | |
clintox | logistic regression | 0.967 | 0.676 |
Random Forest | 0.995 | 0.776 | |
IRV | 0.763 | 0.814 | |
Multitask network | 0.934 | 0.830 | |
robust MT-NN | 0.949 | 0.827 | |
graph convolution | 0.946 | 0.860 | |
hiv | logistic regression | 0.864 | 0.739 |
Random Forest | 0.999 | 0.720 | |
IRV | 0.841 | 0.724 | |
Multitask network | 0.761 | 0.652 | |
robust MT-NN | 0.780 | 0.708 | |
graph convolution | 0.876 | 0.779 |
Random splitting
Dataset | Model | Train score/ROC-AUC | Valid score/ROC-AUC |
---|---|---|---|
tox21 | logistic regression | 0.902 | 0.715 |
Random Forest | 0.999 | 0.764 | |
IRV | 0.808 | 0.767 | |
Multitask network | 0.844 | 0.795 | |
robust MT-NN | 0.855 | 0.773 | |
graph convolution | 0.865 | 0.827 | |
muv | logistic regression | 0.957 | 0.719 |
Multitask network | 0.902 | 0.734 | |
robust MT-NN | 0.933 | 0.732 | |
graph convolution | 0.860 | 0.730 | |
pcba | logistic regression | 0.808 | 0.776 |
Multitask network | 0.811 | 0.778 | |
robust MT-NN | 0.811 | 0.771 | |
graph convolution | 0.872 | 0.844 | |
sider | logistic regression | 0.929 | 0.656 |
Random Forest | 0.999 | 0.665 | |
IRV | 0.648 | 0.596 | |
Multitask network | 0.777 | 0.655 | |
robust MT-NN | 0.804 | 0.630 | |
graph convolution | 0.705 | 0.618 | |
toxcast | logistic regression | 0.725 | 0.586 |
Multitask network | 0.836 | 0.684 | |
robust MT-NN | 0.822 | 0.681 | |
graph convolution | 0.820 | 0.717 | |
clintox | logistic regression | 0.972 | 0.725 |
Random Forest | 0.997 | 0.670 | |
IRV | 0.809 | 0.846 | |
Multitask network | 0.951 | 0.834 | |
robust MT-NN | 0.959 | 0.830 | |
graph convolution | 0.975 | 0.876 | |
hiv | logistic regression | 0.860 | 0.806 |
Random Forest | 0.999 | 0.850 | |
IRV | 0.839 | 0.809 | |
Multitask network | 0.742 | 0.715 | |
robust MT-NN | 0.753 | 0.727 | |
graph convolution | 0.847 | 0.803 |
Scaffold splitting
Dataset | Model | Train score/ROC-AUC | Valid score/ROC-AUC |
---|---|---|---|
tox21 | logistic regression | 0.900 | 0.650 |
Random Forest | 0.999 | 0.629 | |
IRV | 0.823 | 0.708 | |
Multitask network | 0.863 | 0.703 | |
robust MT-NN | 0.861 | 0.710 | |
graph convolution | 0.885 | 0.732 | |
muv | logistic regression | 0.947 | 0.767 |
Multitask network | 0.899 | 0.762 | |
robust MT-NN | 0.944 | 0.726 | |
graph convolution | 0.872 | 0.795 | |
pcba | logistic regression | 0.810 | 0.742 |
Multitask network | 0.814 | 0.760 | |
robust MT-NN | 0.812 | 0.756 | |
graph convolution | 0.874 | 0.817 | |
sider | logistic regression | 0.926 | 0.592 |
Random Forest | 0.999 | 0.619 | |
IRV | 0.639 | 0.599 | |
Multitask network | 0.776 | 0.557 | |
robust MT-NN | 0.797 | 0.560 | |
graph convolution | 0.722 | 0.583 | |
toxcast | logistic regression | 0.716 | 0.492 |
Multitask network | 0.828 | 0.617 | |
robust MT-NN | 0.830 | 0.614 | |
graph convolution | 0.832 | 0.638 | |
clintox | logistic regression | 0.960 | 0.803 |
Random Forest | 0.993 | 0.735 | |
IRV | 0.793 | 0.718 | |
Multitask network | 0.947 | 0.862 | |
robust MT-NN | 0.953 | 0.890 | |
graph convolution | 0.957 | 0.823 | |
hiv | logistic regression | 0.858 | 0.798 |
Random Forest | 0.946 | 0.562 | |
IRV | 0.847 | 0.811 | |
Multitask network | 0.775 | 0.765 | |
robust MT-NN | 0.785 | 0.748 | |
graph convolution | 0.867 | 0.769 |
- Regression
Dataset | Model | Splitting | Train score/R2 | Valid score/R2 |
---|---|---|---|---|
delaney | Random Forest | Index | 0.953 | 0.626 |
NN regression | Index | 0.868 | 0.578 | |
graphconv regression | Index | 0.967 | 0.790 | |
Random Forest | Random | 0.951 | 0.684 | |
NN regression | Random | 0.865 | 0.574 | |
graphconv regression | Random | 0.964 | 0.782 | |
Random Forest | Scaffold | 0.953 | 0.284 | |
NN regression | Scaffold | 0.866 | 0.342 | |
graphconv regression | Scaffold | 0.967 | 0.606 | |
sampl | Random Forest | Index | 0.968 | 0.736 |
NN regression | Index | 0.917 | 0.764 | |
graphconv regression | Index | 0.982 | 0.864 | |
Random Forest | Random | 0.967 | 0.752 | |
NN regression | Random | 0.908 | 0.830 | |
graphconv regression | Random | 0.987 | 0.868 | |
Random Forest | Scaffold | 0.966 | 0.473 | |
NN regression | Scaffold | 0.891 | 0.217 | |
graphconv regression | Scaffold | 0.985 | 0.666 | |
nci | NN regression | Index | 0.171 | 0.062 |
graphconv regression | Index | 0.123 | 0.048 | |
NN regression | Random | 0.168 | 0.085 | |
graphconv regression | Random | 0.117 | 0.076 | |
NN regression | Scaffold | 0.180 | 0.052 | |
graphconv regression | Scaffold | 0.131 | 0.046 | |
pdbbind(core) | Random Forest | Random | 0.969 | 0.445 |
NN regression | Random | 0.973 | 0.494 | |
pdbbind(refined) | Random Forest | Random | 0.963 | 0.511 |
NN regression | Random | 0.987 | 0.503 | |
pdbbind(full) | Random Forest | Random | 0.965 | 0.493 |
NN regression | Random | 0.983 | 0.528 | |
chembl | MT-NN regression | Index | 0.443 | 0.427 |
MT-NN regression | Random | 0.464 | 0.434 | |
MT-NN regression | Scaffold | 0.484 | 0.361 | |
qm7 | NN regression | Index | 0.997 | 0.986 |
NN regression | Random | 0.999 | 0.999 | |
NN regression | Stratified | 0.999 | 0.999 | |
qm7b | MT-NN regression | Index | 0.931 | 0.803 |
MT-NN regression | Random | 0.923 | 0.884 | |
MT-NN regression | Stratified | 0.934 | 0.884 | |
kaggle | MT-NN regression | User-defined | 0.748 | 0.452 |
Dataset | Model | Splitting | Train score/MAE(kcal/mol) | Valid score/MAE(kcal/mol) |
---|---|---|---|---|
qm7 | NN regression | Index | 11.0 | 12.0 |
NN regression | Random | 7.12 | 7.53 | |
NN regression | Stratified | 6.61 | 7.34 |
- General features
Number of tasks and examples in the datasets
Dataset | N(tasks) | N(samples) |
---|---|---|
tox21 | 12 | 8014 |
muv | 17 | 93127 |
pcba | 128 | 439863 |
sider | 27 | 1427 |
toxcast | 617 | 8615 |
clintox | 2 | 1491 |
hiv | 1 | 41913 |
delaney | 1 | 1128 |
sampl | 1 | 643 |
kaggle | 15 | 173065 |
nci | 60 | 19127 |
pdbbind(core) | 1 | 195 |
pdbbind(refined) | 1 | 3706 |
pdbbind(full) | 1 | 11908 |
chembl(5thresh) | 691 | 23871 |
qm7 | 1 | 7165 |
qm7b | 14 | 7211 |
Time needed for benchmark test(~20h in total)
Dataset | Model | Time(loading)/s | Time(running)/s |
---|---|---|---|
tox21 | logistic regression | 30 | 60 |
Multitask network | 30 | 60 | |
robust MT-NN | 30 | 90 | |
random forest | 30 | 6000 | |
IRV | 30 | 650 | |
graph convolution | 40 | 160 | |
muv | logistic regression | 600 | 450 |
Multitask network | 600 | 400 | |
robust MT-NN | 600 | 550 | |
graph convolution | 800 | 1800 | |
pcba | logistic regression | 1800 | 10000 |
Multitask network | 1800 | 9000 | |
robust MT-NN | 1800 | 14000 | |
graph convolution | 2200 | 14000 | |
sider | logistic regression | 15 | 80 |
Multitask network | 15 | 75 | |
robust MT-NN | 15 | 150 | |
random forest | 15 | 2200 | |
IRV | 15 | 150 | |
graph convolution | 20 | 50 | |
toxcast | logistic regression | 80 | 2600 |
Multitask network | 80 | 2300 | |
robust MT-NN | 80 | 4000 | |
graph convolution | 80 | 900 | |
clintox | logistic regression | 15 | 10 |
Multitask network | 15 | 20 | |
robust MT-NN | 15 | 30 | |
random forest | 15 | 200 | |
IRV | 15 | 10 | |
graph convolution | 20 | 130 | |
hiv | logistic regression | 180 | 40 |
Multitask network | 180 | 350 | |
robust MT-NN | 180 | 450 | |
random forest | 180 | 2800 | |
IRV | 180 | 200 | |
graph convolution | 180 | 1300 | |
delaney | MT-NN regression | 10 | 40 |
graphconv regression | 10 | 40 | |
random forest | 10 | 30 | |
sampl | MT-NN regression | 10 | 30 |
graphconv regression | 10 | 40 | |
random forest | 10 | 20 | |
nci | MT-NN regression | 400 | 1200 |
graphconv regression | 400 | 2500 | |
pdbbind(core) | MT-NN regression | 0(featurized) | 30 |
pdbbind(refined) | MT-NN regression | 0(featurized) | 40 |
pdbbind(full) | MT-NN regression | 0(featurized) | 60 |
chembl | MT-NN regression | 200 | 9000 |
qm7 | MT-NN regression | 10 | 400 |
qm7b | MT-NN regression | 10 | 600 |
kaggle | MT-NN regression | 2200 | 3200 |
Contributing to DeepChem
We actively encourage community contributions to DeepChem. The first place to start getting involved is by running our examples locally. Afterwards, we encourage contributors to give a shot to improving our documentation. While we take effort to provide good docs, there's plenty of room for improvement. All docs are hosted on Github, either in this README.md
file, or in the docs/
directory.
Once you've got a sense of how the package works, we encourage the use of Github issues to discuss more complex changes, raise requests for new features or propose changes to the global architecture of DeepChem. Once consensus is reached on the issue, please submit a PR with proposed modifications. All contributed code to DeepChem will be reviewed by a member of the DeepChem team, so please make sure your code style and documentation style match our guidelines!
Code Style Guidelines
DeepChem uses yapf to autoformat code. We created a git pre-commit hook to make this process easier.
cp devtools/travis-ci/pre-commit .git/hooks
pip install yapf==0.16.0
Documentation Style Guidelines
DeepChem uses NumPy style documentation. Please follow these conventions when documenting code, since we use Sphinx+Napoleon to automatically generate docs on deepchem.io.
Gitter
Join us on gitter at https://gitter.im/deepchem/Lobby. Probably the easiest place to ask simple questions or float requests for new features.
DeepChem Publications
- Computational Modeling of β-secretase 1 (BACE-1) Inhibitors using Ligand Based Approaches
- Low Data Drug Discovery with One-shot Learning
About Us
DeepChem is a package by the Pande group at Stanford. DeepChem was originally created by Bharath Ramsundar, and has grown through the contributions of a number of undergraduate, graduate, and postdoctoral researchers working with the Pande lab.