
My (small) research project in solubility of drug-like molecules

Primary language: Jupyter Notebook. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

Solubility Challenge

Notice: This is research code that will not necessarily be maintained in the future. The code is under development so make sure you are using the most recent version. I welcome bug reports and PRs but make no guarantees about fixes or responses.

Table of contents

Solubility
Raw data
Datasets information
Papers
Results and some comments
License

Solubility

Intrinsic solubility (water solubility): the solubility of the non-charged form of a molecule, i.e. the free acid or the free base. The solubility must be determined in the presence of the solid compound.
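
To make the distinction concrete, here is a minimal sketch of how intrinsic solubility (logS0) relates to the solubility observed at a given pH for a monoprotic compound, using the Henderson-Hasselbalch relation. The function name `log_s_at_ph` and the example numbers are hypothetical and are not part of this repository; salt precipitation and aggregation are ignored.

```python
import math

def log_s_at_ph(log_s0, pka, ph, kind):
    """Apparent logS of a monoprotic compound at a given pH.

    Uses the Henderson-Hasselbalch relation:
      acid: S = S0 * (1 + 10**(pH - pKa))
      base: S = S0 * (1 + 10**(pKa - pH))
    """
    if kind == "acid":
        exponent = ph - pka   # acids ionize above their pKa
    elif kind == "base":
        exponent = pka - ph   # bases ionize below their pKa
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return log_s0 + math.log10(1.0 + 10.0 ** exponent)

# Hypothetical acid (placeholder numbers): logS0 = -4.0, pKa = 4.4, pH 7.4
apparent = log_s_at_ph(-4.0, 4.4, 7.4, "acid")
print(round(apparent, 2))
```

Under these assumptions, a value measured at pH 7.4 can differ from logS0 by several log units for an ionizable compound, which is why pH-specific measurements are not a direct substitute for logS0.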

The project is motivated by the challenge and the following blog post.

(Update 01/02/2021): The analysis of the challenge results has recently been published.

Data preparation and model training

In this section we discuss how to (1) prepare data, (2) train models, and (3) make challenge predictions with the code in this repository:

  • Process the raw data, store it in a standardized format, and exclude the test cases (stored in test_32.smi and test_100.smi) from training:
python prepare_data.py

All the SMILES are first canonicalized and standardized before the master training data set is created. To change the list of files used for the training set, comment out the lines in the process() and unique() functions in prepare_data.py.
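
The exclusion step can be sketched as follows. This is only an illustration, not the code in prepare_data.py: the helper `exclude_test_cases` and the toy records are hypothetical, and the string comparison assumes that both the training and the test SMILES were already canonicalized by the same toolkit.

```python
def exclude_test_cases(training, test_smiles):
    """Drop training records whose SMILES also appears in the test set.

    `training` is a list of (smiles, logS0) pairs. Comparing strings is only
    meaningful if both sides were canonicalized in the same way.
    """
    held_out = set(test_smiles)
    return [(smi, value) for smi, value in training if smi not in held_out]

# Toy records; the numbers are placeholders, not measurements
training = [("CCO", 1.10), ("c1ccccc1", -1.64), ("CC(=O)O", 1.22)]
test_smiles = ["c1ccccc1"]

filtered = exclude_test_cases(training, test_smiles)
print(filtered)
```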

For this set-up, the challenge datasets are our external test sets, and the training set is further split (see below) into 5 cross-validation folds.
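
A minimal sketch of a 5-fold split, written with the standard library only. The function `k_fold_indices` is hypothetical; the training scripts in this repository may split the data differently (for example with a library routine).

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]   # k nearly equal parts
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(12, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Each record lands in exactly one validation fold, so every compound is predicted once by a model that did not see it during training.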

  • Train the RFPredictor (or another model) on a dataset excluding Set-100 (solubility.uniq.no-in-100.smi), and save the cross-validation metrics into a file (rf-no-in-100.dat):
python rf.py --input ../../data/training/solubility.uniq.no-in-100.smi \
             --output ../../results/rf-no-in-100.dat

To run a model training in y-randomization mode (as a baseline), add --y_rand to the command options.
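
For reference, y-randomization means shuffling the target values so that they no longer correspond to their molecules; a model trained on such data should perform no better than chance. The sketch below shows the idea only. The helper `y_randomize` is hypothetical, the values are placeholders, and this is not necessarily how --y_rand is implemented in the scripts.

```python
import random

def y_randomize(targets, seed=0):
    """Return a shuffled copy of the targets; the input list is left untouched."""
    shuffled = list(targets)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Placeholder logS0 values, not measurements
log_s = [-3.2, -1.5, 0.4, -5.1, -2.7]
scrambled = y_randomize(log_s)
print(scrambled)
```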

  • Train the EnsemblePredictor and make a prediction for Set-100:
python make_challenge_prediction.py --model ensemble \
                                    --train_file ../data/training/solubility.uniq.no-in-100.smi \
                                    --test_file ../data/test/test_100.smi \
                                    --out_file ../data/results/ensemble.test_100.preds.dat
  • Check your challenge predictions by comparing them to the values found in public sources:
python estimate_accuracy.py ../data/test/test_100.with.gse.smi ../results/ensemble.test_100.preds.dat ../data/test/test_100.in-train.smi
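
Two metrics that are commonly reported for this kind of comparison are the RMSE and the fraction of predictions within 0.5 log units of the reference value. The sketch below is an assumption about what such a comparison could look like; the function names and numbers are hypothetical, and estimate_accuracy.py may report different quantities.

```python
import math

def rmse(predicted, reference):
    """Root-mean-square error between two equally long sequences."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

def fraction_within(predicted, reference, tolerance=0.5):
    """Fraction of predictions within `tolerance` log units of the reference."""
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reference))
    return hits / len(reference)

# Placeholder logS0 values, not taken from the challenge data
predicted = [-3.0, -2.0, -4.5, -1.0]
reference = [-3.4, -2.0, -3.5, -1.2]

print(rmse(predicted, reference))
print(fraction_within(predicted, reference))
```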

Datasets

Note: The training dataset (i.e. all unique SMILES extracted from the raw data) was only mildly curated: (1) compounds with MolW > 600 or MolW < 60 were filtered out; (2) when multiple measurements were available, compounds whose values differed by more than 1 log unit or had opposite signs (e.g. logS0=3 and logS0=-3) were excluded; (3) the OCHEM database was excluded completely (because of too many dubious data points).
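
Rules (1) and (2) can be sketched as follows. The function `curate`, the record layout, and the toy records are hypothetical. Averaging the surviving measurements is an assumption of this sketch, since the note above does not say how consistent duplicates are combined.

```python
def curate(records, min_mw=60.0, max_mw=600.0, max_spread=1.0):
    """Apply the mild curation rules to {smiles: (mol_weight, [logS0, ...])}.

    Returns {smiles: logS0}. Taking the mean of consistent measurements
    is an assumption made for this sketch.
    """
    kept = {}
    for smiles, (mol_weight, values) in records.items():
        if not (min_mw <= mol_weight <= max_mw):
            continue  # rule (1): molecular weight outside 60..600
        if max(values) - min(values) > max_spread:
            continue  # rule (2): measurements differ by more than 1 log unit
        if min(values) < 0 < max(values):
            continue  # rule (2): measurements have opposite signs
        kept[smiles] = sum(values) / len(values)
    return kept

# Toy records with placeholder keys and values, not real data
records = {
    "A": (180.2, [-1.7, -1.9]),  # kept
    "B": (720.0, [-4.0]),        # too heavy
    "C": (250.0, [-2.0, -3.5]),  # spread of 1.5 log units
    "D": (150.0, [0.4, -0.3]),   # opposite signs
}
curated = curate(records)
print(curated)
```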

| Dataset | Do I trust it? | Comments |
|---|---|---|
| A.2019.ADMET_DMPK | (+) | Had to get SMILES from name (some failed) |
| AB.2001.EJPS | (+/-) | Units are not clear to me |
| ABB.2000.PR | (+/-) | Units are not clear to me |
| BOM.2017.JC | (+) | |
| D.2008.JCIC | (+) | |
| H.2000.test1 | (+) | Downloaded from the website |
| H.2000.test2 | (+) | Downloaded from the website |
| H.2000.train | (+) | Downloaded from the website |
| HXZ.2004.JCIC.data_set | (+) | Downloaded from the website |
| HXZ.2004.JCIC.test_set1 | (+) | Downloaded from the website |
| LGG.2008.JCIM.32 | (+) | |
| LGG.2008.JCIM.100 | (+) | |
| LPB.2013.JCIC [all] | (-) | Can't understand the format of the data! |
| POG.2007.JCIM.test | (+) | Data obtained from authors |
| POG.2007.JCIM.train | (+) | Data obtained from authors |
| WKH.2007.JCIM.solubility | (+) | ADME website data |
| WXY.2009.JCIM | (+/-) | Data in SLN format. Set-003 broken. |
| OCHEM.WaterSolubility | (+/-) | Lots of repeats, some sign errors |
| PubChem | (+/-) | No logS0 data. Measurements at pH=7.4 |

Papers

  1. Can You Predict Solubilities of Thirty-Two Molecules Using a Database of One Hundred Reliable Measurements?
    Antonio Llinàs, Robert C. Glen and Jonathan M. Goodman
    J. Chem. Inf. Modeling 2008, 48, 1289-1303
    [paper 1]
    [paper 2]
    [website]
    Note 0: This is the reference for the original Solubility Challenge
    Note 1: In the test set, SMILES strings for probenecid and pseudoephedrine were swapped. Use only soldataswap.xls file.
    Note 2: Solubility for 32 compounds taken from HEL.2009.JCIM.pdf
    Note 3: Data was downloaded from the original website, but the numbers are dubious (IMO) - use CAREFULLY!

  2. ESOL: Estimating Aqueous Solubility Directly from Molecular Structure
    John S. Delaney
    J. Chem. Inf. Comput. Sci. 2004, 44, 1000-1005
    [paper]
    Note: There are two files D.2008.JCIC.solubility.v[1-2].txt. These files are the same but come from two different sources: (i) Pat Walters Blog (ii) ChemDB

  3. Estimation of Aqueous Solubility for a Diverse Set of Organic Compounds Based on Molecular Topology
    Jarmo Huuskonen
    J. Chem. Inf. Comput. Sci. 2000, 40, 773-777
    [paper]
    [website]
    Note: Quite a few repeats from Delaney Set. Different measurements, though.

  4. ADME evaluation in drug discovery. 4. Prediction of aqueous solubility based on atom contribution approach
    Tingjun Hou, Ke Xia, Wei Zhang, Xiaojie Xu
    Journal of Chemical Information and Computer Sciences, 2004, 44, 266-275
    [paper]
    [website]

  5. Development of reliable aqueous solubility models and their application in drug-like analysis
    Junmei Wang, George Krudy, Tingjun Hou, George Holland, Xiaojie Xu
    Journal of Chemical Information and Modeling, 2007, 47, 1395-1404
    [paper]
    [website]
    Note: In logS database, the aqueous solubility was expressed as logS, where S is the solubility at a temperature of 20-25°C in mol/L. These are two databases for our modeling. In reference [4], the data afforded by Tetko was used. This database includes 1290 organic compounds. The data set was converted from the SMILES flat file representation to the MACCS/sdf structured data file. In reference [5], some new molecules collected from literature were added. This database includes 1708 molecules.

  6. Can human experts predict solubility better than computers?
    Samuel Boobier, Anne Osbourn and John B. O. Mitchell
    Journal of Cheminformatics, 2017, 9:63
    [paper]
    [website]
    Note: Source codes accompany the paper.

  7. pH-metric solubility. 3. Dissolution titration template method for solubility determination
    Alex Avdeef, Cynthia M. Berger
    European Journal of Pharmaceutical Sciences 14 (2001) 281–29
    [paper]

  8. pH-Metric Solubility. 2: Correlation Between the Acid-Base Titration and the Saturation Shake-Flask Solubility-pH Methods
    Alex Avdeef, Cynthia M. Berger, and Charles Brownell
    Pharmaceutical Research, Vol. 17, No. 1, 2000
    [paper]

  9. Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information
    Iurii Sushko et al.
    J Comput Aided Mol Des (2011) 25:533–554
    [paper]
    [server]

  10. Solubility Challenge revisited after 10 years, with multi-lab shake-flask data, using tight (SD 0.17 log) and loose (SD 0.62 log) test sets
    Antonio Llinas and Alex Avdeef
    J. Chem. Inf. Model., 2019
    [paper]
    Note: The reference for the new challenge.

  11. Random Forest Models To Predict Aqueous Solubility
    David S. Palmer, Noel M. O’Boyle, Robert C. Glen, and John B. O. Mitchell
    J. Chem. Inf. Model. 2007, 47 (1), 150-158
    [paper]
    Note: Data extracted from pdfs

  12. Deep Architectures and Deep Learning in Chemoinformatics
    Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi
    J. Chem. Inf. Model. 2013, 53 (7), 1563-1575
    [paper]
    Note: Some of the files/data are duplicates

  13. Is Experimental Data Quality the Limiting Factor in Predicting the Aqueous Solubility of Druglike Molecules?
    David S. Palmer and John B. O. Mitchell
    Mol. Pharmaceutics 2014, 11, 2962−2972
    [paper]
    Note: Good overview of the sources of the errors in solubility prediction.

  14. Convolutional Networks on Graphs for Learning Molecular Fingerprints
    David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams.
    arXiv, 2015
    [paper]
    [code]
    Note 1: The original code is in Python 2. To make it work, use futurize to convert it to Python 3.
    Note 2: Install with python setup.py install

  15. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling
    Vladimir Svetnik, Andy Liaw, Christopher Tong, J. Christopher Culberson, Robert P. Sheridan, and Bradley P. Feuston
    J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958
    [paper]

  16. Binary Classification of Aqueous Solubility Using Support Vector Machines with Reduction and Recombination Feature Selection
    Cheng, T., Li, Q., Wang, Y., and Bryant, S.H.
    Journal of Chemical Information and Modeling, 2011, 51, 229-236
    [paper]
    Note: The measurements come from BioAssay AID:1996, and are done at pH=7.4. Not very useful for a prediction of logS0.

  17. Aqueous Solubility Prediction Based on Weighted Atom Type Counts and Solvent Accessible Surface Areas
    Junmei Wang, Tingjun Hou, and Xiaojie Xu
    J. Chem. Inf. Model. 2009, 49, 571–581
    [paper]
    Note: (i) The data is in SLN format; CIRpy is needed to convert it to SMILES. (ii) Set-003 looks suspicious, so I excluded it from the training data.

  18. Multi-lab intrinsic solubility measurement reproducibility in CheqSol and shake-flask methods
    Alex Avdeef
    ADMET & DMPK
    [paper]

License

The library is open-source for academic and education users. If you want to use the library in any of your work please cite: Pawel Gniewek, Solubility prediction of drug-like compounds, https://github.com/pgniewko/solubility.