Molecules_Dataset_Collection

Collection of molecule data sets for validating property inference

MIT License

Collection of data sets of molecules and properties 🎁 😄

What is it?

  • Inspired by Moleculenet.ai
  • Selection of data sets of molecules (SMILES) and physicochemical properties

Aim?

  1. Uniformize all SMILES in the data sets with RDKit
  2. Gather the data sets in one place. They are all here!
  3. Use them to validate the inference of molecular properties with various machine learning models, as proposed in Z. Wu et al.

Method?

  • All data sets are regularized with RDKit to output isomeric, canonical and kekulized SMILES (Daylight)
  • If a SMILES could not be regularized, a blank replaces the SMILES found in the original data set
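
The regularization described above can be sketched as follows. This is a minimal illustration, not the exact script used to build this collection: the function names `rdkit_regularize` and `regularize_all`, and the `regularize` hook parameter, are hypothetical. The RDKit calls (`Chem.MolFromSmiles`, `Chem.Kekulize`, `Chem.MolToSmiles`) are standard, but the options actually used for these data sets are an assumption.

```python
from typing import Callable, Iterable, List


def rdkit_regularize(smiles: str) -> str:
    """Return an isomeric, canonical, kekulized SMILES, or "" on failure."""
    # Imported lazily so regularize_all() stays usable without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)  # None when the SMILES cannot be parsed
    if mol is None:
        return ""
    try:
        # Convert aromatic bonds to explicit single/double bonds.
        Chem.Kekulize(mol, clearAromaticFlags=True)
        return Chem.MolToSmiles(
            mol, isomericSmiles=True, canonical=True, kekuleSmiles=True
        )
    except Exception:
        # Any RDKit failure is treated as "not regularized" (assumption).
        return ""


def regularize_all(
    smiles_list: Iterable[str],
    regularize: Callable[[str], str] = rdkit_regularize,
) -> List[str]:
    """Regularize every SMILES, keeping row order; failures become blanks."""
    return [regularize(s) for s in smiles_list]
```

Because failures are replaced by a blank rather than dropped, the output keeps one entry per input row, so the property columns of the original data set stay aligned.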

But what are these data sets?

  • Quantum Mechanics: QM9
  • Physical Chemistry: ESOL, FreeSolv, Lipophilicity
  • Biophysics: PCBA, HIV, BACE
  • Physiology: BBBP, Tox21, ToxCast, SIDER, ClinTox

From Moleculenet.ai, here are short descriptions of the data sets, with the inference task in square brackets (for the regularized data sets reported here):

  • QM9: Geometric, energetic, electronic and thermodynamic properties of DFT-modelled small molecules [regression]

  • ESOL: Water solubility data (log solubility in mol per litre) for common organic small molecules [regression]

  • FreeSolv: Experimental and calculated hydration free energy of small molecules in water [regression]

  • Lipophilicity: Experimental results of octanol/water distribution coefficient (logD at pH 7.4) [regression]

  • PCBA: Selected from PubChem BioAssay, consisting of measured biological activities of small molecules generated by high-throughput screening [classification]

  • HIV: Experimentally measured abilities to inhibit HIV replication [classification]

  • BACE: Quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1) [classification/regression]

  • BBBP: Binary labels of blood-brain barrier penetration (permeability) [classification]

  • Tox21: Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways [classification]

  • ToxCast: Toxicology data for a large library of compounds based on in vitro high-throughput screening, including experiments on over 600 tasks [classification]

  • SIDER: Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes [classification]

  • ClinTox: Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons [classification]
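
Since a SMILES that could not be regularized is left blank, those rows usually need to be skipped before training a model on any of these data sets. The sketch below shows one way to do that with the standard library only. The column names `smiles` and `label`, the helper name `load_valid_rows`, and the inline sample data are assumptions for illustration; check the header of each CSV file in this collection for the actual column names.

```python
import csv
import io
from typing import Dict, List, TextIO


def load_valid_rows(handle: TextIO, smiles_column: str = "smiles") -> List[Dict[str, str]]:
    """Read a CSV and keep only rows whose SMILES entry is not blank."""
    reader = csv.DictReader(handle)
    return [row for row in reader if (row.get(smiles_column) or "").strip()]


# Made-up sample standing in for one of the regularized files;
# the second row has a blank SMILES and is dropped.
sample = io.StringIO("smiles,label\nCCO,1\n,0\nC1=CC=CC=C1,0\n")
rows = load_valid_rows(sample)
print(len(rows), [row["smiles"] for row in rows])
```

For a real file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")`, where `path` points to one of the CSV files.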

Citation

Source: Moleculenet.ai

Paper: Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande, MoleculeNet: A Benchmark for Molecular Machine Learning, arXiv:1703.00564 [cs.LG], 2017