This repository hosts Therapeutics Data Commons (TDC), an open, user-friendly, and extensive dataset hub for medicinal machine learning tasks. So far, it includes 100+ datasets for 20+ tasks (ranging from target identification, virtual screening, and QSAR to patient recruitment and safety surveillance) across most stages of drug development (from discovery and development to clinical trials and post-market monitoring).
- Extensive: covers 100+ datasets for 20+ tasks in most of the drug development stages.
- Ready-to-use: the output can feed directly into prediction libraries such as scikit-learn and DeepPurpose.
- User-friendly: loading a dataset takes 3 lines of code, and TDC supports useful functions such as conversion to DGL/PyG graphs for interaction data, cold/scaffold splits, label distribution visualization, binarization, log-conversion, and more.
- Benchmark: provides a benchmark mode for fair comparison. We also provide a leaderboard!
- Easy-to-contribute: provides a simple way to contribute a new dataset (just write a loading function; see the CONTRIBUTE page).
![](fig/example.gif)
```python
from tdc.property_pred import ADME

# load the dataset
data = ADME(name = 'LogD74')
# scaffold split using benchmark seed
split = data.get_split(method = 'scaffold', seed = 'benchmark')
# visualize label distribution
data.label_distribution()
# binarize
data.binarize()
# convert to log
data.convert_to_log()
# get data in the various formats
data.get_data(format = 'DeepPurpose')
```
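To make the binarize and log-conversion steps above concrete, here is a minimal, self-contained sketch of what such label transformations typically do. The helper names `binarize` and `nm_to_p` and the example threshold are hypothetical, and the nM to negative-log-molar convention is a common choice rather than a statement of TDC's exact formula.

```python
import math

def binarize(values, threshold, positive_if_above=True):
    """Map continuous labels to 0/1 around a threshold (hypothetical helper)."""
    if positive_if_above:
        return [1 if v > threshold else 0 for v in values]
    return [1 if v < threshold else 0 for v in values]

def nm_to_p(values_nm):
    """Convert concentrations in nM to negative log10 molar units (e.g. Kd to pKd).

    Assumed convention: p = -log10(value_in_molar) = 9 - log10(value_in_nM).
    """
    return [9.0 - math.log10(v) for v in values_nm]

# Toy continuous labels, binarized at an arbitrary threshold of 2.0
labels = [0.5, 2.1, 3.7, 1.0]
binary = binarize(labels, threshold=2.0)

# Toy affinities in nM (1 nM and 1 uM) converted to a log scale
affinities_nm = [1.0, 1000.0]
p_values = nm_to_p(affinities_nm)
```

Log conversion compresses values that span several orders of magnitude, which is why affinity regression tasks are usually trained on the log scale.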
Install TDC with pip: `pip install tdc`
TDC organizes its datasets by task formulation, and each task formulation is associated with many datasets. For example, ADME is a task formulation with its own set of datasets. To load a dataset Y from task formulation X, simply call `X(name = Y)`.
- Absorption, Distribution, Metabolism, and Excretion: `ADME`

| Dataset Name | Description | Reference | Type | Stats |
| --- | --- | --- | --- | --- |
| AstraZeneca: `ADME(name = 'Lipophilicity_AstraZeneca')` | Lipophilicity is a dataset curated from the ChEMBL database containing experimental results on the octanol/water distribution coefficient (logD at pH = 7.4). From MoleculeNet. | AstraZeneca. Experimental in vitro Dmpk and physicochemical data on a set of publicly disclosed compounds (2016) | Regression | 4,200 Drugs |
| LogD74: `ADME(name = 'Lipophilicity_Wang')` | A high-quality hand-curated lipophilicity dataset that includes the chemical structure of 1,130 organic compounds and their n-octanol/buffer solution distribution coefficients at pH 7.4 (logD7.4). | Wang, J-B., D-S. Cao, M-F. Zhu, Y-H. Yun, N. Xiao, Y-Z. Liang (2015). In silico evaluation of logD7.4 and comparison with other prediction methods. Journal of Chemometrics, 29(7), 389-398. | Regression | 1,094 Drugs |

| Dataset Name | Description | Reference | Type | Stats |
| --- | --- | --- | --- | --- |
| AqSolDB: `ADME(name = 'Solubility_AqSolDB')` | AqSolDB: A curated reference set of aqueous solubility, created by the Autonomous Energy Materials Discovery [AMD] research group, consists of aqueous solubility values of 9,982 unique compounds curated from 9 different publicly available aqueous solubility datasets. | Sorkun, M.C., Khetan, A. & Er, S. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Sci Data 6, 143 (2019). | Regression | 9,982 Drugs |
| ESOL: `ADME(name = 'Solubility_ESOL')` | ESOL (Delaney) is a standard regression dataset containing structures and water solubility data for 1,128 compounds. From MoleculeNet. | Delaney, John S. "ESOL: estimating aqueous solubility directly from molecular structure." Journal of chemical information and computer sciences 44.3 (2004): 1000-1005. | Regression | 1,128 Drugs |
| FreeSolv: `ADME(name = 'HydrationFreeEnergy_FreeSolv')` | The Free Solvation Database, FreeSolv (SAMPL), provides experimental and calculated hydration free energy of small molecules in water. The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. From MoleculeNet. | Mobley, David L., and J. Peter Guthrie. "FreeSolv: a database of experimental and calculated hydration free energies, with input files." Journal of computer-aided molecular design 28.7 (2014): 711-720. | Regression | 642 Drugs |

| Dataset Name | Description | Reference | Type | Stats |
| --- | --- | --- | --- | --- |
| Caco-2: `ADME(name = 'Caco2_Wang')` | The Caco-2 cell effective permeability (Peff) is an in vitro approximation of the rate at which the drug passes through intestinal tissue. | Ning-Ning Wang, Jie Dong, Yin-Hua Deng, Min-Feng Zhu, Ming Wen, Zhi-Jiang Yao, Ai-Ping Lu, Jian-Bing Wang, and Dong-Sheng Cao. Journal of Chemical Information and Modeling 2016 56 (4), 763-773 | Regression | 910 Drugs |
| HIA: `ADME(name = 'HIA_Hou')` | Human intestinal absorption (HIA) is the process by which orally administered drugs are absorbed from the gastrointestinal system into the bloodstream of the human body. | Hou T, Wang J, Zhang W, Xu X. ADME evaluation in drug discovery. 7. Prediction of oral absorption by correlation and classification. J Chem Inf Model. 2007;47(1):208-218. doi:10.1021/ci600343x | Binary | 578 Drugs |
| Pgp: `ADME(name = 'Pgp_Broccatelli')` | P-glycoprotein (Pgp or ABCB1) is an ABC transporter protein involved in intestinal absorption, drug metabolism, and brain penetration, and its inhibition can seriously alter a drug's bioavailability and safety. In addition, inhibitors of Pgp can be used to overcome multidrug resistance. | A Novel Approach for Predicting P-Glycoprotein (ABCB1) Inhibition Using Molecular Interaction Fields. Fabio Broccatelli, Emanuele Carosati, Annalisa Neri, Maria Frosini, Laura Goracci, Tudor I. Oprea, and Gabriele Cruciani. Journal of Medicinal Chemistry 2011 54 (6), 1740-1751 | Binary | 1,267 Drugs |
| Bioavailability: `ADME(name = 'Bioavailability_Ma')` | Oral bioavailability is defined as (taking the FDA's definition) “the rate and extent to which the active ingredient or active moiety is absorbed from a drug product and becomes available at the site of action”. | Ma, Chang-Ying, et al. "Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA–CG–SVM method." Journal of pharmaceutical and biomedical analysis 47.4-5 (2008): 677-682. | Binary | 640 Drugs |
| Bioavailability_F20_eDrug3D: `ADME(name = 'F20_eDrug3D')` | Oral bioavailability is defined as (taking the FDA's definition) “the rate and extent to which the active ingredient or active moiety is absorbed from a drug product and becomes available at the site of action”. Processed from the eDrug3D dataset, using 20% as the threshold. | Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541. | Binary | 403 Drugs |
| Bioavailability_F30_eDrug3D: `ADME(name = 'F30_eDrug3D')` | Oral bioavailability is defined as (taking the FDA's definition) “the rate and extent to which the active ingredient or active moiety is absorbed from a drug product and becomes available at the site of action”. Processed from the eDrug3D dataset, using 30% as the threshold. | Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541. | Binary | 403 Drugs |

| Dataset Name | Description | Reference | Type | Stats |
| --- | --- | --- | --- | --- |
| BBB: `ADME(name = 'BBB_Adenot')` | The blood–brain barrier (BBB) is a highly selective semipermeable border of endothelial cells that prevents solutes in the circulating blood from non-selectively crossing into the extracellular fluid of the central nervous system where neurons reside. | Adenot M, Lahana R. Blood-brain barrier permeation models: discriminating between potential CNS and non-CNS drugs including P-glycoprotein substrates. J Chem Inf Comput Sci. 2004;44(1):239-248. | | |
| BBB_MolNet: `ADME(name = 'BBB_MolNet')` | The blood-brain barrier penetration (BBB) dataset is extracted from a study on the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones, and neurotransmitters. Penetration of the barrier is therefore a long-standing issue in the development of drugs targeting the central nervous system. This dataset includes binary labels for over 2,000 compounds on their permeability properties. From MoleculeNet. | Martins, Ines Filipa, et al. "A Bayesian approach to in silico blood-brain barrier penetration modeling." Journal of chemical information and modeling 52.6 (2012): 1686-1697. | | |
| PPBR: `ADME(name = 'PPBR_Ma')` | The human plasma protein binding rate (PPBR) is expressed as the percentage of a drug bound to plasma proteins. Medications attach to proteins within the blood. A drug's efficiency may be affected by the degree to which it binds. The less bound a drug is, the more efficiently it can traverse cell membranes or diffuse. | Ma, Chang-Ying, et al. "Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA–CG–SVM method." Journal of pharmaceutical and biomedical analysis 47.4-5 (2008): 677-682. | | |
| PPBR_eDrug3D: `ADME(name = 'PPBR_eDrug3D')` | The human plasma protein binding rate (PPBR) is expressed as the percentage of a drug bound to plasma proteins. Medications attach to proteins within the blood. A drug's efficiency may be affected by the degree to which it binds. The less bound a drug is, the more efficiently it can traverse cell membranes or diffuse. Processed from the eDrug3D dataset. | Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541. | | |
| VD_eDrug3D: `ADME(name = 'VD_eDrug3D')` | The volume of distribution is the theoretical volume that would be necessary to contain the total amount of an administered drug at the same concentration that is observed in the blood plasma. Processed from the eDrug3D dataset. | Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541. | | |

| Dataset Name | Description | Reference | Type | Stats |
| --- | --- | --- | --- | --- |
| CYP2C19: `ADME(name = 'CYP2C19_Veith')` | The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, the CYP P450 2C19 gene provides instructions for making an enzyme that is found primarily in liver cells in a cell structure called the endoplasmic reticulum, which is involved in protein processing and transport. | Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature biotechnology vol. 27,11 (2009): 1050-5.; PubChem AID1851 | Binary | 12,665 Drugs |
| CYP2D6: `ADME(name = 'CYP2D6_Veith')` | The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, CYP2D6 is primarily expressed in the liver. It is also highly expressed in areas of the central nervous system, including the substantia nigra. | Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature biotechnology vol. 27,11 (2009): 1050-5.; PubChem AID1851 | Binary | 13,130 Drugs |
| CYP3A4: `ADME(name = 'CYP3A4_Veith')` | The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, CYP3A4 is an important enzyme in the body, mainly found in the liver and in the intestine. It oxidizes small foreign organic molecules (xenobiotics), such as toxins or drugs, so that they can be removed from the body. | Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature biotechnology vol. 27,11 (2009): 1050-5.; PubChem AID1851 | Binary | 12,328 Drugs |
| CYP1A2: `ADME(name = 'CYP1A2_Veith')` | The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, CYP1A2 localizes to the endoplasmic reticulum and its expression is induced by some polycyclic aromatic hydrocarbons (PAHs), some of which are found in cigarette smoke. It is able to metabolize some PAHs to carcinogenic intermediates. Other xenobiotic substrates for this enzyme include caffeine, aflatoxin B1, and acetaminophen. | Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature biotechnology vol. 27,11 (2009): 1050-5.; PubChem AID1851 | Binary | 12,579 Drugs |
| CYP2C9: `ADME(name = 'CYP2C9_Veith')` | The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, CYP P450 2C9 plays a major role in the oxidation of both xenobiotic and endogenous compounds. | Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature biotechnology vol. 27,11 (2009): 1050-5.; PubChem AID1851 | Binary | 12,092 Drugs |

| Dataset Name | Description | Reference | Type | Stats |
| --- | --- | --- | --- | --- |
| Half_life_eDrug3D: `ADME(name = 'HalfLife_eDrug3D')` | The duration of action of a drug is known as its half-life. This is the period of time required for the concentration or amount of drug in the body to be reduced by one-half. Processed from the eDrug3D dataset. | Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541. | | |
| Clearance_eDrug3D: `ADME(name = 'Clearance_eDrug3D')` | Drug clearance is concerned with the rate at which the active drug is removed from the body. Clearance is defined as the rate of drug elimination divided by the plasma concentration of the drug. Processed from the eDrug3D dataset. | Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541. | | |

- Toxicity: `Toxicity`

| Dataset Name | Description | Reference |
| --- | --- | --- |
| Tox21: `Toxicity(name = 'Tox21', target = 'NR-AR')` (choose the target from the list of available targets) | The 2014 Tox21 Data Challenge contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways. From MoleculeNet. | Tox21 Challenge. |
| ToxCast: `Toxicity(name = 'ToxCast', target = 'ACEA_T47D_80hr_Negative')` (choose the target from the list of available targets) | ToxCast includes qualitative results of over 600 experiments on 8k compounds. From MoleculeNet. | Richard, Ann M., et al. "ToxCast chemical landscape: paving the road to 21st century toxicology." Chemical research in toxicology 29.8 (2016): 1225-1251. |
| ClinTox: `Toxicity(name = 'ClinTox')` | The ClinTox dataset compares drugs approved by the FDA with drugs that have failed clinical trials for toxicity reasons. From MoleculeNet. | Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. "A data-driven approach to predicting successes and failures of clinical trials." Cell chemical biology 23.10 (2016): 1294-1301. |

- High Throughput Screening BioAssays: `HTS`

| Dataset Name | Description | Reference | Type | Stats |
| --- | --- | --- | --- | --- |
| SARS-CoV2 in vitro: `HTS(name = 'SARSCoV2_Vitro_Touret')` | An in vitro screen of the Prestwick Chemical Library, composed of 1,520 approved drugs, in an infected cell-based assay. | Touret, F., Gilles, M., Barral, K. et al. In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication. Sci Rep 10, 13093 (2020). | Binary | |
| SARS-CoV2 3CLPro: `HTS(name = 'SARSCoV2_3CLPro_Diamond')` | A large XChem crystallographic fragment screen against SARS-CoV-2 main protease at high resolution. | Diamond Light Source | Binary | |
| HIV: `HTS(name = 'HIV')` | The HIV dataset was introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. From MoleculeNet. | AIDS Antiviral Screen Data. https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data | Binary | 41,127 Drugs |

- Quantum Mechanics: `QM`

| Dataset Name | Description | Reference | Type | Stats |
| --- | --- | --- | --- | --- |
| QM7: `QM(name = 'QM7', target = 'X')` (choose the target from the list of available targets) | This dataset is for multitask learning where 14 properties (e.g. polarizability, HOMO and LUMO eigenvalues, excitation energies) have to be predicted at different levels of theory (ZINDO, SCS, PBE0, GW). From MoleculeNet. | L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012. | Regression | 7,211 drugs |
| QM8: `QM(name = 'QM8', target = 'X')` (choose the target from the list of available targets) | Electronic spectra and excited state energy of small molecules calculated by multiple quantum mechanical methods. From MoleculeNet. | L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012. | Regression | 22,000 drugs |
| QM9: `QM(name = 'QM9', target = 'X')` (choose the target from the list of available targets) | Geometric, energetic, electronic, and thermodynamic properties of DFT-modelled small molecules. From MoleculeNet. | R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014. | Regression | 22,000 drugs |

- Drug-Target Interaction Prediction Dataset: `DTI`

| Dataset Name | Description | Reference | Type | Stats (pairs/#drugs/#targets) |
| --- | --- | --- | --- | --- |
| BindingDB: `DTI(name = 'BindingDB_X')`, where X is one of Kd, IC50, EC50, or Ki | BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules. | BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities | Regression (log)/Binary | 66,444/10,665/1,413 for Kd; 1,073,803/549,205/5,078 for IC50; 151,413/91,773/1,240 for EC50; 410,478/174,662/3,070 for Ki |
| DAVIS: `DTI(name = 'DAVIS')` | The interaction of 72 kinase inhibitors with 442 kinases covering >80% of the human catalytic protein kinome. | Davis, M., Hunt, J., Herrgard, S. et al. Comprehensive analysis of kinase inhibitor selectivity. Nat Biotechnol 29, 1046–1051 (2011). | Regression (log)/Binary | 30,056/68/379 |
| KIBA: `DTI(name = 'KIBA')` | An integrated drug-target bioactivity matrix across 52,498 chemical compounds and 467 kinase targets, including a total of 246,088 KIBA scores, has been made freely available. | Tang J, Szwajda A, Shakyawar S, et al. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J Chem Inf Model. 2014;54(3):735-743. | Regression | 118,254/2,068/229 |

- Drug-Drug Interaction Prediction Dataset: `DDI`

| Dataset Name | Description | Reference | Type | Stats (pairs/#drugs) |
| --- | --- | --- | --- | --- |
| DrugBank | The DrugBank drug-drug interaction dataset is manually sourced from FDA/Health Canada drug labels as well as primary literature. It has 86 interaction types. Drug SMILES are provided. | Wishart DS, et al. (2017) DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res 46:D1074–D1082. | Multi-Class/Network | 191,519/1,706 |
| TWOSIDES | Polypharmacy side-effects are associated with drug pairs (or higher-order drug combinations) and cannot be attributed to either individual drug in the pair (in a drug combination). | Tatonetti, Nicholas P., et al. Data-driven prediction of drug effects and interactions. Science Translational Medicine. 2012. | Multi-Label/Network | 4,649,441/645 |

- Protein-Protein Interaction Prediction Dataset: `PPI`

| Dataset Name | Description | Reference | Type | Stats (pairs/#proteins) |
| --- | --- | --- | --- | --- |
| HuRI: `PPI(name = 'HuRI')` | All pairwise combinations of human protein-coding genes are systematically being interrogated to identify which are involved in binary protein-protein interactions. In our most recent effort 17,500 proteins have been tested and a first human reference interactome (HuRI) map has been generated. From the Center for Cancer Systems Biology at Dana-Farber Cancer Institute. Note that the feature is the peptide sequence; if a protein gene is associated with multiple peptides, they are separated by '*'. | Luck, K., Kim, D., Lambourne, L. et al. A reference map of the human binary protein interactome. Nature 580, 402–408 (2020). | Binary/Network | 51,813/8,248 |

- Peptide-MHC Binding Prediction Dataset: `PeptideMHC`

| Dataset Name | Description | Reference | Type | Stats (pairs/#peptides/#MHCs) |
| --- | --- | --- | --- | --- |
| MHC1_NetMHCpan: `PeptideMHC(name = 'MHC1_NetMHCpan')` | Binding of peptides to MHC class I molecules (MHC-I) is essential for antigen presentation to cytotoxic T-cells. An organized dataset for MHC class I collected from the IEDB and IMGT/HLA databases. | Nielsen, Morten, and Massimo Andreatta. "NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets." Genome medicine 8.1 (2016): 1-9. | Regression | 185,985/43,018/150 |
| MHC2_NetMHCIIpan: `PeptideMHC(name = 'MHC2_NetMHCIIpan')` | Major histocompatibility complex class II (MHC‐II) molecules are found on the surface of antigen‐presenting cells where they present peptides derived from extracellular proteins to T helper cells. Useful to identify T‐cell epitopes. An organized dataset for MHC class II collected from the IEDB database. | Jensen, Kamilla Kjaergaard, et al. "Improved methods for predicting peptide binding affinity to MHC class II molecules." Immunology 154.3 (2018): 394-406. | Regression | 134,281/17,003/75 |

- Paired Molecule Generation: `MolGenPaired`

| Dataset Name | Description | Reference | Type | Stats (#pairs/#drugs) |
| --- | --- | --- | --- | --- |
| DRD2: `MolGenPaired(name = 'DRD2')` | | | | 34,404/21,703 |
| QED: `MolGenPaired(name = 'QED')` | | | | 88,306/52,262 |
| logP: `MolGenPaired(name = 'LogP')` | | | | 99,909/99,794 |
| JNK3 | | | | |
| GSK-3beta | | | | |

- Retrosynthesis: `RETRO`

| Dataset Name | Description | Reference | Type | Stats (#drugs) |
| --- | --- | --- | --- | --- |
| USPTO-50K | | | | |

- Forward Synthesis: `FORWARD`

| Dataset Name | Description | Reference | Type | Stats (#drugs) |
| --- | --- | --- | --- | --- |
| USPTO-50K | | | | |

- Reaction Prediction: `REACT`

| Dataset Name | Description | Reference | Type | Stats (#drugs) |
| --- | --- | --- | --- | --- |
| USPTO-50K | | | | |

To retrieve the dataset split, simply type:

```python
data = X(name = Y)
data.get_split(seed = 'benchmark')
# {'train': df_train, 'val': df_val, 'test': df_test}
```

You can specify the splitting method, random seed, and split fractions in the function call, e.g. `data.get_split(method = 'cold_drug', seed = 1, frac = [0.7, 0.1, 0.2])`. For drug property prediction, a scaffold split function is also provided; simply set `method = 'scaffold'`.
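To illustrate what the seed, the fractions, and a cold split mean, here is a minimal standard-library sketch. It is not TDC's implementation: `random_split`, `cold_split`, and the toy drug-target pairs are hypothetical names and data chosen for illustration. The idea of a cold split is that every entity (for example, every drug) ends up in exactly one fold, so the test set contains only drugs never seen during training.

```python
import random

def random_split(rows, frac=(0.7, 0.1, 0.2), seed=1):
    """Shuffle rows with a fixed seed and cut them into train/val/test."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_train = round(len(shuffled) * frac[0])
    n_val = round(len(shuffled) * frac[1])
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder goes to test
    }

def cold_split(rows, key, frac=(0.7, 0.1, 0.2), seed=1):
    """Split so that each entity under `key` appears in exactly one fold."""
    entities = sorted({row[key] for row in rows})
    folds = random_split(entities, frac=frac, seed=seed)
    # Map each entity to the fold that owns it, then route rows accordingly
    owner = {e: name for name, members in folds.items() for e in members}
    out = {"train": [], "val": [], "test": []}
    for row in rows:
        out[owner[row[key]]].append(row)
    return out

# Toy interaction data: 10 hypothetical drugs, each paired with 3 targets
pairs = [
    {"drug": f"D{i}", "target": f"T{j}", "y": i * j}
    for i in range(10)
    for j in range(3)
]
split = cold_split(pairs, key="drug", frac=(0.7, 0.1, 0.2), seed=1)
```

With 10 drugs and fractions 0.7/0.1/0.2, the folds hold 7, 1, and 2 drugs, so the 30 pairs are divided into 21, 3, and 6 rows. A scaffold split follows the same pattern, with the grouping key being a molecular scaffold computed from the structure rather than the drug identifier.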
We are actively working on a more systematic way to benchmark methods and maintain a leaderboard. We will release this feature in the next version. In the meantime, if you have expertise or interest in helping build this feature, please email kexinhuang@hsph.harvard.edu.
TDC is designed for rapid experimentation. The data output can be used directly by powerful prediction packages, for example DeepPurpose, which provides more advanced drug/protein encoders such as MPNN and Transformers.
TDC is designed to be a community-driven effort. We know DrugDataLoader covers only the tip of the iceberg of the data out there. You can easily upload your data by writing a function that takes the expected input and output. See the step-by-step instructions on the CONTRIBUTE page.
Send emails to kexinhuang@hsph.harvard.edu or open an issue.
TDC is an open-source effort. Many datasets are aggregated from various public web sources. We use the Attribution-NonCommercial-ShareAlike 4.0 International license to satisfy the requirements of many datasets. If this still infringes the copyright of a dataset's author, please let us know and we will take it down as soon as possible.