TDC-DATASET: A Python repository from CAVED123

This repository hosts Therapeutics Data Commons (TDC), an open, user-friendly, and extensive dataset hub for medicinal machine learning tasks. So far, it includes 100+ datasets for 20+ tasks (ranging from target identification, virtual screening, and QSAR to patient recruitment and safety surveillance) across most drug development stages (from discovery and development to clinical trials and post-market monitoring).

Features

Extensive: covers 100+ datasets for 20+ tasks in most of the drug development stages.
Ready-to-use: the output can be fed directly into prediction libraries such as scikit-learn and DeepPurpose.
User-friendly: a dataset loads in 3 lines of code, with support for useful functions such as conversion to DGL/PyG graphs for interaction data, cold/scaffold splits, label distribution visualization, binarization, log-conversion, and more!
Benchmark: provides a benchmark mode for fair comparison. We also provide a leaderboard!
Easy-to-contribute: provides a very simple way to contribute a new dataset (just write a loading function, see CONTRIBUTE page)!
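
The graph conversion listed among the features can be pictured with a small plain-Python sketch. This is not TDC's implementation: the function name `pairs_to_edge_index` and the toy drug IDs are assumptions made for illustration. It only shows how a list of interaction pairs can be mapped to integer node ids and a two-row edge list, a layout that graph libraries commonly accept.

```python
def pairs_to_edge_index(pairs):
    """Map (entity_a, entity_b) pairs to integer node ids and a 2-row edge list."""
    node_index = {}          # entity id -> integer node id, in order of first appearance
    sources, targets = [], []
    for a, b in pairs:
        for node in (a, b):
            if node not in node_index:
                node_index[node] = len(node_index)
        sources.append(node_index[a])
        targets.append(node_index[b])
    return node_index, [sources, targets]


# Hypothetical drug-drug interaction pairs (IDs are made up).
toy_pairs = [("DB001", "DB002"), ("DB002", "DB003"), ("DB001", "DB003")]
node_index, edge_index = pairs_to_edge_index(toy_pairs)
print(node_index)
print(edge_index)
```

The two rows of `edge_index` hold the source and target node of each interaction, so the structure can be handed to a graph library after conversion to that library's tensor type.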

Example

GIF placeholder
![](fig/example.gif)

from tdc.property_pred import ADME
data = ADME(name = 'LogD74')
# scaffold split using benchmark seed
split = data.get_split(method = 'scaffold', seed = 'benchmark')
# visualize label distribution
data.label_distribution()
# binarize 
data.binarize()
# convert to log
data.convert_to_log()
# get data in the various formats
data.get_data(format = 'DeepPurpose')
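
To make the binarize and log-conversion steps in the example more concrete, here is a minimal plain-Python sketch of what such transformations typically do. It does not reproduce TDC's internals: the function names, the nM unit, the negative log10 molar convention, and the threshold direction are all assumptions chosen for illustration.

```python
import math


def to_neg_log10_molar(values_nm):
    """Convert concentrations in nM to a negative log10 molar scale (assumed convention)."""
    return [-math.log10(v * 1e-9) for v in values_nm]


def binarize(values, threshold, positive_if_below=True):
    """Turn continuous labels into 0/1 labels around a threshold."""
    if positive_if_below:
        return [1 if v < threshold else 0 for v in values]
    return [1 if v >= threshold else 0 for v in values]


raw_nm = [1.0, 30.0, 1000.0]                      # hypothetical measurements in nM
log_labels = to_neg_log10_molar(raw_nm)           # smaller concentration -> larger value
binary_labels = binarize(raw_nm, threshold=30.0)  # 1 if strictly below 30 nM
print(log_labels, binary_labels)
```

Log conversion compresses labels that span several orders of magnitude, while binarization turns a regression task into a classification task; which one is appropriate depends on the dataset.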

Installation

pip install tdc

Cite

arxiv placeholder

Core Data Overview

TDC organizes data by task formulation, and each task formulation is associated with many datasets. For example, ADME is a task formulation with its own set of datasets. To load dataset Y from task formulation X, simply call X(name = Y).

Property Prediction

Absorption, Distribution, Metabolism, and Excretion `ADME`

Lipophilicity

Dataset Name	Description	Reference	Type	Stats
AstraZeneca `ADME(name = 'Lipophilicity_AstraZeneca')`	Lipophilicity is a dataset curated from ChEMBL database containing experimental results on octanol/water distribution coefficient (logD at pH=7.4). From MoleculeNet.	AstraZeneca. Experimental in vitro Dmpk and physicochemical data on a set of publicly disclosed compounds (2016)	Regression	4,200 Drugs
LogD74 `ADME(name = 'Lipophilicity_Wang')`	A high-quality hand-curated lipophilicity dataset that includes the chemical structure of 1,130 organic compounds and their n-octanol/buffer solution distribution coefficients at pH 7.4 (logD7.4).	Wang, J-B., D-S. Cao, M-F. Zhu, Y-H. Yun, N. Xiao, Y-Z. Liang (2015). In silico evaluation of logD7.4 and comparison with other prediction methods. Journal of Chemometrics, 29(7), 389-398.	Regression	1,094 Drugs

Solubility

Dataset Name	Description	Reference	Type	Stats
AqSolDB `ADME(name = 'Solubility_AqSolDB')`	AqSolDB: A curated reference set of aqueous solubility, created by the Autonomous Energy Materials Discovery [AMD] research group, consists of aqueous solubility values of 9,982 unique compounds curated from 9 different publicly available aqueous solubility datasets.	Sorkun, M.C., Khetan, A. & Er, S. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Sci Data 6, 143 (2019).	Regression	9,982 Drugs
ESOL `ADME(name = 'Solubility_ESOL')`	ESOL (delaney) is a standard regression dataset containing structures and water solubility data for 1128 compounds. From MoleculeNet.	Delaney, John S. "ESOL: estimating aqueous solubility directly from molecular structure." Journal of chemical information and computer sciences 44.3 (2004): 1000-1005.	Regression	1,128 Drugs
FreeSolv `ADME(name = 'HydrationFreeEnergy_FreeSolv')`	The Free Solvation Database, FreeSolv(SAMPL), provides experimental and calculated hydration free energy of small molecules in water. The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. From MoleculeNet.	Mobley, David L., and J. Peter Guthrie. "FreeSolv: a database of experimental and calculated hydration free energies, with input files." Journal of computer-aided molecular design 28.7 (2014): 711-720.	Regression	642 Drugs

Absorption

Dataset Name	Description	Reference	Type	Stats
Caco-2 `ADME(name = 'Caco2_Wang')`	The Caco-2 cell effective permeability (Peff) is an in vitro approximation of the rate at which the drug passes through intestinal tissue.	Ning-Ning Wang, Jie Dong, Yin-Hua Deng, Min-Feng Zhu, Ming Wen, Zhi-Jiang Yao, Ai-Ping Lu, Jian-Bing Wang, and Dong-Sheng Cao. Journal of Chemical Information and Modeling 2016 56 (4), 763-773	Regression	910 Drugs
HIA `ADME(name = 'HIA_Hou')`	Human intestinal absorption (HIA) is the process by which orally administered drugs are absorbed from the gastrointestinal system into the bloodstream.	Hou T, Wang J, Zhang W, Xu X. ADME evaluation in drug discovery. 7. Prediction of oral absorption by correlation and classification. J Chem Inf Model. 2007;47(1):208-218. doi:10.1021/ci600343x	Binary	578 Drugs
Pgp `ADME(name = 'Pgp_Broccatelli')`	P-glycoprotein (Pgp or ABCB1) is an ABC transporter protein involved in intestinal absorption, drug metabolism, and brain penetration, and its inhibition can seriously alter a drug's bioavailability and safety. In addition, inhibitors of Pgp can be used to overcome multidrug resistance.	A Novel Approach for Predicting P-Glycoprotein (ABCB1) Inhibition Using Molecular Interaction Fields. Fabio Broccatelli, Emanuele Carosati, Annalisa Neri, Maria Frosini, Laura Goracci, Tudor I. Oprea, and Gabriele Cruciani. Journal of Medicinal Chemistry 2011 54 (6), 1740-1751	Binary	1,267 Drugs
Bioavailability `ADME(name = 'Bioavailability_Ma')`	Oral bioavailability is defined as (taking the FDA's definition) “the rate and extent to which the active ingredient or active moiety is absorbed from a drug product and becomes available at the site of action”.	Ma, Chang-Ying, et al. "Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA–CG–SVM method." Journal of pharmaceutical and biomedical analysis 47.4-5 (2008): 677-682.	Binary	640 Drugs
Bioavailability_F20_eDrug3D `ADME(name = 'F20_eDrug3D')`	Oral bioavailability is defined as (taking the FDA's definition) “the rate and extent to which the active ingredient or active moiety is absorbed from a drug product and becomes available at the site of action”. Processed from eDrug3D dataset. Using 20% as the threshold.	Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541.	Binary	403 Drugs
Bioavailability_F30_eDrug3D `ADME(name = 'F30_eDrug3D')`	Oral bioavailability is defined as (taking the FDA's definition) “the rate and extent to which the active ingredient or active moiety is absorbed from a drug product and becomes available at the site of action”. Processed from eDrug3D dataset. Using 30% as the threshold.	Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541.	Binary	403 Drugs

Distribution

Dataset Name	Description	Reference
BBB `ADME(name = 'BBB_Adenot')`	The blood–brain barrier (BBB) is a highly selective semipermeable border of endothelial cells that prevents solutes in the circulating blood from non-selectively crossing into the extracellular fluid of the central nervous system where neurons reside.	Adenot M, Lahana R. Blood-brain barrier permeation models: discriminating between potential CNS and non-CNS drugs including P-glycoprotein substrates. J Chem Inf Comput Sci. 2004;44(1):239-248.
BBB_MolNet `ADME(name = 'BBB_MolNet')`	The blood-brain barrier penetration (BBB) dataset is extracted from a study on the modeling and prediction of the barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones and neurotransmitters. Thus penetration of the barrier forms a long-standing issue in development of drugs targeting central nervous system. This dataset includes binary labels for over 2000 compounds on their permeability properties. From MoleculeNet.	Martins, Ines Filipa, et al. "A Bayesian approach to in silico blood-brain barrier penetration modeling." Journal of chemical information and modeling 52.6 (2012): 1686-1697.
PPBR `ADME(name = 'PPBR_Ma')`	The human plasma protein binding rate (PPBR) is expressed as the percentage of a drug bound to plasma proteins. Medications attach to proteins within the blood. A drug's efficiency may be affected by the degree to which it binds. The less bound a drug is, the more efficiently it can traverse cell membranes or diffuse.	Ma, Chang-Ying, et al. "Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA–CG–SVM method." Journal of pharmaceutical and biomedical analysis 47.4-5 (2008): 677-682.
PPBR_eDrug3D `ADME(name = 'PPBR_eDrug3D')`	The human plasma protein binding rate (PPBR) is expressed as the percentage of a drug bound to plasma proteins. Medications attach to proteins within the blood. A drug's efficiency may be affected by the degree to which it binds. The less bound a drug is, the more efficiently it can traverse cell membranes or diffuse. Processed from eDrug3D dataset.	Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541.
VD_eDrug3D `ADME(name = 'VD_eDrug3D')`	The volume of distribution is the theoretical volume that would be necessary to contain the total amount of an administered drug at the same concentration that it is observed in the blood plasma. Processed from eDrug3D dataset.	Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541.

Metabolism

Dataset Name	Description	Reference	Type	Stats
CYP2C19 `ADME(name = 'CYP2C19_Veith')`	The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, the CYP P450 2C19 gene provides instructions for making an enzyme that is found primarily in liver cells in a cell structure called the endoplasmic reticulum, which is involved in protein processing and transport.	Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature biotechnology vol. 27,11 (2009): 1050-5.; PubChem AID1851	Binary	12,665 Drugs
CYP2D6 `ADME(name = 'CYP2D6_Veith')`	The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, CYP2D6 is primarily expressed in the liver. It is also highly expressed in areas of the central nervous system, including the substantia nigra.	Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature biotechnology vol. 27,11 (2009): 1050-5.; PubChem AID1851	Binary	13,130 Drugs
CYP3A4 `ADME(name = 'CYP3A4_Veith')`	The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, CYP3A4 is an important enzyme in the body, mainly found in the liver and in the intestine. It oxidizes small foreign organic molecules (xenobiotics), such as toxins or drugs, so that they can be removed from the body.	Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature biotechnology vol. 27,11 (2009): 1050-5.; PubChem AID1851	Binary	12,328 Drugs
CYP1A2 `ADME(name = 'CYP1A2_Veith')`	The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, CYP1A2 localizes to the endoplasmic reticulum and its expression is induced by some polycyclic aromatic hydrocarbons (PAHs), some of which are found in cigarette smoke. It is able to metabolize some PAHs to carcinogenic intermediates. Other xenobiotic substrates for this enzyme include caffeine, aflatoxin B1, and acetaminophen.	Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature biotechnology vol. 27,11 (2009): 1050-5.; PubChem AID1851	Binary	12,579 Drugs
CYP2C9 `ADME(name = 'CYP2C9_Veith')`	The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, the CYP P450 2C9 plays a major role in the oxidation of both xenobiotic and endogenous compounds.	Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature biotechnology vol. 27,11 (2009): 1050-5.; PubChem AID1851	Binary	12,092 Drugs

Excretion

Dataset Name	Description	Reference	Type	Stats
Half_life_eDrug3D `ADME(name = 'HalfLife_eDrug3D')`	The half-life of a drug is the period of time required for the concentration or amount of drug in the body to be reduced by one-half. Processed from eDrug3D dataset.	Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541.
Clearance_eDrug3D `ADME(name = 'Clearance_eDrug3D')`	Drug clearance is concerned with the rate at which the active drug is removed from the body. Clearance is defined as the rate of drug elimination divided by the plasma concentration of the drug. Processed from eDrug3D dataset.	Pihan E, Colliandre L, Guichou JF, Douguet D. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics. 2012;28(11):1540-1541.
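
The clearance definition above (rate of elimination divided by plasma concentration) can be written out as a short worked example. The sketch below uses toy numbers that do not come from any TDC dataset, and the function names are hypothetical. The half-life relation assumes a simple one-compartment model with first-order elimination, which is a textbook simplification rather than anything stated by the datasets.

```python
import math


def clearance(elimination_rate_mg_per_h, plasma_conc_mg_per_l):
    """Clearance (L/h) = rate of drug elimination / plasma concentration."""
    return elimination_rate_mg_per_h / plasma_conc_mg_per_l


def half_life(volume_of_distribution_l, clearance_l_per_h):
    """Half-life (h) under an assumed one-compartment, first-order model."""
    return math.log(2) * volume_of_distribution_l / clearance_l_per_h


cl = clearance(50.0, 10.0)      # toy values: 50 mg/h eliminated at 10 mg/L
t_half = half_life(40.0, cl)    # toy volume of distribution: 40 L
print(f"clearance = {cl:.1f} L/h, half-life = {t_half:.2f} h")
```

Under this simplified model, volume of distribution, clearance, and half-life are linked, which is one reason the three quantities are often studied together.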

Toxicity `Toxicity`

Dataset Name	Description	Reference
Tox21 `Toxicity(name = 'Tox21', target = 'NR-AR')`, Choose target from here	2014 Tox21 Data Challenge contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways. From MoleculeNet.	Tox21 Challenge.
ToxCast `Toxicity(name = 'ToxCast', target = 'ACEA_T47D_80hr_Negative')`, Choose target from here	ToxCast includes qualitative results of over 600 experiments on 8k compounds. From MoleculeNet.	Richard, Ann M., et al. "ToxCast chemical landscape: paving the road to 21st century toxicology." Chemical research in toxicology 29.8 (2016): 1225-1251.
ClinTox `Toxicity(name = 'ClinTox')`	The ClinTox dataset compares drugs approved by the FDA with drugs that have failed clinical trials for toxicity reasons. From MoleculeNet.	Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. "A data-driven approach to predicting successes and failures of clinical trials." Cell chemical biology 23.10 (2016): 1294-1301.

High Throughput Screening BioAssays `HTS`

Dataset Name	Description	Reference	Type	Stats
SARS-CoV2 in vitro `HTS(name = 'SARSCoV2_Vitro_Touret')`	An in vitro screen of the Prestwick Chemical Library, composed of 1,520 approved drugs, in an infected cell-based assay.	Touret, F., Gilles, M., Barral, K. et al. In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication. Sci Rep 10, 13093 (2020).	Binary
SARS-CoV2 3CLPro `HTS(name = 'SARSCoV2_3CLPro_Diamond')`	A large XChem crystallographic fragment screen against SARS-CoV-2 main protease at high resolution.	Diamond Light Source	Binary
HIV `HTS(name = 'HIV')`	The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. From MoleculeNet.	AIDS Antiviral Screen Data. https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data	Binary	41,127 Drugs

Quantum Mechanics `QM`

Dataset Name	Description	Reference	Type	Stats
QM7 `QM(name = 'QM7', target = 'X')` Choose target from here	This dataset is for multitask learning where 14 properties (e.g. polarizability, HOMO and LUMO eigenvalues, excitation energies) have to be predicted at different levels of theory (ZINDO, SCS, PBE0, GW). From MoleculeNet.	L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.	Regression	7,211 drugs
QM8 `QM(name = 'QM8', target = 'X')` Choose target from here	Electronic spectra and excited state energy of small molecules calculated by multiple quantum mechanical methods. From MoleculeNet.	L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.	Regression	22,000 drugs
QM9 `QM(name = 'QM9', target = 'X')` Choose target from here	Geometric, energetic, electronic and thermodynamic properties of DFT-modelled small molecules. From MoleculeNet.	R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.	Regression	22,000 drugs

Interaction Prediction

Drug-Target Interaction Prediction Dataset `DTI`

Dataset Name	Description	Reference	Type	Stats (pairs/#drugs/#targets)
BindingDB `DTI(name = 'BindingDB_X')` Choose X from Kd, IC50, EC50, or Ki	BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules.	BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities	Regression (log)/Binary	66,444/10,665/1,413 for Kd, 1,073,803/549,205/5,078 for IC50, 151,413/91,773/1,240 for EC50, 410,478/174,662/3,070 for Ki
DAVIS `DTI(name = 'DAVIS')`	The interaction of 72 kinase inhibitors with 442 kinases covering >80% of the human catalytic protein kinome.	Davis, M., Hunt, J., Herrgard, S. et al. Comprehensive analysis of kinase inhibitor selectivity. Nat Biotechnol 29, 1046–1051 (2011).	Regression (log)/Binary	30,056/68/379
KIBA `DTI(name = 'KIBA')`	An integrated drug-target bioactivity matrix across 52,498 chemical compounds and 467 kinase targets, including a total of 246,088 KIBA scores, has been made freely available.	Tang J, Szwajda A, Shakyawar S, et al. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J Chem Inf Model. 2014;54(3):735-743.	Regression	118,254/2,068/229

Drug-Drug Interaction Prediction Dataset `DDI`

Dataset Name	Description	Reference	Type	Stats (pairs/#drugs)
DrugBank	DrugBank drug-drug interaction dataset is manually sourced from FDA/Health Canada drug labels as well as primary literature. It has 86 interaction types. Drug SMILES is provided.	Wishart DS, et al. (2017) DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res 46:D1074–D1082.	Multi-Class/Network	191,519/1,706
TWOSIDES	Polypharmacy side-effects are associated with drug pairs (or higher-order drug combinations) and cannot be attributed to either individual drug in the pair (in a drug combination).	Tatonetti, Nicholas P., et al. Data-driven prediction of drug effects and interactions. Science Translational Medicine. 2012.	Multi-Label/Network	4,649,441/645

Protein-Protein Interaction Prediction Dataset `PPI`

Dataset Name	Description	Reference	Type	Stats (pairs/#proteins)
HuRI `PPI(name = 'HuRI')`	All pairwise combinations of human protein-coding genes are systematically being interrogated to identify which are involved in binary protein-protein interactions. In our most recent effort 17,500 proteins have been tested and a first human reference interactome (HuRI) map has been generated. From the Center for Cancer Systems Biology at Dana-Farber Cancer Institute. Note that the feature is the peptide sequence; if a protein gene is associated with multiple peptides, they are separated by '*'.	Luck, K., Kim, D., Lambourne, L. et al. A reference map of the human binary protein interactome. Nature 580, 402–408 (2020).	Binary/Network	51,813/8,248
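
Because a HuRI sequence field may hold several peptides joined by '*', downstream code usually needs to split it before encoding. The sketch below is a minimal illustration in plain Python: the helper names and the toy sequences are hypothetical, and expanding a pair into all peptide combinations is only one possible way to handle such entries, not something TDC prescribes.

```python
from itertools import product


def split_peptides(sequence_field, sep="*"):
    """Split a '*'-separated sequence field into individual peptide sequences."""
    return [seq for seq in sequence_field.split(sep) if seq]


def expand_pair(field_a, field_b):
    """List every peptide-level combination for one protein-protein pair."""
    return list(product(split_peptides(field_a), split_peptides(field_b)))


# Toy sequences, not taken from HuRI.
protein_a = "MKTAYIAK*MSLLTEVE"
protein_b = "MGSSHHHH"
combos = expand_pair(protein_a, protein_b)
print(combos)
```

Other strategies, such as keeping only the first peptide per gene, would work with the same `split_peptides` helper.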

Peptide-MHC Binding Prediction Dataset `PeptideMHC`

Dataset Name	Description	Reference	Type	Stats (pairs/#peptides/#ofMHCs)
MHC1_NetMHCpan `PeptideMHC(name = 'MHC1_NetMHCpan')`	Binding of peptides to MHC class I molecules (MHC-I) is essential for antigen presentation to cytotoxic T-cells. An organized dataset for MHC class I, collected from the IEDB and IMGT/HLA databases.	Nielsen, Morten, and Massimo Andreatta. "NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets." Genome medicine 8.1 (2016): 1-9.	Regression	185,985/43,018/150
MHC2_NetMHCIIpan `PeptideMHC(name = 'MHC2_NetMHCIIpan')`	Major histocompatibility complex class II (MHC‐II) molecules are found on the surface of antigen‐presenting cells where they present peptides derived from extracellular proteins to T helper cells. Useful to identify T‐cell epitopes. An organized dataset for MHC class II, collected from the IEDB database.	Jensen, Kamilla Kjaergaard, et al. "Improved methods for predicting peptide binding affinity to MHC class II molecules." Immunology 154.3 (2018): 394-406.	Regression	134,281/17,003/75

Generation

Paired Molecule Generation `MolGenPaired`

Dataset Name	Description	Reference	Type	Stats (#pairs/#drugs)
DRD2 `MolGenPaired(name = 'DRD2')`				34,404/21,703
QED `MolGenPaired(name = 'QED')`				88,306/52,262
logP `MolGenPaired(name = 'LogP')`				99,909/99,794
JNK3
GSK-3beta

Retrosynthesis `RETRO`

Dataset Name	Description	Reference	Type	Stats (#drugs)
USPTO-50K

Forward Synthesis `FORWARD`

Dataset Name	Description	Reference	Type	Stats (#drugs)
USPTO-50K

Reaction Prediction `REACT`

Dataset Name	Description	Reference	Type	Stats (#drugs)
USPTO-50K

Data Split

To retrieve the dataset split, simply run:

data = X(name = Y)
data.get_split(seed = 'benchmark')
# {'train': df_train, 'val': df_val, 'test': df_test}

You can specify the splitting method, random seed, and split fractions as arguments, e.g. data.get_split(method = 'cold_drug', seed = 1, frac = [0.7, 0.1, 0.2]). For drug property prediction, a scaffold split is also provided: simply set method = 'scaffold'.
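
The idea behind a cold split can be sketched in plain Python. This is an illustration under stated assumptions, not TDC's code: the function `cold_split`, the record layout, and the toy IDs are hypothetical. The point is that whole drugs, rather than individual rows, are assigned to a fold, so no drug seen in training appears in validation or test.

```python
import random


def cold_split(records, key, frac=(0.7, 0.1, 0.2), seed=1):
    """Split records so that no `key` value (e.g. a drug ID) spans two folds."""
    groups = sorted({r[key] for r in records})   # unique group ids, in a stable order
    random.Random(seed).shuffle(groups)          # reproducible shuffle for a given seed
    n_train = round(len(groups) * frac[0])
    n_val = round(len(groups) * frac[1])
    train_keys = set(groups[:n_train])
    val_keys = set(groups[n_train:n_train + n_val])
    split = {"train": [], "val": [], "test": []}  # remaining groups go to test
    for r in records:
        if r[key] in train_keys:
            split["train"].append(r)
        elif r[key] in val_keys:
            split["val"].append(r)
        else:
            split["test"].append(r)
    return split


# Toy interaction table: 10 hypothetical drugs, each paired with 2 targets.
records = [{"drug": f"D{i}", "target": f"T{j}", "y": 0.0}
           for i in range(10) for j in range(2)]
split = cold_split(records, key="drug", frac=(0.7, 0.1, 0.2), seed=1)
print({name: len(rows) for name, rows in split.items()})
```

A scaffold split follows the same grouping logic if each record's `key` is its molecular scaffold instead of its drug ID; computing scaffolds requires a cheminformatics toolkit and is left out of this sketch.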

Benchmark and Leaderboard

We are actively working on a more systematic way to benchmark methods and maintain a leaderboard, and plan to release this feature in the next version. In the meantime, if you have expertise or interest in helping build this feature, please email kexinhuang@hsph.harvard.edu.

Examples: How to Make Predictions

TDC is designed for running experiments rapidly. The data output can be used directly with powerful prediction packages. Here, we show how to use DeepPurpose for more advanced drug/protein encoders such as MPNN and Transformers.

Using DeepPurpose

CLICK HERE FOR THE CODE!

Contribute

TDC is designed to be a community-driven effort. We know DrugDataLoader only covers the tip of the iceberg of the data out there. You can easily upload your data by writing a function that takes the expected input and returns the expected output. See the step-by-step instructions on the CONTRIBUTE page.

Contact

Send an email to kexinhuang@hsph.harvard.edu or open an issue.

Disclaimer

TDC is an open-source effort. Many datasets are aggregated from various public web sources. We use the Attribution-NonCommercial-ShareAlike 4.0 International license to satisfy the requirements of many of these datasets. If this still infringes the copyright of a dataset author, please let us know and we will take it down ASAP.
