mims-harvard/TDC

ADMET leaderboard: some molecules appear both in the `train_val` and `test` sets

agamemnonc opened this issue · 1 comments

Description
I am looking into the ADMET group in single instance prediction. For some of the datasets, a small subset of molecules appears in both the train_val and test datasets.

To Reproduce
The following script:

from tdc.benchmark_group import admet_group


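group = admet_group(path='/opt/project/data/')
overlap = dict()
for benchmark in group:
    # Compare the two splits by drug identifier (the Drug_ID column)
    train = benchmark["train_val"]["Drug_ID"].values
    test = benchmark["test"]["Drug_ID"].values
    # Keep every train_val identifier that also occurs in the test split
    overlap[benchmark['name']] = [id_ for id_ in train if id_ in test]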

for dset in overlap:
    print(dset, len(overlap[dset]))

prints:

caco2_wang 12
hia_hou 0
pgp_broccatelli 1
bioavailability_ma 0
lipophilicity_astrazeneca 0
solubility_aqsoldb 5
bbb_martins 2
ppbr_az 0
vdss_lombardo 0
cyp2d6_veith 0
cyp3a4_veith 0
cyp2c9_veith 0
cyp2d6_substrate_carbonmangels 0
cyp3a4_substrate_carbonmangels 0
cyp2c9_substrate_carbonmangels 0
half_life_obach 0
clearance_microsome_az 0
clearance_hepatocyte_az 0
herg 0
ames 0
dili 0
ld50_zhu 11

For example, if we look at the blood-brain barrier dataset:

print(overlap['bbb_martins'])

we get the following molecules:

['Bretyliumtosilate', 'rifampin']

Expected behavior
The intersection of the train_val and test sets should be empty for all datasets.
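The expectation above amounts to the two splits sharing no identifiers. A minimal sketch of that check on toy data follows; `split_overlap` and the `mol_*` identifiers are hypothetical names used only for illustration and are not part of TDC.

```python
def split_overlap(train_ids, test_ids):
    """Return the sorted identifiers that occur in both splits."""
    return sorted(set(train_ids) & set(test_ids))


# Hypothetical toy identifiers, not taken from any TDC dataset.
train_ids = ["mol_a", "mol_b", "mol_c"]
test_ids = ["mol_c", "mol_d"]

shared = split_overlap(train_ids, test_ids)
print(shared)  # ['mol_c']
```

Because this uses sets, each shared identifier is reported once, whereas the list comprehension in the reproduction script counts an identifier once for every time it occurs in train_val.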

Environment:

  • OS: Linux (via Docker) (should not matter)
  • Python version: 3.9
  • TDC version: 0.4.1

Thanks for raising this important issue! I think there are some drug ID naming issues. If you look at overlaps between the molecules' SMILES strings instead, e.g.:

from tdc.benchmark_group import admet_group

group = admet_group()
overlap = dict()
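for benchmark in group:
    # Compare the two splits by SMILES string (the Drug column)
    train = benchmark["train_val"]["Drug"].values
    test = benchmark["test"]["Drug"].values
    # Keep every train_val SMILES string that also occurs in the test split
    overlap[benchmark['name']] = [id_ for id_ in train if id_ in test]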

for dset in overlap:
    print(dset, len(overlap[dset]))

You can see that there are 0 overlaps:

caco2_wang 0
hia_hou 0
pgp_broccatelli 0
bioavailability_ma 0
lipophilicity_astrazeneca 0
solubility_aqsoldb 0
bbb_martins 0
ppbr_az 0
vdss_lombardo 0
cyp2d6_veith 0
cyp3a4_veith 0
cyp2c9_veith 0
cyp2d6_substrate_carbonmangels 0
cyp3a4_substrate_carbonmangels 0
cyp2c9_substrate_carbonmangels 0
half_life_obach 0
clearance_microsome_az 0
clearance_hepatocyte_az 0
herg 0
ames 0
dili 0
ld50_zhu 0
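To see which shared names are behind the Drug_ID overlap, one option is to list the identifiers that occur in both splits but map to different SMILES strings. The following is only a sketch on toy data: the function names, the `name_*` identifiers and the placeholder SMILES are hypothetical and do not come from any TDC dataset.

```python
from collections import defaultdict


def group_smiles_by_id(records):
    """Map each identifier to the set of SMILES strings recorded for it."""
    grouped = defaultdict(set)
    for drug_id, smiles in records:
        grouped[drug_id].add(smiles)
    return grouped


def shared_ids_with_different_smiles(train_records, test_records):
    """Find identifiers present in both splits that share no SMILES string."""
    train = group_smiles_by_id(train_records)
    test = group_smiles_by_id(test_records)
    result = {}
    for drug_id in sorted(train.keys() & test.keys()):
        # Same identifier, but no SMILES string in common between the splits
        if not (train[drug_id] & test[drug_id]):
            result[drug_id] = (sorted(train[drug_id]), sorted(test[drug_id]))
    return result


# Hypothetical (identifier, SMILES) pairs for illustration only.
train_records = [("name_x", "CCO"), ("name_y", "CCN")]
test_records = [("name_x", "CCC"), ("name_z", "CO")]

print(shared_ids_with_different_smiles(train_records, test_records))
# {'name_x': (['CCO'], ['CCC'])}
```

With the benchmark dataframes, the records could presumably be built with `zip(benchmark["train_val"]["Drug_ID"], benchmark["train_val"]["Drug"])` and the same for the test split. Note that this compares SMILES as plain strings, so the same molecule written as two different SMILES strings would not be detected without canonicalizing them first.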