Potential data leakage of ContextPred and AttrMasking on the ADMET group benchmark

Question

lihan97 opened this issue 2 years ago · 2 comments

The ContextPred and AttrMasking methods on the ADMET leaderboard were pre-trained on the ChEMBL dataset (https://doi.org/10.1039/C8SC00148K) in a supervised manner. That dataset contains 1310 biochemical assays, several of which correspond to ADMET group benchmark tasks: CHEMBL1741321 to CYP2D6_Veith, CHEMBL1741324 to CYP3A4_Veith, CHEMBL1741325 to CYP2C9_Veith, CHEMBL1909136 to CYP2D6_Substrate_CarbonMangels, CHEMBL1909135 to CYP2C9_Substrate_CarbonMangels, and CHEMBL1909138 to CYP3A4_Substrate_CarbonMangels (see Table 2 in https://doi.org/10.1039/C8SC00148K). Could you check whether this causes data leakage on these benchmarks?
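For reference, a minimal sketch of how such an overlap could be quantified, assuming the pre-training molecules and a benchmark test split are both available as identifiers canonicalized in the same way (for example canonical SMILES or InChIKeys). The function name `leaked_fraction` and the toy lists below are hypothetical and are not taken from the benchmark or from DGL-LifeSci.

```python
def leaked_fraction(pretrain_ids, test_ids):
    """Return the test identifiers also seen in pre-training and their share of the test set.

    Assumes both inputs were canonicalized identically beforehand;
    plain string comparison will miss matches otherwise.
    """
    pretrain = set(pretrain_ids)
    test = set(test_ids)
    overlap = pretrain & test
    # Guard against an empty test set to avoid division by zero.
    fraction = len(overlap) / len(test) if test else 0.0
    return overlap, fraction


# Hypothetical toy data standing in for real pre-training and test molecules.
pretrain_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
test_smiles = ["CCO", "CCN", "CC(=O)O", "CCCC"]

overlap, fraction = leaked_fraction(pretrain_smiles, test_smiles)
print(sorted(overlap), fraction)
```

A non-zero fraction on a CYP test split would only show that the molecules were seen during pre-training. Whether their labels were also seen depends on whether the matching assays were part of the supervised pre-training targets.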

Answer 1 · 2022-07-28T16:24:26.000Z

Thanks for pointing this out. We will open an issue on the DGL-LifeSci GitHub repository to confirm whether these assays are indeed included in the pre-training procedure. If so, we will add a note to these two baselines in the CYP-based benchmarks. Stay tuned.

Answer 2 · 2022-08-07T06:43:12.000Z

It looks like there is indeed potential data leakage. Thanks for pointing it out! We are removing these two methods from the affected benchmarks.