mims-harvard/TDC

Why does TDC contain less data than the original paper?

StefanIsSmart opened this issue · 1 comments

Describe your question.

Question 1:

For example, the original Caco-2 dataset has 1272 compounds, but TDC has only 906. Why?

"After a series of pretreatments, 1272 compounds and their permeability values were finally collected as further analysis. Their SMILES structures and permeability values can be found in SI1." (from the paper)

Question 2:

The SMILES strings in TDC are not the same as those in the original dataset.
Taking the Caco-2 dataset as an example, the SMILES of the compound "dexamethasone b D glucuronide" in the original dataset is F[C@@]12[C@H]([C@@H]3C[C@@H](C)[C@](O)(C(=O)CO[C@H]4O[C@@H](C(O)=O)[C@H](O)[C@@H](O)[C@@H]4O)[C@]3(C[C@@H]1O)C)CCC1=CC(=O)C=C[C@@]12C
but in your dataset it is: C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO[C@H]1O[C@@H](C(=O)O)[C@H](O)[C@@H](O)[C@@H]1O

What is the difference? Are you sure that the differing SMILES strings in the original dataset and in TDC represent the same molecule?

Question 3:

How can I find the baseline performance on your collected dataset?

Hi! Thanks for the great questions!

A1: Please go by the data in TDC for the exact number; we have applied additional filters since the paper.

A2: We canonicalize the SMILES using RDKit. These two SMILES are different strings for the same molecular structure.
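
If you want to check this yourself, here is a minimal sketch, assuming RDKit is installed. The helper names `canonical` and `same_molecule` are only illustrative and are not part of TDC. It parses both strings and compares their RDKit canonical SMILES, which include stereochemistry by default; the exact canonicalization settings TDC uses may differ.

```python
from rdkit import Chem


def canonical(smiles: str) -> str:
    """Return RDKit's canonical SMILES for the given SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # MolToSmiles produces canonical, isomeric SMILES by default.
    return Chem.MolToSmiles(mol)


def same_molecule(smiles_a: str, smiles_b: str) -> bool:
    """True if both strings canonicalize to the same SMILES."""
    return canonical(smiles_a) == canonical(smiles_b)


# The two strings quoted in the question above.
paper_smiles = (
    "F[C@@]12[C@H]([C@@H]3C[C@@H](C)[C@](O)(C(=O)CO[C@H]4O[C@@H](C(O)=O)"
    "[C@H](O)[C@@H](O)[C@@H]4O)[C@]3(C[C@@H]1O)C)CCC1=CC(=O)C=C[C@@]12C"
)
tdc_smiles = (
    "C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)"
    "[C@@]1(O)C(=O)CO[C@H]1O[C@@H](C(=O)O)[C@H](O)[C@@H](O)[C@@H]1O"
)

print(canonical(paper_smiles))
print(canonical(tdc_smiles))
print(same_molecule(paper_smiles, tdc_smiles))
```

The same comparison works on simple cases, e.g. `same_molecule("OCC", "CCO")` treats the two spellings of ethanol as one molecule.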

A3: Check out the baseline results here: https://tdcommons.ai/benchmark/admet_group/01caco2/