mims-harvard/TDC

Duplicates in ADMET group test set with different classification labels

eereenah-fast opened this issue · 2 comments

Hello!

We have found duplicates in the BBB_Martins dataset of the ADMET group. Some of the duplicated entries carry different classification labels:

from tdc.benchmark_group import admet_group

group = admet_group(path = 'data/')
benchmark = group.get('BBB_Martins')
train_val, test = benchmark['train_val'], benchmark['test']

print("Train Data Example 1:")
print(train_val.iloc[1573])
print(train_val.iloc[1574])

print("\nTrain Data Example 2:")
print(train_val.iloc[1377])
print(train_val.iloc[1378])

print("\nTest Data Example 1:")
print(test.iloc[175])
print(test.iloc[176])

print("\nTest Data Example 2:")
print(test.iloc[350])
print(test.iloc[351])

Output:

Train Data Example 1:
Drug_ID                                      loratadine
Drug       CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1
Y                                                     0
Name: 1573, dtype: object

Drug_ID                                      loratadine
Drug       CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1
Y                                                     1
Name: 1574, dtype: object

--------------------------------
Train Data Example 2:
Drug_ID                                             BRL53080
Drug       CN(C)C(=O)C(CCN1CCC(O)(c2ccc(Cl)cc2)CC1)(c1ccc...
Y                                                          1
Name: 1377, dtype: object

Drug_ID                                           loperamide
Drug       CN(C)C(=O)C(CCN1CCC(O)(c2ccc(Cl)cc2)CC1)(c1ccc...
Y                                                          0
Name: 1378, dtype: object

--------------------------------
Test Data Example 1:
Drug_ID                                     Miconazole
Drug       Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1
Y                                                    0
Name: 175, dtype: object

Drug_ID                                     miconazole
Drug       Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1
Y                                                    1
Name: 176, dtype: object

-------------------------------
Test Data Example 2:
Drug_ID                            mequitazine
Drug       c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2
Y                                            1
Name: 350, dtype: object

Drug_ID                            mequitazine
Drug       c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2
Y                                            0
Name: 351, dtype: object
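
To list every such conflict instead of spotting pairs by index, one option is to group rows by SMILES string and count the distinct labels per group. The following is a minimal sketch on a small hand-built DataFrame with the same columns (`Drug_ID`, `Drug`, `Y`) as the benchmark frames. The helper name `find_label_conflicts` is hypothetical and not part of TDC, and the last toy row is a placeholder added only to show a non-conflicting entry.

```python
import pandas as pd


def find_label_conflicts(df, smiles_col="Drug", label_col="Y"):
    """Return rows whose SMILES string appears with more than one distinct label."""
    # Number of distinct labels seen for each SMILES, broadcast back to every row.
    n_labels = df.groupby(smiles_col)[label_col].transform("nunique")
    return df[n_labels > 1].sort_values(smiles_col)


# Toy frame: four rows quoted in this issue plus one placeholder row.
toy = pd.DataFrame({
    "Drug_ID": ["loratadine", "loratadine", "mequitazine", "mequitazine", "placeholder"],
    "Drug": [
        "CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1",
        "CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1",
        "c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2",
        "c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2",
        "CCO",  # placeholder SMILES and label, not taken from the dataset
    ],
    "Y": [0, 1, 1, 0, 1],
})

conflicts = find_label_conflicts(toy)
print(conflicts)
```

The same call should work on `train_val` and `test` from the snippet above, assuming they keep these column names. Note that exact string matching only catches identical SMILES strings; equivalent molecules written differently would need canonicalization first.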

Hi! We suspect this comes from differing experimental readouts for the same drug. We will provide an additional function with several options for filtering such duplicates. It is similar to the DTI harmonizing function: https://tdcommons.ai/multi_pred_tasks/dti/#bindingdb

Sorry for the late update! You can now take the max or min of the duplicated readouts, or remove them, by simply calling:

from tdc.single_pred import ADME
data = ADME(name = 'BBB_Martins')
data.harmonize('remove_all')  # or 'max' / 'min'

For more details, see fc3e55e
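
For readers who want to see what the three modes could amount to on a plain DataFrame, here is a rough pandas sketch. It is not TDC's implementation: the function name `resolve_duplicates` is hypothetical, the SMILES strings are placeholders, and it assumes that 'remove_all' drops every row of a duplicated SMILES while 'max' and 'min' keep one row per SMILES with the largest or smallest label. The referenced commit is the place to check the actual behavior.

```python
import pandas as pd


def resolve_duplicates(df, mode, key="Drug", label="Y"):
    """Illustrative duplicate handling; assumed semantics, not TDC's code."""
    if mode == "remove_all":
        # Drop every row whose key occurs more than once.
        return df[~df.duplicated(subset=key, keep=False)].reset_index(drop=True)
    if mode in ("max", "min"):
        # Sort so the wanted label comes first within each key, then keep that row.
        ordered = df.sort_values(label, ascending=(mode == "min"), kind="stable")
        kept = ordered.drop_duplicates(subset=key, keep="first")
        # Restore the original row order before renumbering.
        return kept.sort_index().reset_index(drop=True)
    raise ValueError(f"unknown mode: {mode!r}")


# Placeholder data: one SMILES with conflicting labels, one unique SMILES.
toy = pd.DataFrame({
    "Drug_ID": ["a", "a", "b"],
    "Drug": ["CCO", "CCO", "CCN"],
    "Y": [0, 1, 1],
})

for mode in ("remove_all", "max", "min"):
    print(mode)
    print(resolve_duplicates(toy, mode))
```

Which mode is appropriate depends on the use case: removing all conflicting entries is the conservative choice for benchmarking, while 'max' or 'min' keeps the compound at the cost of picking one readout.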