Duplicates in ADMET group test set with different classification labels
eereenah-fast opened this issue · 2 comments
eereenah-fast commented
Hello!
We have found some duplicates in the BBB_Martins dataset of the ADMET group. Some of the duplicates have different classification labels:
from tdc.benchmark_group import admet_group
group = admet_group(path = 'data/')
benchmark = group.get('BBB_Martins')
train_val, test = benchmark['train_val'], benchmark['test']
print("Train Data Example 1:")
print(train_val.iloc[1573])
print(train_val.iloc[1574])
print("\nTrain Data Example 2:")
print(train_val.iloc[1377])
print(train_val.iloc[1378])
print("\nTest Data Example 1:")
print(test.iloc[175])
print(test.iloc[176])
print("\nTest Data Example 2:")
print(test.iloc[350])
print(test.iloc[351])
Train Data Example 1:
Drug_ID loratadine
Drug CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1
Y 0
Name: 1573, dtype: object
Drug_ID loratadine
Drug CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1
Y 1
Name: 1574, dtype: object
--------------------------------
Train Data Example 2:
Drug_ID BRL53080
Drug CN(C)C(=O)C(CCN1CCC(O)(c2ccc(Cl)cc2)CC1)(c1ccc...
Y 1
Name: 1377, dtype: object
Drug_ID loperamide
Drug CN(C)C(=O)C(CCN1CCC(O)(c2ccc(Cl)cc2)CC1)(c1ccc...
Y 0
Name: 1378, dtype: object
--------------------------------
Test Data Example 1:
Drug_ID Miconazole
Drug Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1
Y 0
Name: 175, dtype: object
Drug_ID miconazole
Drug Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1
Y 1
Name: 176, dtype: object
-------------------------------
Test Data Example 2:
Drug_ID mequitazine
Drug c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2
Y 1
Name: 350, dtype: object
Drug_ID mequitazine
Drug c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2
Y 0
Name: 351, dtype: object
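The pairs above were picked out by index. To list every such case at once, a minimal pandas sketch follows. It assumes only the column names shown in the output (Drug_ID, Drug, Y); the helper name `conflicting_duplicates` and the toy frame are hypothetical, with values copied from the rows printed above. It matches on the exact SMILES string, so the same molecule written as two different SMILES strings would not be caught.

```python
import pandas as pd

# Toy stand-in for a benchmark split; values are copied from the output above.
toy = pd.DataFrame({
    "Drug_ID": ["loratadine", "loratadine", "Miconazole", "miconazole", "mequitazine"],
    "Drug": [
        "CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1",
        "CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1",
        "Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1",
        "Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1",
        "c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2",
    ],
    "Y": [0, 1, 0, 1, 1],
})

def conflicting_duplicates(df, key="Drug", label="Y"):
    """Return rows whose key value occurs with more than one distinct label."""
    # Count distinct labels per SMILES and broadcast the count back to each row.
    n_labels = df.groupby(key)[label].transform("nunique")
    return df[n_labels > 1].sort_values(key)

conflicts = conflicting_duplicates(toy)
print(conflicts)
```

The same call should work on `train_val` or `test` from the snippet above, since both are DataFrames with these columns.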
kexinhuang12345 commented
Hi! We suspect this is due to differences in the experimental readouts for the same drug. We will provide an additional function with several options for filtering the duplicates. It is similar to the DTI harmonizing function: https://tdcommons.ai/multi_pred_tasks/dti/#bindingdb
kexinhuang12345 commented
Sorry for the late update! You can now take the max or min of the duplicated readouts, or remove them, by simply calling:
from tdc.single_pred import ADME
data = ADME(name = 'BBB_Martins')
data.harmonize('remove_all') # 'max'/'min'
See more details in commit fc3e55e.
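For readers who want to see what the three options could mean in practice, here is a rough sketch in plain pandas. It is not TDC's implementation: the function name `harmonize_labels` is hypothetical, and the semantics are an assumption (that 'remove_all' drops every molecule whose duplicates disagree, while 'max' and 'min' keep one row per molecule with the largest or smallest label). Check the commit above for the actual behavior.

```python
import pandas as pd

def harmonize_labels(df, mode="remove_all", key="Drug", label="Y"):
    """Collapse duplicated molecules to one row each (assumed semantics, see text)."""
    if mode == "remove_all":
        # Drop molecules with conflicting labels, then keep one copy of the rest.
        n_labels = df.groupby(key)[label].transform("nunique")
        kept = df[n_labels == 1].drop_duplicates(subset=key)
        return kept.reset_index(drop=True)
    if mode in ("max", "min"):
        # Aggregate the label per molecule and attach it to the first row's metadata.
        agg = df.groupby(key, sort=False)[label].agg(mode).reset_index()
        first = df.drop_duplicates(subset=key).drop(columns=label)
        return first.merge(agg, on=key)
    raise ValueError(f"unknown mode: {mode!r}")

# Toy frame: molecule "C" appears twice with conflicting labels.
toy = pd.DataFrame({
    "Drug_ID": ["a", "a2", "b"],
    "Drug": ["C", "C", "CC"],
    "Y": [0, 1, 1],
})

for mode in ("remove_all", "max", "min"):
    print(mode)
    print(harmonize_labels(toy, mode))
```

For a binary label such as BBB penetration, 'max' treats any positive readout as positive and 'min' treats any negative readout as negative, so the choice encodes which kind of error is preferred.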