mims-harvard/TDC

ADMET Plasma Protein Binding dataset is incorrectly translated from the source (AstraZeneca)

EvanKomp opened this issue · 4 comments

Describe the bug
The PPBR dataset from AstraZeneca in the TDC ADMET group is described as binding in human plasma. I believe approximately 1200 datapoints are actually from other species. PPBR can be highly variable between species, and I cannot see a situation where the different plasma source organisms in this data should not be treated explicitly.

To Reproduce
TDC downloaded via:

from tdc.single_pred import ADME
data = ADME(name = 'PPBR_AZ')
tdc = data.get_data()

This gives a DataFrame of length 2790.

Raw data downloaded from ChEMBL: https://www.ebi.ac.uk/chembl/g/#browse/activities/filter/document_chembl_id%3ACHEMBL3301361%20AND%20standard_type%3A(%22PPB%22)

and loaded into a DataFrame named raw, of length 2828.

Group each DataFrame by molecule ID and record the data as follows:

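# Map each molecule with more than one raw measurement to {organism: value}.
raw_results = {}
for chembl_id, group in raw.groupby('Molecule ChEMBL ID'):
    if len(group) > 1:
        raw_results[chembl_id] = dict(zip(group['Assay Organism'].values, group['Standard Value'].values))

# Collect the TDC labels for each drug that appears more than once.
tdc_results = {}
for drug_id, group in tdc.groupby('Drug_ID'):
    if len(group) > 1:
        tdc_results[drug_id] = group['Y'].values

# Print the TDC labels next to the per-organism values from the raw data.
for drug_id, values in tdc_results.items():
    print('#######################')
    if drug_id in raw_results:
        print(drug_id)
        print('TDC reports values: ', values)
        print('Data from AZ: ', raw_results[drug_id])
    else:
        print(drug_id, ' Not in raw data')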

This produces an output for each ChEMBL ID with multiple data points, as in the screenshot.
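To put a number on how many rows are non-human, a per-species count over the raw export may help. This is a minimal sketch, assuming the export carries the same 'Assay Organism' column used above. The toy rows and IDs below are invented for illustration and are not values from the dataset.

```python
import pandas as pd

# Toy stand-in for the ChEMBL export; IDs and values are invented.
raw = pd.DataFrame({
    "Molecule ChEMBL ID": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_B", "CHEMBL_C", "CHEMBL_C"],
    "Assay Organism": ["Homo sapiens", "Mus musculus", "Homo sapiens",
                       "Homo sapiens", "Rattus norvegicus"],
    "Standard Value": [97.0, 40.0, 88.5, 99.1, 95.0],
})

# Number of measurements per species.
species_counts = raw["Assay Organism"].value_counts()

# Rows that are not human plasma measurements.
non_human = raw[raw["Assay Organism"] != "Homo sapiens"]

print(species_counts)
print(f"{len(non_human)} of {len(raw)} rows are non-human")
```

Running the same two lines on the real raw DataFrame should show the species breakdown directly.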

Expected behavior
Data labeled as PPBR in humans should not contain data from other species.

Screenshots
image

Environment:
NA

Hi! Thanks for pointing this out!

We could change the description to PPBR prediction regardless of species, and also provide metadata about the species of each data point for interested users.

Would that work? Since each species has little data individually, I am concerned that splitting by species would limit an ML model's ability to make strong predictions. It could be an interesting test dataset for cross-species prediction, though. What are your thoughts?

Hey Kexin!

The second part of the suggestion is, I believe, a good way to go. However, I think the first part is a bit off: PPBR can vary substantially across species. I can find some papers if you are interested.

This means that any predictor trained on an amalgamation of species, without knowledge of the species itself, will never have much predictive power. For example, one compound in the dataset reports ~40% binding in mouse and 97% in human, yet the model sees those two points as two measurements with identical X. The aleatoric uncertainty introduced by treating all species as the same target means that the ceiling for model performance is very low. This can be seen in the leaderboards, where the best models reach ~8% MAE while simply predicting the dataset mean gives 15%. Any predictive power is likely carried by the majority class (human), which has 1.6k datapoints. I would also point out that even a perfect model for predicting, e.g., the average PPBR across 5 species would be of very little use. It is unlikely that anyone would ever need that sort of "cross species" value. What is useful is being able to predict PPBR accurately in a particular species.
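The ceiling argument can be made concrete with the approximate 40% (mouse) and 97% (human) figures quoted above. When two labels share an identical input, a model has to emit one prediction for both, so the error on that pair cannot drop below half the gap between them. The sketch below is only an illustration of that arithmetic, not an analysis of the actual dataset.

```python
# Two labels for the same molecule (identical X), per the example above.
labels = [40.0, 97.0]

def mae(prediction, targets):
    """Mean absolute error of a single constant prediction against targets."""
    return sum(abs(prediction - t) for t in targets) / len(targets)

# Try every whole-percent prediction from 0 to 100 and keep the lowest error.
best_possible = min(mae(float(p), labels) for p in range(0, 101))
print(best_possible)  # 28.5
```

Any prediction between 40 and 97 gives the same 28.5-point error on this pair, so no amount of modelling can improve on it without the species as an input.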

Providing species labels would probably be most useful to users. However, to stay consistent with the easy-to-use API and keep the dataset a single-target task, dropping the ~1.2k datapoints from other species and keeping only the 1.6k human datapoints is probably the best call. There is also a reasonable amount of rat data (~700), but the other species have too few datapoints to be learnable, IMO.
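Once species labels are available, restricting to one species is a simple filter. The sketch below assumes the data carries a species column. The column name `Species` and the helper `subset_by_species` are hypothetical and not part of TDC's API, and the rows are made up.

```python
import pandas as pd

def subset_by_species(df, species, column="Species"):
    """Return only the rows measured in the given species (hypothetical helper)."""
    return df[df[column] == species].reset_index(drop=True)

# Toy frame in the Drug_ID / Y layout, plus an assumed 'Species' column.
data = pd.DataFrame({
    "Drug_ID": ["d1", "d1", "d2", "d3"],
    "Y": [97.0, 40.0, 88.5, 95.0],
    "Species": ["Homo sapiens", "Mus musculus", "Homo sapiens", "Rattus norvegicus"],
})

human = subset_by_species(data, "Homo sapiens")
print(human)
```

In this toy frame, each Drug_ID appears only once after filtering, so the conflicting labels for identical inputs are gone.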

Thanks! That makes a lot of sense. I think that by default it should return the 1.6k human samples, while leaving the rest (along with the species data) available as well. The rest could be retrieved through an auxiliary function, in case some users need it. Will follow up once we have implemented that!

This is fixed in #170! Now the default returns the Homo sapiens subset, and you can also get additional species using:

from tdc.single_pred import ADME
data = ADME(name = 'PPBR_AZ')
data.get_other_species('Rattus norvegicus')