theislab/chemCPA

OOD drug belinostat leaks into pretraining dataset


I don't know if you are aware of this: belinostat, one of the 32 OOD drugs you designated in sciplex_ood_splits.ipynb and meant for fine-tuning, already exists in the pretraining data, which is split from lincs_full_smiles_sciplex_genes.h5ad.

On closer examination, there are 5 OOD drugs in the lincs_sciplex pre-training dataset.

So, if the fine-tuning model loads the pre-trained model, information about these drugs leaks through, and the evaluation is not truly OOD.
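
A minimal sketch of how such an overlap could be checked, assuming the drug names from both datasets have already been extracted into plain Python lists. The function name `find_leaked_drugs` and the toy drug lists are hypothetical and not taken from the chemCPA code. Names are compared case-insensitively and with surrounding whitespace removed, since the two datasets may not spell drug names identically.

```python
def _normalize(name):
    """Lowercase a drug name and strip surrounding whitespace."""
    return name.strip().lower()


def find_leaked_drugs(ood_drugs, pretrain_drugs):
    """Return the OOD drugs that also occur among the pretraining drugs.

    The result is a sorted list of normalized (lowercased) names.
    """
    pretrain = {_normalize(d) for d in pretrain_drugs}
    ood = {_normalize(d) for d in ood_drugs}
    return sorted(ood & pretrain)


# Hypothetical toy lists; in practice these would be read from the
# drug-name columns of the two datasets.
ood = ["Belinostat", "DrugA", "DrugB"]
pretrain = ["belinostat ", "DrugC", "DrugD"]
print(find_leaked_drugs(ood, pretrain))  # ['belinostat']
```

An empty result would mean none of the designated OOD drugs appear in the pretraining data under this name-matching rule. Matching on canonical SMILES instead of names might catch additional overlaps.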

Hi @bhomass,

Thanks for pointing this out! I will provide new checkpoints in an updated version of this repo where the LINCS data matches the single-cell setting better. While this is not ideal, I checked the number of data points corresponding to these drugs, and they make up less than 0.3%. Given the strong shift between bulk and single-cell data, I am confident that the results still translate to the "true" OOD case.
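
For readers who want to reproduce this kind of check, the share of data points belonging to the overlapping drugs could be computed roughly as follows. This is only a sketch: `leaked_fraction` is a hypothetical helper, and the per-observation drug labels are assumed to be available as a plain list of strings. The toy numbers are made up and are not the actual LINCS counts.

```python
from collections import Counter


def leaked_fraction(pretrain_labels, leaked_drugs):
    """Fraction of pretraining observations whose drug label is in leaked_drugs.

    Labels are compared case-insensitively, ignoring surrounding whitespace.
    Returns 0.0 for an empty dataset.
    """
    counts = Counter(label.strip().lower() for label in pretrain_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    leaked = {d.strip().lower() for d in leaked_drugs}
    # Counter returns 0 for drugs that never occur in the labels.
    return sum(counts[d] for d in leaked) / total


# Made-up toy data: 3 of 1000 observations belong to a leaked drug.
labels = ["belinostat"] * 3 + ["DrugC"] * 997
print(f"{leaked_fraction(labels, ['Belinostat']):.1%}")  # 0.3%
```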