OOD drug belinostat leaks into pretraining dataset
I don't know if you are aware of this: belinostat, one of the 32 OOD drugs you designated in sciplex_ood_splits.ipynb and meant for fine-tuning, already exists in the pretraining data, which is split from lincs_full_smiles_sciplex_genes.h5ad.
On closer examination, there are 5 OOD drugs in the lincs_sciplex pretraining dataset.
So, if the fine-tuning model loads the pretrained model, there is a leak, and the evaluation is not truly OOD.
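A check like the one described above can be sketched as a set intersection over drug names. This is only an illustration under assumptions: the function name `find_leaked_drugs` and the toy drug lists are hypothetical, and in practice the names would be read from the notebook's OOD list and from the drug annotation of the pretraining data (the exact column name is not given in this thread).

```python
def find_leaked_drugs(ood_drugs, pretrain_drugs):
    """Return the OOD drug names that also occur in the pretraining data.

    Names are compared after stripping whitespace and lowercasing, since
    the two datasets may not spell drug names identically (an assumption).
    """
    def normalize(name):
        return name.strip().lower()

    pretrain = {normalize(d) for d in pretrain_drugs}
    ood = {normalize(d) for d in ood_drugs}
    return sorted(ood & pretrain)


# Toy data for illustration only; not the real drug lists.
ood_drugs = ["Belinostat", "drug_a", "drug_b"]
pretrain_drugs = ["belinostat ", "drug_c", "drug_c", "drug_d"]

leaked = find_leaked_drugs(ood_drugs, pretrain_drugs)
print(leaked)
```

Matching on names can miss synonyms, so comparing canonical SMILES strings instead would likely be more robust if both datasets provide them.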
Hi @bhomass,
Thanks for pointing this out! I will provide new checkpoints in an updated version of this repo where the LINCS data matches the single-cell setting better. While this is not ideal, I checked the number of data points corresponding to these drugs, and they account for less than 0.3%. Given the strong shift between bulk and single-cell data, I am confident that the results still translate to the "true" OOD case.
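For anyone who wants to reproduce this kind of count, a minimal sketch is given below. It assumes the pretraining data exposes one drug label per observation; the helper name `leaked_fraction` and the toy rows are hypothetical and do not reflect the actual LINCS numbers.

```python
from collections import Counter


def leaked_fraction(pretrain_drug_per_row, leaked_drugs):
    """Fraction of pretraining rows whose drug is in the leaked set."""
    counts = Counter(pretrain_drug_per_row)
    leaked_rows = sum(counts[d] for d in set(leaked_drugs))
    total = len(pretrain_drug_per_row)
    # Guard against an empty dataset to avoid division by zero.
    return leaked_rows / total if total else 0.0


# Toy data for illustration only: 2 leaked rows out of 400.
rows = ["drug_c"] * 398 + ["belinostat"] * 2
frac = leaked_fraction(rows, ["belinostat"])
print(f"{frac:.2%}")
```

A small fraction limits the impact on pretraining, but it does not by itself show that the fine-tuned model is unaffected, which is why updated checkpoints remain the cleaner fix.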