Data splitting with enumerated SMILES
lorenzoFabbri opened this issue · 2 comments
I'm trying to use LSTMs to predict a molecular property. I was writing my own code but then I found out that OpenChem has more or less everything I need already.
I have a question regarding data splitting. I must say I did not go over the entire library.
When I was using my own code, I decided to use SMILES enumeration since my dataset is rather small. In doing so, I was wondering whether to keep all the SMILES of the same compound in the same set (either training or validation). It seems that OpenChem does not take this into consideration and the split is done randomly (SMILES codes of the same compound can appear both in the training set and the validation set). Is my understanding correct? If so, isn't this a form of data leakage?
Thank you.
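For readers following along, the leakage concern above can be avoided with a group-aware split. This is only a minimal sketch, not OpenChem code: it assumes each record carries a compound identifier (for example a canonical SMILES), and the function name `group_split` and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def group_split(records, val_fraction=0.2, seed=0):
    """Split (compound_id, smiles, label) records so that every compound
    ends up entirely in the training set or entirely in the validation set."""
    # Collect all enumerated SMILES that belong to the same compound.
    groups = defaultdict(list)
    for compound_id, smiles, label in records:
        groups[compound_id].append((compound_id, smiles, label))

    # Shuffle compounds (not individual SMILES) and cut at the compound level.
    compound_ids = sorted(groups)
    random.Random(seed).shuffle(compound_ids)
    n_val = max(1, round(val_fraction * len(compound_ids)))

    val = [r for cid in compound_ids[:n_val] for r in groups[cid]]
    train = [r for cid in compound_ids[n_val:] for r in groups[cid]]
    return train, val


# Toy data (hypothetical): two equivalent SMILES per compound.
records = [
    ("ethanol", "CCO", 0), ("ethanol", "OCC", 0),
    ("acetic_acid", "CC(=O)O", 1), ("acetic_acid", "OC(C)=O", 1),
    ("benzene", "c1ccccc1", 0), ("benzene", "C1=CC=CC=C1", 0),
    ("methylamine", "CN", 1), ("methylamine", "NC", 1),
    ("propane", "CCC", 0), ("propane", "C(C)C", 0),
]
train, val = group_split(records, val_fraction=0.2, seed=0)
```

Because the shuffle and the cut happen on compound identifiers, no compound can contribute SMILES to both sets, whichever enumeration was used.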
Dear Lorenzo:
Thanks for using OpenChem! Please let us know about your experience.
> It seems that OpenChem does not take this into consideration and the split is done randomly
That's up to the practitioner :) Our intention is that SMILES augmentation should be applied either after the split or on the fly during the actual training.
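The "split first, augment afterwards" order can be sketched as follows. This is an illustration under stated assumptions, not OpenChem's implementation: `split_then_augment`, `toy_enumerate`, and the `TOY_VARIANTS` table are hypothetical names, and the toy enumerator is a hand-written lookup standing in for a real SMILES enumerator (for instance one built on RDKit).

```python
import random

# Hand-written equivalent SMILES, used only as a stand-in for a real enumerator.
TOY_VARIANTS = {
    "CCO": ["CCO", "OCC"],
    "CC(=O)O": ["CC(=O)O", "OC(C)=O"],
    "c1ccccc1": ["c1ccccc1", "C1=CC=CC=C1"],
    "CN": ["CN", "NC"],
    "CCC": ["CCC", "C(C)C"],
}


def toy_enumerate(smiles):
    """Return known equivalent SMILES for the input (hypothetical helper)."""
    return TOY_VARIANTS.get(smiles, [smiles])


def split_then_augment(compounds, enumerate_fn, val_fraction=0.2, seed=0):
    """compounds: list of (smiles, label), one entry per unique compound.
    The split is done on unique compounds; only the training part is enumerated."""
    shuffled = list(compounds)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(val_fraction * len(shuffled)))

    val = shuffled[:n_val]  # left un-augmented
    train = [
        (variant, label)
        for smiles, label in shuffled[n_val:]
        for variant in enumerate_fn(smiles)
    ]
    return train, val


compounds = [("CCO", 0), ("CC(=O)O", 1), ("c1ccccc1", 0), ("CN", 1), ("CCC", 0)]
train, val = split_then_augment(compounds, toy_enumerate)
```

Since enumeration runs only on compounds already assigned to the training set, the validation set never sees an alternative SMILES of a training compound.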
Thanks for the quick response.
Taking the provided examples (e.g., Tox21): if I understand correctly, the compounds in the training set are enumerated while the compounds in the validation set are not. Is that correct?
Have you also tried enumerating the compounds in the validation set and averaging the predictions for each compound?
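To make the averaging idea concrete, here is a minimal sketch of what I mean. It assumes the model has already produced one prediction per enumerated validation SMILES and that each prediction is tagged with a compound identifier; the function name `average_by_compound` and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_by_compound(compound_ids, predictions):
    """Average the predictions of all enumerated SMILES of the same compound."""
    grouped = defaultdict(list)
    for compound_id, prediction in zip(compound_ids, predictions):
        grouped[compound_id].append(prediction)
    return {compound_id: mean(values) for compound_id, values in grouped.items()}


# Hypothetical predictions: two SMILES for ethanol, three for benzene.
ids = ["ethanol", "ethanol", "benzene", "benzene", "benzene"]
preds = [0.2, 0.4, 0.9, 0.7, 0.8]
averaged = average_by_compound(ids, preds)
```

The metric would then be computed once per compound on `averaged`, rather than once per SMILES string.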
To be honest, I was not able to make it work with my dataset. It's very similar to Tox21 (a CSV file with labels and SMILES), but I keep getting many errors. Unfortunately, I did not keep track of all of them. A recurring one was:
RuntimeError: cuda runtime error: device-side assert triggered at...
Also, the provided code for Tox21 does not seem to work when the batch size is 1. I'll try again tomorrow. I think we can close this issue, though.