Dataset
1treu1 commented
Good evening, and greetings from Colombia. I find your work on MolTrans and the TDC library very interesting; it saves a lot of work.
I was reviewing the MolTrans datasets and comparing them with the datasets in the TDC library, specifically DAVIS and BindingDB. I noticed that the SMILES strings are different. To explain:
In the MolTrans SMILES strings, the "=" symbol appears where there is a bond between atoms, but it does not appear in the SMILES strings from the TDC library.
- Could you explain the reason for this? Is it a mistake, or is it another type of representation?
- Do you think that converting them to SELFIES would solve this problem and let me compare the datasets?
kexinhuang12345 commented
Hi, sorry for the late reply. I don't think they are the same drug, though. "=" denotes a double bond, so it should appear in both datasets. To be on the safe side, you could convert the SMILES from both datasets to their canonical form and compare those:
from rdkit import Chem
Chem.MolToSmiles(Chem.MolFromSmiles('C1=CC=CN=C1'))  # returns RDKit's canonical SMILES for the input
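To make the comparison concrete, here is a minimal sketch of checking whether two SMILES strings describe the same molecule by canonicalizing both with RDKit. The helper names `canonical_smiles` and `same_molecule` are hypothetical, made up for this illustration, and the example strings are not taken from either dataset. One possible reason for the missing "=", which would need checking against the actual files, is that a ring written with alternating double bonds (Kekulé form) can also be written in aromatic lowercase form, which contains no "=" at all.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)


def same_molecule(smiles_a, smiles_b):
    """True when both strings parse and share one canonical SMILES."""
    can_a = canonical_smiles(smiles_a)
    can_b = canonical_smiles(smiles_b)
    return can_a is not None and can_a == can_b


# Pyridine written with explicit double bonds (Kekulé form) and in aromatic form.
print(same_molecule("C1=CC=CN=C1", "c1ccncc1"))  # True

# Pyridine versus piperidine (no double bonds): genuinely different molecules.
print(same_molecule("C1=CC=CN=C1", "C1CCNCC1"))  # False
```

If the strings from the two datasets still differ after this step, they most likely refer to different compounds rather than to different notations for the same one.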