kexinhuang12345/MolTrans

Dataset

Closed · 1 comment

Good evening, greetings from Colombia. I find your work on MolTrans and the TDC library very interesting; it saves a lot of work.
I was reviewing the MolTrans datasets and compared them with the datasets in the TDC library, specifically DAVIS and BindingDB. I noticed that the SMILES strings are different. To explain:
In the MolTrans SMILES strings, the "=" symbol appears where there is a bond between atoms, but it does not appear in the SMILES strings from the TDC library.

  • Could you explain the reason for this? Is it a mistake, or is it a different type of representation?
  • Do you think that transforming them to SELFIES would solve this problem and let me compare the datasets?

[Image: smilestdc (SMILES strings from the TDC library)]

[Image: smilesmoltrans (SMILES strings from MolTrans)]

Hi, sorry for the late reply. I don't think they are the same drug, though. "=" denotes a double bond, so it should appear in both datasets. To be on the safe side, you could convert both to canonical SMILES with RDKit:

from rdkit import Chem
Chem.MolToSmiles(Chem.MolFromSmiles('C1=CC=CN=C1'))  # parse, then write back as canonical SMILES
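To check whether two entries from the two datasets are really the same drug, the canonicalization above can be wrapped in a small comparison. This is only a minimal sketch, assuming RDKit is installed; the helper names `canonical_smiles` and `same_molecule` are hypothetical and not part of MolTrans or TDC. One general reason two strings for the same molecule can differ is that an aromatic ring may be written either with explicit double bonds (`C1=CC=CN=C1`) or with lowercase aromatic atoms and no "=" (`c1ccncc1`). Whether that is what happens in these datasets is not confirmed here.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)  # returns None for invalid SMILES
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)


def same_molecule(smiles_a, smiles_b):
    """True only if both strings parse and share the same canonical SMILES."""
    canon_a = canonical_smiles(smiles_a)
    canon_b = canonical_smiles(smiles_b)
    return canon_a is not None and canon_a == canon_b


# Pyridine written with explicit double bonds vs. aromatic lowercase atoms
print(same_molecule('C1=CC=CN=C1', 'c1ccncc1'))
# Pyridine vs. piperidine (the fully saturated ring), a different molecule
print(same_molecule('C1=CC=CN=C1', 'C1CCNCC1'))
```

If the canonical forms of two entries still differ after this step, the entries most likely describe different compounds rather than different spellings of the same one.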