SMILES output by the decoder are not canonical

Question

Closed this issue 3 years ago · 3 comments

Maybe not a bug, but annoying.
If you do a round trip, SMILES -> DeepSMILES -> SMILES, you expect
the original input file and the final output file to be identical.
For this to hold, the decoded SMILES has to be parsed into a molecule with RDKit,
and RDKit then has to write the output SMILES. That output equals the input SMILES, provided the input was also generated by RDKit.

Answer 1 · 2021-05-24T05:50:14.000Z

The transformation preserves the structure but not the exact form, and I don't see how it could do otherwise. For instance, whatever arbitrary ring closure digits the input used for phenyl rings are all converted to 6, so there is no way to recover the original digits when decoding. I'll add a note to the docs if this is not obvious.

Answer 2 · 2021-05-24T06:03:00.000Z

If the DeepSMILES to SMILES conversion is done purely by string manipulation, I agree.
Otherwise, maybe only the ring opening/closure numbering scheme needs to be updated so that
the numbering resembles that of a canonical SMILES.
Please note that I am only talking about decoding DeepSMILES back to SMILES.

Answer 3 · 2021-05-24T07:58:10.000Z

My implementation doesn't use a cheminformatics toolkit, and is done via string manipulation. In general the numbering is similar to standard SMILES output, but SMILES implementations differ. For example, on an atom with multiple ring closure/opening digits, certain toolkits put the closure digits before the opening while others do it the other way around. Some toolkits don't reuse ring closure digits until they get to 10, while others will reuse them straightaway.
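
To make the digit-reuse point concrete, here is a minimal sketch in plain Python (not part of the DeepSMILES code; the function name `renumber_ring_closures` is made up for illustration). It rewrites ring closure labels so that the lowest free label is reused as soon as a ring closes, which is one of the conventions mentioned above. It assumes a well-formed SMILES string, leaves digits inside bracket atoms alone, and does not reorder multiple closure digits on the same atom, so it is not a canonicalizer.

```python
import re

# Tokens: bracket atoms (kept verbatim), %nn labels, single-digit labels,
# and any other single character (atoms, bonds, branches).
_TOKEN = re.compile(r"\[[^\]]*\]|%\d\d|\d|.")
_RING_LABEL = re.compile(r"%?(\d+)")


def renumber_ring_closures(smiles: str) -> str:
    """Relabel ring closures, always reusing the lowest label that is free."""
    open_rings = {}  # original label -> new label, for rings still open
    out = []
    for tok in _TOKEN.findall(smiles):
        match = _RING_LABEL.fullmatch(tok)
        if match is None:
            out.append(tok)  # not a ring closure label, copy unchanged
            continue
        old = int(match.group(1))
        if old in open_rings:
            new = open_rings.pop(old)  # ring closes, its label becomes free
        else:
            in_use = set(open_rings.values())
            new = next(n for n in range(1, 100) if n not in in_use)
            open_rings[old] = new  # ring opens under the new label
        out.append(str(new) if new < 10 else f"%{new}")
    return "".join(out)


if __name__ == "__main__":
    # Biphenyl written without digit reuse, then with immediate reuse.
    print(renumber_ring_closures("c1ccccc1c2ccccc2"))
```

Running both the input and the decoded output through the same relabeling removes differences that come only from the choice of digits. Differences in atom order or in the order of several digits on one atom would still need a toolkit.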