PattanaikL/GeoMol

Questions about stereoisomer issues in the evaluation of GeoMol


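from rdkit import Chem

def clean_confs(smi, confs):
    good_ids = []
    # Canonicalize the given SMILES without stereo information.
    smi = Chem.MolToSmiles(Chem.MolFromSmiles(smi), isomericSmiles=False)
    for i, c in enumerate(confs):
        # Keep a conformer only if its stereo-free SMILES matches the given one.
        conf_smi = Chem.MolToSmiles(Chem.RemoveHs(c), isomericSmiles=False)
        if conf_smi == smi:
            good_ids.append(i)
    return [confs[i] for i in good_ids]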

  • This function is used to filter out conformers whose SMILES is inconsistent with the given SMILES (corrected_smi in this script). In my reproduction, most of the inconsistent cases are molecules with a Z/E double bond. With isomericSmiles=False these cases are not filtered out, which confuses me, and I am not sure whether this is a mistake.
  • For example, reference conformers with SMILES Cc1cc(C(=O)c2cnc(/N=C/N(C)C)s2)c(F)cc1Cl and Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl are currently both kept for comparison, although GeoMol was used to generate conformers only for Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl.
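
The effect described in these two bullets can be checked directly on the two SMILES quoted above. This is a minimal sketch, assuming RDKit is installed. The `canonical` helper and the `isomeric` argument on `clean_confs` are hypothetical names introduced for illustration and are not part of the GeoMol code. The "conformers" in the demo are plain molecule objects built from SMILES, used only as stand-ins for the reference conformers.

```python
from rdkit import Chem

# The two reference SMILES quoted above; raw strings keep the backslash intact.
SMI_A = r"Cc1cc(C(=O)c2cnc(/N=C/N(C)C)s2)c(F)cc1Cl"
SMI_B = r"Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl"


def canonical(smi, isomeric):
    """Canonical SMILES, with or without stereo information (hypothetical helper)."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smi), isomericSmiles=isomeric)


def clean_confs(smi, confs, isomeric=True):
    """Illustrative variant of clean_confs with a switchable stereo check."""
    ref = canonical(smi, isomeric)
    return [
        c for c in confs
        if Chem.MolToSmiles(Chem.RemoveHs(c), isomericSmiles=isomeric) == ref
    ]


# Compare the two isomers with and without stereo information in the SMILES.
same_without_stereo = canonical(SMI_A, False) == canonical(SMI_B, False)
same_with_stereo = canonical(SMI_A, True) == canonical(SMI_B, True)
print(same_without_stereo, same_with_stereo)

# Stand-ins for reference conformers: one molecule object per isomer.
mols = [Chem.MolFromSmiles(SMI_A), Chem.MolFromSmiles(SMI_B)]
kept_loose = clean_confs(SMI_B, mols, isomeric=False)
kept_strict = clean_confs(SMI_B, mols, isomeric=True)
print(len(kept_loose), len(kept_strict))
```

With the stereo-free comparison both isomers pass the filter, while the isomeric comparison keeps only the isomer that matches the given SMILES.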

if conf_canonical_smi != canonical_smi:
    continue

  • In contrast, the code in model/featurization.py does filter out conformers whose SMILES is inconsistent with the SMILES in the dataset.
  • So if I use compare_confs.py to calculate the performance with isomericSmiles=False, conformers with a different isomeric SMILES are not filtered out, and the performance is the same as or even worse than before (since GeoMol was used to generate only the one stereoisomer given by the input SMILES).
  • The performance of the GeoMol predictions against the reference data in three settings (before using clean_confs; using clean_confs; after changing to isomericSmiles=True):
**Before**
Recall Coverage: Mean = 74.78, Median = 85.00
Recall AMR: Mean = 0.9471, Median = 0.9176
Precision Coverage: Mean = 71.84, Median = 87.50
Precision AMR: Mean = 1.0035, Median = 0.9649

**After (with clean_confs, more confs are included than before)**
Recall Coverage: Mean = 74.30, Median = 90.00
Recall AMR: Mean = 0.9489, Median = 0.8797
Precision Coverage: Mean = 65.50, Median = 81.80
Precision AMR: Mean = 1.1044, Median = 1.0041

**isomericSmiles=True**
Recall Coverage: Mean = 83.38, Median = 100.00
Recall AMR: Mean = 0.8233, Median = 0.8079
Precision Coverage: Mean = 72.73, Median = 87.50
Precision AMR: Mean = 0.9833, Median = 0.8895

As you can see, with isomericSmiles=True the results reported in the GeoMol paper can be reproduced.
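
To illustrate why extra reference conformers of the other stereoisomer can pull the recall numbers down, here is a small sketch of coverage and AMR computed from an RMSD matrix. It assumes the usual definitions (coverage is the percentage of conformers whose best match is below a threshold, AMR is the mean of the best-match RMSDs). The function name `coverage_and_amr`, the threshold of 1.25 Å, and the toy RMSD values are illustrative assumptions, not taken from compare_confs.py.

```python
def coverage_and_amr(rmsd, threshold):
    """Return (recall_cov, recall_amr, precision_cov, precision_amr).

    rmsd[i][j] is the RMSD between reference conformer i and generated
    conformer j. Coverage is returned as a percentage.
    """
    # Recall: best generated match for every reference conformer.
    ref_best = [min(row) for row in rmsd]
    # Precision: best reference match for every generated conformer.
    gen_best = [min(col) for col in zip(*rmsd)]

    def cov(best):
        return 100.0 * sum(b < threshold for b in best) / len(best)

    def amr(best):
        return sum(best) / len(best)

    return cov(ref_best), amr(ref_best), cov(gen_best), amr(gen_best)


# Toy data: 3 reference conformers x 2 generated conformers.
# The last row plays the role of a reference conformer of the other
# stereoisomer, which no generated conformer can match well.
toy_rmsd = [
    [0.5, 2.0],
    [1.5, 1.0],
    [3.0, 2.5],
]

with_other_isomer = coverage_and_amr(toy_rmsd, threshold=1.25)
without_other_isomer = coverage_and_amr(toy_rmsd[:2], threshold=1.25)
print(with_other_isomer)
print(without_other_isomer)
```

In this toy case, dropping the unmatched reference row raises recall coverage and lowers recall AMR, while the precision numbers stay the same.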


Looking further into this issue, I found another odd thing: GeoMol generates conformers that are close in 3D geometry even when the input SMILES specify different stereoisomers. In other words, conformers that are nearly identical in 3D correspond to different stereoisomers according to their input SMILES. This does not happen with RDKit ETKDG, and I am not sure whether it affects GeoMol's performance on these molecules. Here are two examples:

| SMILES | GeoMol (trans) | GeoMol (cis) | ETKDG (trans) | ETKDG (cis) |
| --- | --- | --- | --- | --- |
| O=S(=O)(_N=C(_c1ccccc1)N1CCOCC1)c1ccc(Br)cc1 | image | image | image | image |
| Cc1cc(C(=O)c2cnc(_N=C_N(C)C)s2)c(F)cc1Cl | image | image | image | image |
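
One way to check whether a generated geometry really has the double-bond configuration requested in the input SMILES is to discard the stereo labels and re-perceive them from the 3D coordinates. The sketch below assumes RDKit and uses ETKDG on 2-butene as a toy stand-in for a generated conformer. The helper names `stereo_from_3d`, `geometry_matches_input`, and `embed_etkdg` are hypothetical, and applying the same check to GeoMol output is an assumption, not something the GeoMol code does.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def stereo_from_3d(mol_3d):
    """Isomeric SMILES perceived only from the 3D coordinates of mol_3d."""
    mol = Chem.Mol(mol_3d)                 # work on a copy
    Chem.RemoveStereochemistry(mol)        # forget labels inherited from the input
    Chem.AssignStereochemistryFrom3D(mol)  # re-derive them from the geometry
    return Chem.MolToSmiles(Chem.RemoveHs(mol), isomericSmiles=True)


def geometry_matches_input(smi, mol_3d):
    """True if the 3D geometry has the stereochemistry written in smi."""
    expected = Chem.MolToSmiles(Chem.MolFromSmiles(smi), isomericSmiles=True)
    return stereo_from_3d(mol_3d) == expected


def embed_etkdg(smi, seed=42):
    """Embed one 3D conformer with RDKit's default distance-geometry method."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smi))
    if AllChem.EmbedMolecule(mol, randomSeed=seed) == -1:
        raise RuntimeError("embedding failed")
    return mol


# Toy example: trans- and cis-2-butene.
trans_smi, cis_smi = "C/C=C/C", r"C/C=C\C"
trans_3d = embed_etkdg(trans_smi)
print(geometry_matches_input(trans_smi, trans_3d))
print(geometry_matches_input(cis_smi, trans_3d))
```

A conformer whose perceived SMILES differs from the input SMILES would correspond to the behaviour shown in the GeoMol columns of the table.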