Questions about stereoisomer issues in the evaluation of GeoMol
qcxia20 commented
GeoMol/scripts/compare_confs.py
Lines 49 to 56 in 5d0e850
- This function is used to filter out conformers whose SMILES is inconsistent with the given SMILES (in this script, `corrected_smi`). In my reproduction, most of the inconsistent cases are molecules with a Z/E double bond. These cases are not filtered out when `isomericSmiles=False`, which confuses me, and I am not sure whether this is a mistake.
- For example, reference conformers with SMILES `Cc1cc(C(=O)c2cnc(/N=C/N(C)C)s2)c(F)cc1Cl` and `Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl` are currently all kept for comparison, although GeoMol was only used to generate conformers for `Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl`.
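To make the point above concrete, here is a minimal sketch (not code taken from `compare_confs.py`) that canonicalizes the two example SMILES with RDKit's `Chem.MolToSmiles`, once with `isomericSmiles=False` and once with `isomericSmiles=True`. The helper names `canonical_smiles` and `same_molecule` are hypothetical and only serve the illustration.

```python
from rdkit import Chem

# The two reference SMILES from the example above; they differ only in the
# configuration of the N=C double bond. Raw strings keep the backslash intact.
SMI_A = r"Cc1cc(C(=O)c2cnc(/N=C/N(C)C)s2)c(F)cc1Cl"
SMI_B = r"Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl"


def canonical_smiles(smiles, isomeric):
    """Return the RDKit canonical SMILES, with or without stereo information."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, isomericSmiles=isomeric)


def same_molecule(smiles_1, smiles_2, isomeric):
    """Hypothetical helper: compare two SMILES after canonicalization."""
    return canonical_smiles(smiles_1, isomeric) == canonical_smiles(smiles_2, isomeric)


if __name__ == "__main__":
    # Without stereo information the two isomers collapse to one string,
    # so a SMILES-based filter cannot tell them apart.
    print("isomericSmiles=False:", same_molecule(SMI_A, SMI_B, isomeric=False))
    # With stereo information the two isomers stay distinct.
    print("isomericSmiles=True: ", same_molecule(SMI_A, SMI_B, isomeric=True))
```

A filter built on the `isomeric=False` comparison would keep reference conformers of both double-bond isomers, which matches the behavior described above.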
Lines 125 to 126 in 5d0e850
- By contrast, the code in `model/featurization.py` filters out conformers whose SMILES is inconsistent with the SMILES in the dataset.
- So if I use `compare_confs.py` to calculate the performance with `isomericSmiles=False`, conformers with different isomeric SMILES are not filtered out, and the performance is the same as or even worse than before (since GeoMol generates only one stereoisomer from the given SMILES).
- The performance comparison between GeoMol predictions and the reference data (before using clean_confs; using clean_confs; changing to `isomericSmiles=True`):
**Before**
Recall Coverage: Mean = 74.78, Median = 85.00
Recall AMR: Mean = 0.9471, Median = 0.9176
Precision Coverage: Mean = 71.84, Median = 87.50
Precision AMR: Mean = 1.0035, Median = 0.9649
**After (with clean_confs, more confs are included than before)**
Recall Coverage: Mean = 74.30, Median = 90.00
Recall AMR: Mean = 0.9489, Median = 0.8797
Precision Coverage: Mean = 65.50, Median = 81.80
Precision AMR: Mean = 1.1044, Median = 1.0041
**isomericSmiles=True**
Recall Coverage: Mean = 83.38, Median = 100.00
Recall AMR: Mean = 0.8233, Median = 0.8079
Precision Coverage: Mean = 72.73, Median = 87.50
Precision AMR: Mean = 0.9833, Median = 0.8895
As you can see, with `isomericSmiles=True` the results reported in the GeoMol paper can be reproduced.
When I looked further into this issue, I found another odd thing: given input SMILES that differ only in their stereoisomerism, GeoMol generates conformers that are close in 3D geometry, even though those conformers correspond to different stereoisomers according to their SMILES. This does not happen with RDKit ETKDG, and I am not sure whether it affects GeoMol's performance on these molecules. Here are two examples:
SMILES | GeoMol (trans) | GeoMol (cis) | ETKDG (trans) | ETKDG (cis) |
---|---|---|---|---|
O=S(=O)(_N=C(_c1ccccc1)N1CCOCC1)c1ccc(Br)cc1 | | | | |
Cc1cc(C(=O)c2cnc(_N=C_N(C)C)s2)c(F)cc1Cl | | | | |
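
One way to check whether a generated conformer actually has the double-bond configuration requested in the input SMILES is to re-derive the stereo labels from the 3D coordinates and compare them with the labels parsed from the SMILES. The sketch below does this for an ETKDG conformer using RDKit's `Chem.AssignStereochemistryFrom3D`. The helper names (`double_bond_labels`, `labels_from_3d`, `embed_with_etkdg`) and the random seed are assumptions made for illustration, and a GeoMol-generated molecule with a conformer could be passed to `labels_from_3d` in place of the ETKDG one.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Example SMILES from this issue; the two differ only at the N=C double bond.
SMI_A = r"Cc1cc(C(=O)c2cnc(/N=C/N(C)C)s2)c(F)cc1Cl"
SMI_B = r"Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl"


def double_bond_labels(mol):
    """Map each stereo double bond (sorted atom-index pair) to its label."""
    labels = {}
    for bond in mol.GetBonds():
        if bond.GetBondType() != Chem.BondType.DOUBLE:
            continue
        stereo = bond.GetStereo()
        if stereo != Chem.BondStereo.STEREONONE:
            key = tuple(sorted((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx())))
            labels[key] = str(stereo)
    return labels


def labels_from_3d(mol, conf_id=-1):
    """Re-derive double-bond labels from the coordinates of one conformer."""
    mol = Chem.Mol(mol)  # work on a copy so the caller's molecule is untouched
    Chem.AssignStereochemistryFrom3D(mol, confId=conf_id)
    return double_bond_labels(mol)


def embed_with_etkdg(smiles, seed=42):
    """Generate a single conformer with RDKit's default embedding (ETKDG)."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    if AllChem.EmbedMolecule(mol, randomSeed=seed) < 0:
        raise RuntimeError("embedding failed for " + smiles)
    return mol


if __name__ == "__main__":
    for smi in (SMI_A, SMI_B):
        expected = double_bond_labels(Chem.MolFromSmiles(smi))
        observed = labels_from_3d(embed_with_etkdg(smi))
        print(smi, "consistent with 3D:", expected == observed)
```

Heavy-atom indices are unchanged by `Chem.AddHs`, so the bond keys from the SMILES-derived molecule and from the embedded molecule can be compared directly.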