OpenBioSim/biosimspace

[BUG] BSS.Convert.toRDKit() appears to lose stereochemical information

Closed this issue ยท 11 comments

Hey again,

I think I've encountered a bug with the Convert module (or I might be overlooking a crucial bit of code). Converting from a BioSimSpace / Sire mol to an rdmol via BSS.Convert.toRDKit() appears to lose stereochemical info.

To Reproduce
Steps to reproduce the behavior:

from rdkit import Chem
import BioSimSpace as BSS

# Define enantiomer
a = 'CC[C@@H](O)C'

# Check stereochemistry of the molecules
a

# Parameterise
bs_a = BSS.Parameters.gaff(a).getMolecule()

# Check the stereochemistry is retained in the parameterised molecules 
bs_a._sire_object.view()

# Convert back to rdkit
rd_a = BSS.Convert.toRDKit(bs_a)
rd_a

Expected behavior
It seems to me like there is no reason for the stereochemistry to be removed. The reason this has become an issue for me is because I'm using an rdkit-based flexible alignment method rather than rmsd/rmsdAlign, and so working with rdkit molecules that exhibit correct stereochemical centres is important. Perhaps I've missed something in the docs that'll help me achieve this.

(please complete the following information):
OS: [e.g. Linux (Ubuntu 20.04.6 LTS (Focal Fossa)
Version of Python: Python 3.12.4
Version of BioSimSpace: 2024.2.0
I confirm that I have checked this bug still exists in the latest released version of BioSimSpace: yes

Cheers,
Noah

Hi there. What's b?

NameError: name 'b' is not defined

Oops sorry was working with two molecules then cut one out when I wrote this post. Have edited the original post

I don't think your assertion that the parametrised molecule retains the stereochemistry is correct. If I do the following then everything is fine:

import BioSimSpace as BSS

# Create a molecule directly from the SMILES string.
bss_mol = BSS.Convert.smiles("CC[C@@H](O)C")

# Convert to RDKit.
rdkit_mol = BSS.Convert.toRDKit(bss_mol)

# Check the SMILES representation.
print(Chem.MolToSmiles(rdkit_mol))
'[H]O[C@@]([H])(C([H])([H])[H])C([H])([H])C([H])([H])[H]'

Converting to RDKit uses the Sire-to-RDKit converter behind the scenes. If required attributes are missing, then stereochemistry is preserved by using an algorithm from MDAnalysis, which we have ported and tested reasonably well. However, in your case you could just simply use a BioSimSpace starting molecule that has been created from SMILES, then the required attributes should be preserved throughout, for example:

from rdkit import Chem
import BioSimSpace as BSS

# Define enantiomer
a = 'CC[C@@H](O)C'

# Check stereochemistry of the molecules
a

# Parameterise
bs_a = BSS.Parameters.gaff(BSS.Convert.smiles(a)).getMolecule()

# Check the stereochemistry is retained in the parameterised molecules 
bs_a._sire_object.view()

# Convert back to rdkit
rd_a = BSS.Convert.toRDKit(bs_a)
rd_a

# Print the SMILES representation.
print(Chem.MolToSmiles(rd_a))
'[H]O[C@@]([H])(C([H])([H])[H])C([H])([H])C([H])([H])[H]'

Note the only difference here is:

bs_a = BSS.Parameters.gaff(BSS.Convert.smiles(a)).getMolecule()

This means that the molecule passed in has the required property keys, which are preserved on return, i.e.:

smiles_mol = BSS.Convert.smiles(a)
print(smiles_mol._sire_object.propertyKeys())
['chirality',
 'hybridization',
 'isotope',
 'is_aromatic',
 'coordinates',
 'mass',
 'formal_charge',
 'connectivity',
 'element']

If you just pass in a SMILES string, then these are only attached to an intermediate molecule used in the calculation. I think I might be able to update this so that they are also added. The original SMILES parsing for the parameterisation functions didn't use the internal toRDKit functionality, since it wasn't available. Instead it went via OpenFF, so there was no intermediate BioSimSpace molecule. Will see what I can do since it would be nice if this worked either way.

This is now fixed in #331.

Thanks for looking into this.

Regarding parameterised mols not containing stereo-information, how are you able to tell this? When I look at the molecule in 2D (._sire_object.view2d()) it looks like there is a wedge bond.

I'm working with SDFs and only used the butanol smiles as an example. Forgive me if I'm misinterpreting your response, or if your fix tackles SDF mols too, but it seems to me that this issue extends beyond smiles parsing.

For example:

from rdkit import Chem
import BioSimSpace as BSS

smi = 'CC[C@@H](O)C'

# Process smiles in rdkit
mol = Chem.MolFromSmiles(smi)
mol = Chem.AddHs(mol)
Chem.AllChem.EmbedMolecule(mol)

# Write to SDF
Chem.SDWriter('mol.sdf').write(mol)

# Load SDF
bs_mol = BSS.IO.readMolecules('mol.sdf')

# Parameterise
bs_mol_param = BSS.Parameters.gaff(bs_mol.getMolecule(0)).getMolecule()

# Convert to rdkit
rd_conv = BSS.Convert.to(bs_mol_param, "rdkit")

# Check smiles for stereochem
Chem.MolToSmiles(rd_conv)
'[H]OC([H])(C([H])([H])[H])C([H])([H])C([H])([H])[H]'

Every chance I'm missing something here but it feels to me like this could be a slightly different issue to what your fix concerns.

Cheers

Hmm, I see. It's strange that things look okay using the view2d function, since that's literally using Sire to convert to RDKit behind the scenes. Looking at the code here I see:

        # assign stereochemistry to the rest, and also remove
        # 3D conformers as they mess up the 2D view
        try:
            for r in rdkit_mols:
                # assign the stereochemistry from the structure
                try:
                    from rdkit.Chem.rdmolops import AssignStereochemistryFrom3D

                    AssignStereochemistryFrom3D(r)
                except Exception:
                    # does not matter if this fails
                    pass

                # we also don't want any conformers,
                # as these mess up the 2D view
                r.RemoveAllConformers()
        except Exception:
            try:
                from rdkit.Chem.rdmolops import AssignStereochemistryFrom3D

                AssignStereochemistryFrom3D(rdkit_mols)
            except Exception:
                # does not matter if this fails
                pass

So, it looks like it's using RDkit to apply stereochemistry to the structure using the Python API. Presumably this isn't done in the converter, which uses the C++ API. (I see no equivalent function call.)

No, the fix won't deal with SDF, although SDF attributes should be preserved in the parametrerised molecule and used in the conversion to RDKit.

In this sense, the view is a bit misleading since the stereochemistry is not part of the BioSimSpace molecule and wouldn't (necessarily) be present when using the Sire-to-RDKit converter.

Interesting, I wasn't aware of that RDKit function - that's actually very useful. Managed to get around my issues by using it post-conversion, since the 3D information is retained (even if chiral centres weren't being explicitly defined).

Great, glad it was helpful.

Is it okay to close this issue? I've implemented the code to preserve the SMILES information in the molecule when that was the source of input to the parametrisation. One thing that we've noticed regarding stereochemistry and the view2d function is that this almost entirely comes from the connectivity property in the system. This will be replaced by the output of the parameterisation, i.e. what comes from the AMBER files generated by AmberTools or OpenFF, so won't necessarily be the same as what was in the original molecule. This has caused confusion in some cases, e.g. where the stereochemistry implied by the connectivity from the AMBER bonding looks weird.

Okay that's useful to know, thanks for pointing that out. Original issue is all sorted on my end :)