openforcefield/openff-toolkit

openff.toolkit.Molecule unable to parse molecule in .sdf file, but is ok with parsing from rdkit

David-Araripe opened this issue · 4 comments

I have a .sdf file that I want to parse using openff.toolkit.Molecule. In general, I don't have problems with reading molecules like this.

However, this was an exception. When trying to read the attached .sdf file with the method Molecule.from_file(fpath, allow_undefined_stereo=True), I got the following logging message:

[07:20:59] Explicit valence for atom # 2 O, 3, is greater than permitted
68 Explicit valence for atom # 2 O, 3, is greater than permitted

Followed by the error:

in FrozenMolecule.from_file(cls, file_path, file_format, toolkit_registry, allow_undefined_stereo)
   3912     mols = toolkit.from_file_obj(
   3913         file_obj,
   3914         file_format=file_format,
   3915         allow_undefined_stereo=allow_undefined_stereo,
   3916         _cls=cls,
   3917     )
   3919 if len(mols) == 0:
-> 3920     raise MoleculeParseError(f"Unable to read molecule from file: {file_path}")
   3921 elif len(mols) == 1:
   3922     return mols[0]

MoleculeParseError: Unable to read molecule from file: 68_27_aligned.sdf

Solves it for the moment
A workaround that I found to work is loading the molecule by doing:

fpath = "68_27_aligned.sdf"
supplier = Chem.SDMolSupplier(fpath)[0]
molecule = Molecule.from_rdkit(supplier)

Is this the desired behavior from the Molecule class? If so, is there some parameter I can pass so that I can still read he molecule file using the from_file() method?

To reproduce the error & get the version I'm using

from openff import toolkit
from openff.toolkit 
toolkit._version.get_versions()
fpath = "68_27_aligned.sdf"
molecule = Molecule.from_file(fpath, allow_undefined_stereo=True)

Where the version details are:

{'date': '2024-04-12T12:59:25-0700',
 'dirty': False,
 'error': None,
 'full-revisionid': '3bb4d9faa0cb481770c0974a845581f98629fb45',
 'version': '0.16.0'}

Forgot the send the file, sorry:
68_27_aligned.txt

I don't think you're doing anything wrong; I can load it using the OEChem backend and it looks like a sensible molecule in Avogadro as well. The toolkit internally has a different way of going through to the SDF supplier which is probably the cause of this trivalent oxygen you're seeing (?)

sdf_supplier = Chem.ForwardSDMolSupplier(
file_path, removeHs=False, sanitize=False, strictParsing=True
)
/
def _process_sdf_supplier(self, sdf_supplier, allow_undefined_stereo, _cls):

Specifically here's where the failure bubbles up from:

Chem.SanitizeMol(
rdmol,
Chem.SANITIZE_ALL
^ Chem.SANITIZE_SETAROMATICITY
^ Chem.SANITIZE_ADJUSTHS,
)

I don't know why this is happening - Jeff, Josh, or somebody else may drop by and actually answer your question 😅

Indeed, there was something wrong with the molecule I sent. I added hydrogens at some point of processing, but I noticed that was wrong in the end, as my molecules had an "empty" valence configuration. Once I corrected this issue, I didn't see this bug anymore! Thank you very much for having a look into this!

If it's something you'd like to further investigate, fine, but from my side I'm fine with having this issue closed

Thanks for writing in and for the update, @David-Araripe. I'll mark this as closed.