prody/ProDy

Errors when parsing mmCIF structures with mismatching chain ids

sforli opened this issue · 13 comments

There is at least one case in which when loading structures where the chain id assigned by the PDB (whatever that means) is different than the one assigned by the authors, ProDy raises exceptions. One example is 8GLV, and this information is provided when hovering on the littie "i" symbol next to the chain in the PDB page for this entry.

Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import prody
>>> prody.parseMMCIF("8glv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/envs/chem/lib/python3.9/site-packages/prody/proteins/ciffile.py", line 117, in parseMMCIF
    result = parseMMCIFStream(cif, chain=chain, **kwargs)
  File "/home/user/miniconda3/envs/chem/lib/python3.9/site-packages/prody/proteins/ciffile.py", line 188, in parseMMCIFStream
    hd = getCIFHeaderDict(lines)
  File "/home/user/miniconda3/envs/chem/lib/python3.9/site-packages/prody/proteins/cifheader.py", line 168, in getCIFHeaderDict
    value = func(lines)
  File "/home/user/miniconda3/envs/chem/lib/python3.9/site-packages/prody/proteins/cifheader.py", line 812, in _getPolymers
    poly = polymers[ch]
KeyError: '0A'
>>>

I have no clue on what this mismatching means, but it's out there, and chances are it's going to be more frequent as larger structures get resolved.

Thanks. We’ll have a look.

perhaps one option here could be a try/except block

I dug a bit deeper using 8glv as an example and I found multiple potential issues, which seem to be all located in the _getPolymers() function in cifheader.py.

The error I reported seems to be triggered when parsing the entry _entity_poly, which appears to retain at least an extra semicolon at the beginning of _entity_poly.pdbx_strand_id, so when extracting chains:

chains = item["_entity_poly.pdbx_strand_id"]

the first chain id ends up being ;A0 instead of A0. The same seems to happen to the end of the line, with what looks an extra semicolon.

After removing those semicolons, another problem popped up when parsing the _struct_ref_seq, where a missing chain triggers a KeyError exception:

KeyError: 'ME'

Thank you for checking this. We can take a look at that too

ok, I've made some fixes in #1847 and now the header from 8glv parses fine:

  1. replace ; with nothing to fix the issue with 0A
  2. replace space with nothing to fix issue with ME
  3. increase gap insertion penalty for aligning missing residues to fix another problem that then came up

That was quick, thanks!
I'll give it a try as soon as possible, and I'll give you my feedback. I have a couple of more odd structures I can throw at it and see if it holds water.

Sounds good. Thanks

I haven't merged these changes yet, so you'll need to checkout the branch that has them or something equivalent.

I just saw this. Could you point me to the specific checkout I should take?

Yes, it's https://github.com/jamesmkrieger/ProDy/tree/cifheader_fixes

I think you should add a remote and then checkout like this:

git remote add james https://github.com/jamesmkrieger/ProDy
git checkout -b cifheader_fixes james/cifheader_fixes

Thanks! I've rested it with a few small and large CIF files and it seems to work well.

I noticed you added the string replacements when accessing the field info.
I'm not sure if the string to be fixed is an error in the mmCIF/STAR format, but if that's the case, would you consider adding those replacements at parsing time, so not every function should take care of sanitizing the fields it is accessing?
Unless this is just a test fix before committing to a more radical parsing change.

Glad to hear this works.

I'm not sure I understand what you mean. I think mmCIF has its own peculiarities compared to general STAR format and different kinds of problems probably do creep up for particular fields, so I don't think we can put it in the general STAR parser functions. Nevertheless, this is still a parsing function for the mmCIF headers, so I imagine it should be ok.

We can keep more radical changes in mind if we see we have systematic problems.

Great, thanks for helping with this.
What I meant was that my understanding is that your fix is not just a parsing fix, but STAR fields content shouldn't actually start with a semicolon.

I am fine with this solution, but I was worried that any new parsing function that would be added to the code would have to apply the fix (...again, assuming it's not just a need for your internal parsing).

Ok, thanks. That’s something we can bear in mind

Tagging @atbogetti and @karolamik13 so you’ve seen it too in case we come across a more general issue