titipata/pubmed_parser

parse_pubmed_caption() failing on some papers

oblodgett opened this issue · 0 comments

When parsing certain files for image captions:

import pubmed_parser as pp

pubmed_figuredata = pp.parse_pubmed_caption("PMC9539395.nxml")

Fails with the following error:

_process.py     
Traceback (most recent call last):
  File "test_process.py", line 18, in <module>
    pubmed_figuredata = pp.parse_pubmed_caption(paper_path)
  File "venv_sentence_parsing/lib/python3.8/site-packages/pubmed_parser/pubmed_oa_parser.py", line 425, in parse_pubmed_caption
    fig_label = stringify_children(fig.find("label"))
  File "venv_sentence_parsing/lib/python3.8/site-packages/pubmed_parser/utils.py", line 51, in stringify_children
    [node.text]
AttributeError: 'NoneType' object has no attribute 'text'

I would expect this file to parse correctly. Also, when parsing image captions, the subpoints under the caption label are not available in the output; see the same paper, PMC9539395, for an example.
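For reference, the traceback shows that `fig.find("label")` returned `None`, which is what `find` does when a `<fig>` element has no `<label>` child, and `stringify_children` then tried to read `.text` from it. Below is a minimal sketch of a None-safe lookup, not the library's actual fix. The `safe_stringify` helper and the XML snippet are hypothetical and only illustrate the guard; the sketch uses the standard-library `xml.etree.ElementTree` so it runs on its own.

```python
import xml.etree.ElementTree as ET


def safe_stringify(node):
    """Hypothetical helper: join the text of a node and its descendants.

    Returns an empty string when the node is missing (None) instead of
    raising AttributeError.
    """
    if node is None:
        return ""
    return "".join(node.itertext()).strip()


# Made-up figure element with a caption but no <label> child,
# the situation the traceback points to.
fig = ET.fromstring(
    "<fig><caption><p>Overview of the <italic>assay</italic>.</p></caption></fig>"
)

fig_label = safe_stringify(fig.find("label"))      # find() returns None here
fig_caption = safe_stringify(fig.find("caption"))

print(repr(fig_label))
print(repr(fig_caption))
```

With a guard like this, a figure without a label would yield an empty label string rather than aborting the whole parse. Whether an empty string or `None` is the better return value for a missing label is a design choice for the maintainers.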