parse_pubmed_caption() failing on some papers
oblodgett opened this issue · 0 comments
When parsing certain files for image captions:
import pubmed_parser as pp
pubmed_figuredata = pp.parse_pubmed_caption("PMC9539395.nxml")
Fails with the following error:
_process.py
Traceback (most recent call last):
  File "test_process.py", line 18, in <module>
    pubmed_figuredata = pp.parse_pubmed_caption(paper_path)
  File "venv_sentence_parsing/lib/python3.8/site-packages/pubmed_parser/pubmed_oa_parser.py", line 425, in parse_pubmed_caption
    fig_label = stringify_children(fig.find("label"))
  File "venv_sentence_parsing/lib/python3.8/site-packages/pubmed_parser/utils.py", line 51, in stringify_children
    [node.text]
AttributeError: 'NoneType' object has no attribute 'text'
I would expect this file to parse without errors. Also, when parsing image captions, the sub-points listed under the caption label are missing from the output; see the same paper, PMC9539395, for an example.
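The traceback suggests that fig.find("label") returned None, which would happen if a <fig> element in this file has no <label> child. I have not confirmed that this is the cause for PMC9539395, so the following is only a minimal sketch of the kind of guard that would avoid the crash. It uses the standard library's xml.etree.ElementTree rather than pubmed_parser's own code, and the sample XML and the helper names text_or_empty and extract_captions are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second <fig> has no <label> child.
SAMPLE = """
<article>
  <fig id="f1"><label>Figure 1</label><caption><p>With a label.</p></caption></fig>
  <fig id="f2"><caption><p>No label element.</p></caption></fig>
</article>
""".strip()


def text_or_empty(node):
    """Return all text inside node, or an empty string if node is None."""
    if node is None:
        return ""
    return "".join(node.itertext()).strip()


def extract_captions(xml_string):
    """Collect id, label and caption text for every <fig> element."""
    root = ET.fromstring(xml_string)
    results = []
    for fig in root.iter("fig"):
        results.append(
            {
                "fig_id": fig.get("id"),
                # find() returns None when the child is absent, so guard it.
                "fig_label": text_or_empty(fig.find("label")),
                "fig_caption": text_or_empty(fig.find("caption")),
            }
        )
    return results


captions = extract_captions(SAMPLE)
for entry in captions:
    print(entry)
```

With a guard like this, a figure without a label would come back with an empty fig_label instead of raising AttributeError. Whether an empty string or None is the better return value for a missing label is a design choice for the maintainers.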