titipata/pubmed_parser

parse_pubmed_paragraph() sometimes misses paragraphs

zhao-zy15 opened this issue · 0 comments

Describe the bug
I am preparing a dataset that requires paragraph-level parsing of PMC_OA articles. When I parse the article with PMC ID PMC8075838, parse_pubmed_paragraph() returns only 7 paragraphs, although the article actually contains 12. Any ideas why? (I have checked the original XML file on my laptop, and no paragraph is missing from it.)
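One way to narrow this down is to count the `<p>` elements in the article body directly and compare that with the number of paragraphs that contain a bibliographic citation. The sketch below uses only the standard library and a tiny made-up XML snippet (`SAMPLE_XML` and `count_paragraphs` are hypothetical names, not part of pubmed_parser). It rests on an assumption worth verifying against the library's documentation: that parse_pubmed_paragraph() accepts an `all_paragraph` argument which defaults to `False` and, in that mode, returns only paragraphs that cite references. If that holds, a gap like 12 vs. 7 could simply be the uncited paragraphs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a PMC OA article body.
SAMPLE_XML = """<article>
  <body>
    <sec>
      <p>First paragraph citing prior work <xref ref-type="bibr" rid="B1">1</xref>.</p>
      <p>Second paragraph with no citation.</p>
      <p>Third paragraph, see <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
    </sec>
  </body>
</article>"""


def count_paragraphs(xml_text):
    """Return (all body paragraphs, paragraphs with a bibliographic xref)."""
    root = ET.fromstring(xml_text)
    # Every <p> element anywhere under <body>.
    paragraphs = root.findall("./body//p")
    # Keep only paragraphs that contain at least one <xref ref-type="bibr">.
    with_citation = [
        p for p in paragraphs
        if any(x.get("ref-type") == "bibr" for x in p.iter("xref"))
    ]
    return len(paragraphs), len(with_citation)


total, cited = count_paragraphs(SAMPLE_XML)
print(total, cited)
```

If the two counts for the real file match 12 and 7, calling parse_pubmed_paragraph() with `all_paragraph=True` (again, assuming that argument behaves as described above) would be the next thing to try.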
