ValueError when attempting to parse OA XML
mazzespazze opened this issue · 3 comments
Describe the bug
I downloaded the archive "oa_comm_xml.incr.2023-06-20.tar.gz" from https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/.
Full link: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/oa_comm_xml.incr.2023-06-20.tar.gz.
Python code:
import tarfile
import pubmed_parser as pp

# fileobj: binary file object for the downloaded .tar.gz (opened elsewhere)
tar = tarfile.open(fileobj=fileobj)
for i, member in enumerate(tar.getmembers()):
    f = tar.extractfile(member)
    stream = ""
    if f is not None:
        try:
            content = f.read().decode("utf-8")
            stream += content
        except UnicodeError:
            continue
        pmc_dict = pp.parse_pubmed_xml(stream)
Error:
    tree = etree.fromstring(path)
  File "src/lxml/etree.pyx", line 3254, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1908, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
To Reproduce
Try to parse the file linked above with parse_pubmed_xml.
Expected behavior
I was expecting a dictionary as in the other cases.
Dependencies
The ones on this package + tarfile and gzip.
Additional context
I want to parse each XML in https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/
Thanks for the details for reproducing the problem with code and data. Not all of the files in the …/xml/ folder are the same: some begin with an XML declaration and others start directly at the DOCTYPE.
head -2 *.xml
==> PMC9933422.xml <==
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article
==> PMC9942033.xml <==
<!DOCTYPE article
PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD with MathML3 v1.3 20210610//EN" "JATS-archivearticle1-3-mathml3.dtd">
I recommend un-tarring and then iterating over the XML files produced. The file inside the tar file that triggers the error is PMC9933422.xml.
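To see which kind of file you are dealing with before parsing, a check along these lines may help. It is only a sketch: `has_xml_declaration` and `XML_DECL` are hypothetical names made up here, not part of pubmed_parser, and the two sample strings are the file heads shown above.

```python
import re

# Matches an XML declaration at the very start of the text, e.g.
# <?xml version="1.0" encoding="UTF-8"?>
# The \s after "xml" keeps other processing instructions such as
# <?xml-stylesheet ...?> from matching.
XML_DECL = re.compile(r'^\s*<\?xml\s[^>]*\?>')


def has_xml_declaration(text: str) -> bool:
    """Return True if the text starts with an XML declaration."""
    return XML_DECL.match(text) is not None


# Heads of the two files from the `head -2` output above.
head_with_decl = '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE article\n'
head_without_decl = '<!DOCTYPE article\nPUBLIC "-//NLM//DTD JATS (Z39.96)'

print(has_xml_declaration(head_with_decl))
print(has_xml_declaration(head_without_decl))
```

Decoded strings for which this returns True are the ones lxml rejects with the ValueError quoted in the report.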
# import json
import pubmed_parser as pp
# treat this file as XML fragment
filename = 'PMC9942033.xml'
with open(filename, "rb") as file:
    content = file.read().decode("utf-8")
pmc_dict = pp.parse_pubmed_xml(content)
first_element = next(iter(pmc_dict))
print(f'{pmc_dict[first_element]}')
# output: Tuberculosis in older adults: case studies from four countries with rapidly ageing populations in the western pacific region
# print(json.dumps(pmc_dict, indent=4))
# treat this file as a complete XML file
filename = 'PMC9933422.xml'
pmc_dict = pp.parse_pubmed_xml(filename)
first_element = next(iter(pmc_dict))
print(f'{pmc_dict[first_element]}')
# output: Interventions for myopia control in children: a living systematic review and network meta‐analysis
Hopefully this helps.
I am now using a workaround: when the exception is raised, I write the full XML to a file and parse those files later.
I cannot really afford to unpack all the tar files due to space constraints. Is there a way to pass a file to the parser while it is still inside the tar.gz?
I see how space constraints are driving your approach. In-memory members read from the archive behave differently from an XML file on disk.
See here for an example unarchiving.
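If unpacking to disk is not an option, one possible approach is to read each member in memory and drop the leading XML declaration, so the decoded text looks like the "XML fragment" case above. This is a rough sketch using only the standard library; `iter_xml_fragments` and `_demo_archive` are hypothetical helpers written for illustration, and it assumes every member is UTF-8 encoded, as in the original report.

```python
import io
import re
import tarfile

# XML declaration at the very start of a document.
XML_DECL = re.compile(r'^\s*<\?xml\s[^>]*\?>')


def iter_xml_fragments(fileobj):
    """Yield (member_name, xml_text) for each .xml member of a tar archive.

    The archive is read from a seekable binary file object; nothing is
    written to disk. A leading XML declaration, if present, is removed.
    """
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".xml")):
                continue
            f = tar.extractfile(member)
            if f is None:
                continue
            try:
                text = f.read().decode("utf-8")  # assumption: UTF-8 content
            except UnicodeDecodeError:
                continue
            yield member.name, XML_DECL.sub("", text, count=1)


def _demo_archive():
    """Build a tiny in-memory .tar.gz with one file of each kind."""
    files = [
        ("a.xml", b'<?xml version="1.0" encoding="UTF-8"?>\n<article/>'),
        ("b.xml", b"<article/>"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, body in files:
            info = tarfile.TarInfo(name)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    buf.seek(0)
    return buf


fragments = dict(iter_xml_fragments(_demo_archive()))
for name, text in fragments.items():
    print(name, repr(text))
```

Each yielded string could then be passed to pp.parse_pubmed_xml(xml_text) in the same way as the PMC9942033.xml content above. That last step is an assumption based on that example and has not been checked against the full archive.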