AttributeError: 'NoneType' object has no attribute 'find'
jacobthill opened this issue · 1 comments
I am unsure what the problem is, but I keep getting the following error when trying to harvest a collection from Qatar Digital Library. I have to harvest through a whitelisted server, so unfortunately no one else will be able to test this, but I'm hoping someone has a better instinct about why I'm getting this error and, more importantly, how to avoid it. The last time I harvested these records there were more than 32k, but I keep getting this error on record number 18,108. I would like to skip this record (and any other record with a similar problem) and harvest the rest, but the script always stops on this record. Here is the complete error message:
Traceback (most recent call last):
  File "qnl-harvest.py", line 26, in <module>
    for count, record in enumerate(records, start=1):
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 52, in __next__
    return self.next()
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 151, in next
    self._next_response()
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 138, in _next_response
    super(OAIItemIterator, self)._next_response()
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 85, in _next_response
    error = self.oai_response.xml.find(
AttributeError: 'NoneType' object has no attribute 'find'
Here is my script:
import errno, os
from sickle import Sickle
from sickle.iterator import OAIResponseIterator

# where to write data to (relative to the dlme-harvest repo folder)
base_output_folder = 'output'

sickle = Sickle('https://api.qdl.qa/oaipmh')
print("Sickle instance created.") # status update

records = sickle.ListRecords(metadataPrefix='mods', ignore_deleted=True)
print("Records created.") # status update

directory = "output/qnl/data/"
os.makedirs(os.path.dirname(directory), exist_ok=True)

for count, record in enumerate(records, start=1):
    try:
        print("Record number " + str(count))
        out_file = 'output/qnl/data/qnl-{}.xml'.format(count)
        directory_name = os.path.dirname(out_file)
        with open(out_file, 'w') as f:
            f.write(record.raw)
    except Exception as err:
        print(err)
Cannot reproduce this because the OAI interface is restricted. I suspect that the interface returns an empty response, something like:
</>
Sickle uses an XML parser that forgives some flaws in the XML structure. With a response like this, the parsed result is None rather than an element, which is why the later call to .find() fails:
>>> from lxml import etree
>>> XMLParser = etree.XMLParser(remove_blank_text=True, recover=True, resolve_entities=False)
>>> type(etree.XML('</>', parser=XMLParser))
NoneType
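As for why the script stops even though it has a try/except: the traceback shows the error being raised on the `for` line itself, while the iterator fetches the next batch, and the try block only wraps the loop body. Below is a minimal sketch of driving the iterator by hand with next() so that such errors can be caught. `FlakyIterator` and `harvest` are hypothetical names made up for illustration; `FlakyIterator` is only a stand-in that fails once, not Sickle's real iterator. Whether a retry helps against the real endpoint is an assumption: if the server keeps returning the same empty response for the same resumption token, the retries will fail too, which is why the sketch caps them.

```python
class FlakyIterator:
    """Hypothetical stand-in for a record iterator that fails once while fetching."""

    def __init__(self, items, fail_at):
        self._items = list(items)
        self._pos = 0
        self._fail_at = fail_at
        self._failed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._items):
            raise StopIteration
        if self._pos == self._fail_at and not self._failed:
            self._failed = True  # fail only on the first attempt
            raise AttributeError("'NoneType' object has no attribute 'find'")
        item = self._items[self._pos]
        self._pos += 1
        return item


def harvest(records, max_retries=3):
    """Call next() explicitly so errors raised during fetching are caught."""
    harvested, errors = [], []
    retries = 0
    while True:
        try:
            record = next(records)
        except StopIteration:
            break  # normal end of the harvest
        except AttributeError as err:
            errors.append(str(err))
            retries += 1
            if retries > max_retries:
                break  # give up instead of looping forever
            continue
        retries = 0  # reset the counter after every successful fetch
        harvested.append(record)
    return harvested, errors


harvested, errors = harvest(FlakyIterator(["a", "b", "c", "d"], fail_at=2))
print(len(harvested), "records harvested,", len(errors), "error(s) caught")
```

In the real script, `records` from `sickle.ListRecords(...)` would take the place of `FlakyIterator`, and the file-writing code would go where `harvested.append(record)` is.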