harvard-lil/capstone

Errors in cap-static export

bensteinberg opened this issue · 1 comment

I see a few different errors in today's export logs. I don't think they were introduced in #2192 (the proximate reason for re-running the export), but I no longer have the logs from the previous export to confirm that. FWIW, these errors appear to have killed the export process early, which I don't think happened before: the redacted export from this run only reached 6.4G, which I believe is about half of what it should be.

The errors I see look like

[2024-02-05 15:19:00,087: ERROR/ForkPoolWorker-3] Task scripts.export_cap_static.export_cases_by_volume[69011773-07dd-4cdf-8918-efbf0babfb36] raised unexpected: TypeError('sequence item 1: expected str instance, NoneType found')
Traceback (most recent call last):
  File "/home/capstone/capstone-prod/lib/python3.7/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/capstone/capstone-prod/lib/python3.7/site-packages/celery/app/trace.py", line 648, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/share/capstone-prod/capstone/scripts/export_cap_static.py", line 109, in export_cases_by_volume
    export_volume(volume, dest_dir / "redacted")
  File "/usr/local/share/capstone-prod/capstone/scripts/export_cap_static.py", line 241, in export_volume
    el.attrib["data-case-paths"] = ",".join(el_case_paths)
TypeError: sequence item 1: expected str instance, NoneType found

and

[2024-02-05 15:19:30,483: ERROR/ForkPoolWorker-18] Task scripts.export_cap_static.export_cases_by_volume[60ad3c4e-0104-4dec-b5bb-4edb5862a1f2] raised unexpected: TypeError("Argument must be bytes or unicode, got 'NoneType'")
Traceback (most recent call last):
  File "/home/capstone/capstone-prod/lib/python3.7/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/capstone/capstone-prod/lib/python3.7/site-packages/celery/app/trace.py", line 648, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/share/capstone-prod/capstone/scripts/export_cap_static.py", line 109, in export_cases_by_volume
    export_volume(volume, dest_dir / "redacted")
  File "/usr/local/share/capstone-prod/capstone/scripts/export_cap_static.py", line 240, in export_volume
    el.attrib["href"] = el_case_paths[0]
  File "src/lxml/etree.pyx", line 2446, in lxml.etree._Attrib.__setitem__
  File "src/lxml/apihelpers.pxi", line 594, in lxml.etree._setAttributeValue
  File "src/lxml/apihelpers.pxi", line 1539, in lxml.etree._utf8
TypeError: Argument must be bytes or unicode, got 'NoneType'
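
Both TypeErrors above are consistent with `el_case_paths` containing `None` entries: `",".join(...)` rejects a `None` item, and lxml rejects `None` as an attribute value. Below is a minimal sketch of a guard, assuming `el_case_paths` is a list of optional strings. `link_attrs` and the example paths are hypothetical and are not the code in `export_cap_static.py`.

```python
from typing import Dict, Iterable, Optional


def link_attrs(el_case_paths: Iterable[Optional[str]]) -> Dict[str, str]:
    """Build href / data-case-paths values, skipping missing paths.

    Hypothetical helper: assumes el_case_paths may contain None for cases
    whose static path could not be resolved.
    """
    paths = [p for p in el_case_paths if p]  # drop None and empty strings
    if not paths:
        return {}  # nothing to link; the caller leaves the element unchanged
    return {"href": paths[0], "data-case-paths": ",".join(paths)}


# Reproduces the first error: str.join rejects None items.
try:
    ",".join(["/a/1/", None])
except TypeError as exc:
    print(exc)  # sequence item 1: expected str instance, NoneType found

print(link_attrs(["/a/1/", None]))
print(link_attrs([None]))
```

Dropping `None` entries keeps the export running, but it may also hide whatever is producing the missing paths, so logging the affected volume before filtering is probably worthwhile.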

and occasionally

[2024-02-05 15:34:55,409: ERROR/ForkPoolWorker-96] Task scripts.export_cap_static.export_cases_by_volume[b39a7b36-0882-40f8-b181-e5ca56d41118] raised unexpected: KeyError('href')
Traceback (most recent call last):
  File "/home/capstone/capstone-prod/lib/python3.7/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/capstone/capstone-prod/lib/python3.7/site-packages/celery/app/trace.py", line 648, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/share/capstone-prod/capstone/scripts/export_cap_static.py", line 109, in export_cases_by_volume
    export_volume(volume, dest_dir / "redacted")
  File "/usr/local/share/capstone-prod/capstone/scripts/export_cap_static.py", line 242, in export_volume
    elif "/citations/?q=" in el.attrib["href"]:
  File "src/lxml/etree.pyx", line 2496, in lxml.etree._Attrib.__getitem__
KeyError: 'href'
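
The `KeyError` suggests that some `a.citation` elements have no `href` attribute at all. A minimal sketch of a tolerant check follows, using the standard library's `xml.etree.ElementTree` as a stand-in for lxml (lxml's `attrib` also provides `.get()`). `is_citation_search_link` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_citation_search_link(el: ET.Element) -> bool:
    """True if the element links to a /citations/?q= search URL."""
    # .get returns None instead of raising when the attribute is absent
    href = el.attrib.get("href")
    return href is not None and "/citations/?q=" in href


with_href = ET.fromstring(
    '<a class="citation" href="/citations/?q=1+U.S.+1">1 U.S. 1</a>'
)
without_href = ET.fromstring('<a class="citation">1 U.S. 1</a>')

print(is_citation_search_link(with_href))
print(is_citation_search_link(without_href))
```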

I deployed #2198 and am rerunning exports; I still see errors, though they're in different places:

[2024-02-06 18:21:06,413: ERROR/ForkPoolWorker-58] Task scripts.export_cap_static.export_cases_by_volume[8442323f-6786-43e9-9fa7-dac3cc961cac] raised unexpected: TypeError('sequence item 1: expected str instance, NoneType found')
Traceback (most recent call last):
  File "/home/capstone/capstone-prod/lib/python3.7/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/capstone/capstone-prod/lib/python3.7/site-packages/celery/app/trace.py", line 648, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/share/capstone-prod/capstone/scripts/export_cap_static.py", line 109, in export_cases_by_volume
    volume = VolumeMetadata.objects.select_related("reporter").get(pk=volume_id)
  File "/usr/local/share/capstone-prod/capstone/scripts/export_cap_static.py", line 241, in export_volume
    # handle citations to documents outside our collection
TypeError: sequence item 1: expected str instance, NoneType found
[2024-02-06 18:21:03,919: ERROR/ForkPoolWorker-63] Task scripts.export_cap_static.export_cases_by_volume[33025837-b921-4e29-a3a4-18207222f25c] raised unexpected: TypeError("Argument must be bytes or unicode, got 'NoneType'")
Traceback (most recent call last):
  File "/home/capstone/capstone-prod/lib/python3.7/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/capstone/capstone-prod/lib/python3.7/site-packages/celery/app/trace.py", line 648, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/share/capstone-prod/capstone/scripts/export_cap_static.py", line 109, in export_cases_by_volume
    volume = VolumeMetadata.objects.select_related("reporter").get(pk=volume_id)
  File "/usr/local/share/capstone-prod/capstone/scripts/export_cap_static.py", line 240, in export_volume
    for el in pq_html("a.citation"):
  File "src/lxml/etree.pyx", line 2446, in lxml.etree._Attrib.__setitem__
  File "src/lxml/apihelpers.pxi", line 594, in lxml.etree._setAttributeValue
  File "src/lxml/apihelpers.pxi", line 1539, in lxml.etree._utf8
TypeError: Argument must be bytes or unicode, got 'NoneType'
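
One thing worth checking, offered as a guess rather than a diagnosis: the line numbers in these tracebacks (109, 240, 241) match the first run, and only the quoted source text differs. Python looks up the source text on disk when it formats a traceback, so a worker still running pre-deploy code would print post-deploy lines next to the old line numbers. A minimal sketch of that behavior follows. `demo_task.py` is a hypothetical throwaway module, not part of capstone.

```python
import importlib.util
import pathlib
import tempfile
import traceback


def stale_traceback_demo() -> str:
    """Return a traceback whose quoted source line comes from the edited file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "demo_task.py"  # hypothetical module
        # "Old" code: line 2 is the statement that actually fails.
        path.write_text("def run():\n    return ','.join(['a', None])\n")
        spec = importlib.util.spec_from_file_location("demo_task", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # the process now holds the old code

        # "Deploy": the file changes on disk, the loaded module does not.
        path.write_text(
            "def run():\n    # a comment now sits on line 2\n    return ''\n"
        )

        try:
            module.run()  # still executes the old line 2
        except TypeError:
            # Source text is looked up on disk at formatting time.
            return traceback.format_exc()
    return ""


print(stale_traceback_demo())
```

If that is what happened here, restarting the Celery workers after the deploy and re-running the export should show whether #2198 changed the behavior.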