Extract fails due to IndexError in _get_bullet_string method when the document is created in Pages (MacOS App)
raiyankamal opened this issue · 5 comments
I have seen this happening for files created in Pages but not in files created in MSWord.
How to reproduce
- Use Pages (macOS app) to write a document
- Make sure to include some bulleted and/or numbered text
- Save the document as docx
- Attempt to extract using docx2python
Error encountered
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/app/extractor/tasks.py", line 6, in ingest_docx_task
    return ingest_docx(file, upload_id)
  File "/app/extractor/lib/ingest.py", line 23, in ingest_docx
    content = extract_document_content(local_file_path)
  File "/app/extractor/lib/extract.py", line 150, in extract_document_content
    return _extract_using_docx2python(local_file_path)
  File "/app/extractor/lib/extract.py", line 130, in _extract_using_docx2python
    doc = docx2python(local_file_path)
  File "/usr/local/lib/python3.8/site-packages/docx2python/main.py", line 61, in docx2python
    body = file_text(context["officeDocument"])
  File "/usr/local/lib/python3.8/site-packages/docx2python/main.py", line 56, in file_text
    return get_text(unzipped, context)
  File "/usr/local/lib/python3.8/site-packages/docx2python/docx_text.py", line 264, in get_text
    branches(ElementTree.fromstring(xml))
  File "/usr/local/lib/python3.8/site-packages/docx2python/docx_text.py", line 248, in branches
    branches(child)
  File "/usr/local/lib/python3.8/site-packages/docx2python/docx_text.py", line 185, in branches
    tables.insert(_get_bullet_string(child, context))
  File "/usr/local/lib/python3.8/site-packages/docx2python/docx_text.py", line 105, in _get_bullet_string
    numFmt = context["numId2numFmts"][numId][int(ilvl)]
IndexError: list index out of range
Additional information
It seems Pages is adding abstractNum nodes that don't contain w:lvl nodes. For example:
<w:abstractNum w:abstractNumId="0">
<w:multiLevelType w:val="hybridMultilevel"/>
<w:numStyleLink w:val="Numbered"/>
</w:abstractNum>
collect_numFmts (from docx_context.py) then reads and stores these in the context as [].
docx2python/docx2python/docx_context.py
Line 79 in ab8747c
This context is then passed down to _get_bullet_string (from docx_text.py). The IndexError is then raised when we try to get the number format from the context.
docx2python/docx2python/docx_text.py
Line 102 in 07516f2
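The failure described above can be reproduced without docx2python. What follows is a minimal sketch using only the standard library, not the library's actual code: the `collect_formats` helper and the `w:numbering` wrapper element are assumptions made for illustration, and the sketch keys the result on abstractNumId for simplicity, whereas the traceback shows docx2python looking up by numId.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the "w:" prefix in docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# The abstractNum node from the Pages file, wrapped in a root element
# (assumed here) so that the "w:" prefix is declared.
NUMBERING_XML = f"""<w:numbering xmlns:w="{W_NS}">
  <w:abstractNum w:abstractNumId="0">
    <w:multiLevelType w:val="hybridMultilevel"/>
    <w:numStyleLink w:val="Numbered"/>
  </w:abstractNum>
</w:numbering>"""


def collect_formats(numbering_xml):
    """Hypothetical helper: map each abstractNumId to the numFmt values of its w:lvl children."""
    root = ET.fromstring(numbering_xml)
    formats = {}
    for abstract_num in root.iter(W + "abstractNum"):
        num_id = abstract_num.get(W + "abstractNumId")
        formats[num_id] = [
            fmt.get(W + "val")
            for lvl in abstract_num.iter(W + "lvl")
            for fmt in lvl.iter(W + "numFmt")
        ]
    return formats


formats = collect_formats(NUMBERING_XML)
print(formats)  # {'0': []} because the node has no w:lvl children

try:
    formats["0"][0]  # same shape as context["numId2numFmts"][numId][int(ilvl)]
except IndexError as exc:
    print("IndexError:", exc)  # IndexError: list index out of range
```

Since the node has no w:lvl children, the list stored for it is empty, and any level index is out of range.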
Hello @ShayHill,
sure I can. I don't see any option to attach a file in the comments. Can I email one to you?
Edit: I sent you two files for testing. Please let me know if there's anything else I can do to help. Thanks.
That's perfect. Thank you for the files. I couldn't get a perfect fix (see below), but the files should process now.
---- version 1.26 - 201005 Continue (with bullet) when numbering-format lookup fails
Some documents created in Pages use a different indexing scheme to specify numbered-list formats and values. I cannot
infer formats or values from such files without potentially changing existing (working) behavior. The previous behavior in such
cases was to fail with an IndexError. v1.26 will now replace any numbering format with a "bullet" (--) when the format
or value cannot be inferred.
This will only happen where the program would previously have failed with an IndexError, so no previous behavior (which
allowed the program to complete) has been altered.
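The fallback described in these release notes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code shipped in v1.26: the `safe_num_format` name is hypothetical, and it assumes that a "bullet" format is what ends up rendered as "--".

```python
def safe_num_format(context, num_id, ilvl):
    """Hypothetical helper: return the numbering format for a list level.

    Falls back to "bullet" only where the lookup would previously have
    raised an IndexError, so lookups that already worked are unchanged.
    """
    try:
        return context["numId2numFmts"][num_id][int(ilvl)]
    except IndexError:
        # No w:lvl entries were collected for this numbering definition.
        return "bullet"


# A Pages-style definition with no levels, and a regular one for comparison.
context = {"numId2numFmts": {"1": [], "2": ["decimal", "lowerLetter"]}}

print(safe_num_format(context, "1", "0"))  # bullet
print(safe_num_format(context, "2", "1"))  # lowerLetter
```

Catching only IndexError keeps the change narrow: any other failure still surfaces as before.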
Thanks for fixing this super fast!