Extract fails due to IndexError in _get_bullet_string method when the document is created in Pages (MacOS App)

I have seen this happening for files created in Pages but not in files created in MSWord.

How to reproduce

Use Pages (MacOS app) to write a document.
Make sure to include some bulleted and/or numbered text.
Save the document as docx.
Attempt to extract using docx2python.

Error encountered

  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/app/extractor/tasks.py", line 6, in ingest_docx_task
    return ingest_docx(file, upload_id)
  File "/app/extractor/lib/ingest.py", line 23, in ingest_docx
    content = extract_document_content(local_file_path)
  File "/app/extractor/lib/extract.py", line 150, in extract_document_content
    return _extract_using_docx2python(local_file_path)
  File "/app/extractor/lib/extract.py", line 130, in _extract_using_docx2python
    doc = docx2python(local_file_path)
  File "/usr/local/lib/python3.8/site-packages/docx2python/main.py", line 61, in docx2python
    body = file_text(context["officeDocument"])
  File "/usr/local/lib/python3.8/site-packages/docx2python/main.py", line 56, in file_text
    return get_text(unzipped, context)
  File "/usr/local/lib/python3.8/site-packages/docx2python/docx_text.py", line 264, in get_text
    branches(ElementTree.fromstring(xml))
  File "/usr/local/lib/python3.8/site-packages/docx2python/docx_text.py", line 248, in branches
    branches(child)
  File "/usr/local/lib/python3.8/site-packages/docx2python/docx_text.py", line 185, in branches
    tables.insert(_get_bullet_string(child, context))
  File "/usr/local/lib/python3.8/site-packages/docx2python/docx_text.py", line 105, in _get_bullet_string
    numFmt = context["numId2numFmts"][numId][int(ilvl)]
IndexError: list index out of range

Additional information

It seems Pages is adding abstractNum nodes that don't contain w:lvl nodes. For example:

    <w:abstractNum w:abstractNumId="0">
        <w:multiLevelType w:val="hybridMultilevel"/>
        <w:numStyleLink w:val="Numbered"/>
    </w:abstractNum>

collect_numFmts (from docx_context.py) then reads and stores these in the context as [].

docx2python/docx2python/docx_context.py

Line 79 in ab8747c

abstractNumId2numFmts[id_] = []

This context is then passed down to _get_bullet_string (from docx_text.py). The IndexError is raised there when we try to get the number format from the context, because the list stored for that abstractNum is empty.

docx2python/docx2python/docx_text.py

Line 102 in 07516f2

numFmt = context["numId2numFmts"][numId][int(ilvl)]
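To illustrate the chain described above, here is a minimal sketch that reproduces the failure outside docx2python. It is not the library's code: `collect_level_formats` is a hypothetical stand-in for collect_numFmts, and the wrapping `w:numbering` element and namespace declaration are assumed so that the fragment parses on its own.

```python
from xml.etree import ElementTree

# WordprocessingML namespace used by the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The abstractNum node from the report, wrapped in an assumed root element.
NUMBERING_XML = f"""
<w:numbering xmlns:w="{W}">
    <w:abstractNum w:abstractNumId="0">
        <w:multiLevelType w:val="hybridMultilevel"/>
        <w:numStyleLink w:val="Numbered"/>
    </w:abstractNum>
</w:numbering>
"""


def collect_level_formats(numbering_xml):
    """Hypothetical stand-in for collect_numFmts: map each
    abstractNumId to the list of numFmt values found under it."""
    root = ElementTree.fromstring(numbering_xml)
    formats = {}
    for abstract_num in root.iter(f"{{{W}}}abstractNum"):
        id_ = abstract_num.attrib[f"{{{W}}}abstractNumId"]
        # No w:lvl children means no w:numFmt nodes, so this list is empty.
        formats[id_] = [
            num_fmt.attrib[f"{{{W}}}val"]
            for num_fmt in abstract_num.iter(f"{{{W}}}numFmt")
        ]
    return formats


formats = collect_level_formats(NUMBERING_XML)

# Indexing the empty list by indentation level raises the reported error.
try:
    formats["0"][0]
    raised = False
except IndexError:
    raised = True
```

With the Pages-style node, `formats` holds an empty list for id "0", so any `ilvl` lookup fails regardless of its value.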

Should be an easy upgrade. Can you supply such a file?

Hello @ShayHill,
Sure, I can. I don't see any option to attach a file in the comments. Can I email one to you?

Edit: I sent you two files for testing. Please let me know if there's anything else I can do to help. Thanks.

That's perfect. Thank you for the files. I couldn't get a perfect fix (see below), but the files should process now.

---- version 1.26 - 201005 Continue (with bullet) when numbering-format lookup fails

Some documents created in Pages use a different indexing scheme to specify numbered-list formats and values. I cannot
infer formats or values from such files without potentially changing existing (working) behavior. The previous behavior in such
cases was to fail with an IndexError. v1.26 will now replace any numbering format with a "bullet" (--) when the format
or value cannot be inferred.

This will only happen where the program would previously have failed with an IndexError, so no previous behavior (which
allowed the program to complete) has been altered.
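The fallback described in these release notes could look roughly like the sketch below. This is an assumption about the shape of the change, not the actual v1.26 code: `get_number_format` and its `default` parameter are hypothetical names, and the only behavior taken from the notes is that a failed lookup yields a bullet instead of an exception.

```python
def get_number_format(num_id_to_formats, num_id, ilvl, default="bullet"):
    """Return the numbering format for a list item, or `default`
    when the format cannot be looked up (hypothetical helper)."""
    try:
        return num_id_to_formats[num_id][int(ilvl)]
    except (KeyError, IndexError):
        # Lookups that previously crashed now degrade to a bullet.
        return default


# A Word-style entry resolves normally.
word_style = get_number_format({"1": ["decimal", "lowerLetter"]}, "1", "1")

# A Pages-style entry (empty format list) falls back to the default.
pages_style = get_number_format({"1": []}, "1", "0")
```

Because the fallback is only reached on lookups that used to raise, documents that already processed correctly take the same path as before.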

Thanks for fixing this super fast!

Thank you for pointing it out and providing a sample file. I had about 5500 documents to test with, but they were all generated in Windows.