ShayHill/docx2python

Custom Document Number Formats

medha-hegde opened this issue · 4 comments

Hi,

I'm processing .docx files which contain "Document Number Formats":
[image]

When I process the file with docx2python, the custom numbering shows up either as an ordinary bulleted list marked with -- or not as a list at all:

Word doc:
[image]

Code output:
[image]

Is there a way to extract the custom Document Number Formatting?

Can you provide an example document?

I have tried updating docx2python to identify numbered lists that only exist inside style definitions, but the output has issues:

1)\t\n
1)\tQuestion text
1)\tJa
2)\tNej 
--\t<977 fixed xor>\t\t\t\t Ved ikke
--\t
--\t
--\t
3)\t Question
4)\tHello
--\t<fixed> \t
--\t
 

 
2)\t\n
  1. docx2python does not understand exotic number formats, only decimal, lowerLetter, upperLetter, lowerRoman, upperRoman, and bullet. This is intentional: it avoids extracting unprintable characters into plain text. As a result, the numbering for questions and answers looks the same.

  2. There are "extra" numbers. This is due to the format of the test: some paragraphs have a numbered style but no content. In Word they are invisible, but the extraction reveals them. The same applies to the extra bullets.
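
To make point 1 concrete, here is a rough sketch of how a list counter could be rendered for the six formats named above. It is not docx2python's actual code: `render_number`, `to_roman`, and `to_letters` are hypothetical names, the decimal fallback for unrecognised formats is an assumption, and so is the letter scheme past 26 (repeating the letter, as in aa, bb). The `--` bullet marker matches the output shown above.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to an upper-case Roman numeral."""
    pairs = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n: int) -> str:
    """1 -> a, 26 -> z, 27 -> aa, 28 -> bb (assumed repeat-letter scheme)."""
    index = n - 1
    return chr(ord("a") + index % 26) * (index // 26 + 1)


def render_number(fmt: str, n: int) -> str:
    """Render counter n for a numFmt value; hypothetical helper."""
    if fmt == "bullet":
        return "--"
    if fmt == "lowerLetter":
        return to_letters(n)
    if fmt == "upperLetter":
        return to_letters(n).upper()
    if fmt == "lowerRoman":
        return to_roman(n).lower()
    if fmt == "upperRoman":
        return to_roman(n)
    # "decimal" and, by assumption, any format this sketch does not know.
    return str(n)
```

With this fallback, `render_number("lowerRoman", 4)` gives `"iv"`, while an unrecognised format such as `"decimalZero"` renders like plain decimal, which is one way question and answer numbering can end up looking the same.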

All of these extra characters are valid, but they break half of my tests. They reveal a lot of the "gremlins" that live in docx files. However, if it's useful to you, I can implement this as a keyword switch in the docx2python function and deploy it later this week.
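
If the empty numbered paragraphs are unwanted, they could also be filtered out after extraction. A minimal sketch, assuming the extracted text is a plain string with one paragraph per line, real tab characters after each marker, and markers shaped like those in the output above (`--`, `1)`, `a)`). `drop_empty_list_items` is a hypothetical helper, not part of docx2python.

```python
import re

# A line that holds only a list marker and whitespace:
# "--", digits followed by ")", or letters followed by ")".
_EMPTY_ITEM = re.compile(r"^\s*(?:--|\d+\)|[A-Za-z]+\))\s*$")


def drop_empty_list_items(text: str) -> str:
    """Remove lines that consist of a list marker with no content."""
    kept = [line for line in text.split("\n") if not _EMPTY_ITEM.match(line)]
    return "\n".join(kept)


sample = "1)\t\n1)\tQuestion text\n--\t\n--\t<fixed> \t\n2)\tNej "
cleaned = drop_empty_list_items(sample)
```

Note that this also drops a paragraph whose entire content happens to be something like `a)`, so it is only suitable when such lines cannot be real text.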