Custom Document Number Formats

Question

medha-hegde opened this issue 5 months ago · 4 comments

medha-hegde commented 5 months ago

Hi,

I'm processing .docx files which contain "Document Number Formats":

When I process the file with this code, the numbering shows up as an ordinary bulleted list with -- markers, or not as a list at all:

Word doc:

Code output:

Is there a way to extract the custom Document Number Formatting?

Answer 1 · 2024-10-03T20:52:16.000Z

I will have a look.

Answer 2 · 2024-10-05T18:58:33.000Z

Can you provide an example document?

Answer 3 · 2024-10-07T07:37:03.000Z

Sure!
word2text.docx

Answer 4 · 2024-10-08T03:56:39.000Z

I have tried updating docx2python to identify numbered lists that only exist inside style definitions, but there are issues.

1)\t\n
1)\tQuestion text
1)\tJa
2)\tNej 
--\t<977 fixed xor>\t\t\t\t Ved ikke
--\t
--\t
--\t
3)\t Question
4)\tHello
--\t<fixed> \t
--\t
 

 
2)\t\n
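
For readers who want to check whether a document's numbering lives only in its style definitions, here is a minimal sketch that reads word/styles.xml straight from the .docx archive and lists the styles carrying a w:numPr element. The function name styles_with_numbering is hypothetical, and this is not how docx2python does it internally; it only relies on the standard WordprocessingML layout.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by styles.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def styles_with_numbering(docx_file):
    """Return {styleId: (numId, ilvl)} for styles that define w:numPr.

    Hypothetical helper for illustration; docx_file is a path or a
    binary file-like object containing a .docx archive.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/styles.xml"))

    found = {}
    for style in root.findall("w:style", NS):
        num_pr = style.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # this style carries no numbering
        num_id = num_pr.find("w:numId", NS)
        ilvl = num_pr.find("w:ilvl", NS)
        found[style.get(f"{{{W}}}styleId")] = (
            num_id.get(f"{{{W}}}val") if num_id is not None else None,
            # a missing w:ilvl is treated as level 0 here (assumption)
            ilvl.get(f"{{{W}}}val") if ilvl is not None else "0",
        )
    return found
```

The returned numId values can then be looked up in word/numbering.xml to find the actual number format for each level.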

Docx2Python does not understand exotic number formats, only decimal, lowerLetter, upperLetter, lowerRoman, upperRoman, and bullet. This is intentional, to avoid extracting unprintable characters into plain text. As a result, the numbering for questions and answers looks the same.
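
To make the six supported formats concrete, here is a rough sketch of how a counter could be rendered for each one. This is not docx2python's actual code: format_number and its fallback argument are hypothetical, the letter scheme (z, then aa, bb, ...) is an assumption that mirrors how Word displays long lists, and what the library really does with an unrecognized format is not shown in this thread.

```python
def to_roman(n):
    """Convert a positive integer to an uppercase Roman numeral."""
    pairs = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def to_letters(n):
    """Render 1 -> a, 26 -> z, 27 -> aa, 28 -> bb (assumed Word-like scheme)."""
    letter = chr(ord("a") + (n - 1) % 26)
    return letter * ((n - 1) // 26 + 1)


def format_number(n, num_fmt, fallback="--"):
    """Render the n-th list item for a given w:numFmt value.

    Hypothetical helper; anything outside the six formats named in the
    thread falls back to a plain marker (an assumption).
    """
    if num_fmt == "decimal":
        return str(n)
    if num_fmt == "lowerLetter":
        return to_letters(n)
    if num_fmt == "upperLetter":
        return to_letters(n).upper()
    if num_fmt == "lowerRoman":
        return to_roman(n).lower()
    if num_fmt == "upperRoman":
        return to_roman(n)
    # "bullet" and every exotic format end up here
    return fallback
```

For example, format_number(4, "lowerRoman") gives "iv", while an exotic value such as "chicago" gives the fallback marker.
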
There are also "extra" numbers. This is due to the format of the test. Some paragraphs in the text have a numbered style but no content. In Word they are invisible, but the extraction reveals them. The same applies to the extra bullets.
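
If the empty items are unwanted, they can be filtered out after extraction. The sketch below assumes the extracted text has been split into lines shaped like the output above (a marker such as 1) or --, a tab, then the content); drop_empty_list_items is a hypothetical helper, not part of docx2python.

```python
import re

# A list marker ("--", "1)", "a.", ...) followed by a tab and nothing
# but whitespace: a numbered or bulleted paragraph with no content.
EMPTY_ITEM = re.compile(r"^(?:--|\w+[.)])\t\s*$")


def drop_empty_list_items(lines):
    """Return the lines with content-free list items removed."""
    return [line for line in lines if not EMPTY_ITEM.match(line)]
```

Note that this also discards the numbers those empty paragraphs consumed, so the remaining items keep the numbering they were extracted with.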

All of these extra characters are valid, but they break half of my tests. They reveal a lot of the "gremlins" that live in docx files. However, if it's useful to you, I can implement this as a keyword switch in the docx2python function and deploy it later this week.