Extract docx headers, footers, text, footnotes, endnotes, properties, and images to a Python object.
The code is an expansion/contraction of python-docx2txt (Copyright (c) 2015 Ankush Shah). The original code is mostly gone, but some of the bones may still be here.
shared features:
- extracts text from docx files
- extracts images from docx files
- no dependencies (docx2python requires pytest to test)
additions:
- extracts footnotes and endnotes
- converts bullets and numbered lists to ascii with indentation
- converts hyperlinks to <a href="http:/...">link text</a>
- retains some structure of the original file (more below)
- extracts document properties (creator, lastModifiedBy, etc.)
- inserts image placeholders in text ('----image1.jpg----')
- inserts plain text footnote and endnote references in text ('----footnote1----')
- (optionally) retains font size, font color, bold, italics, and underline as html
- extracts user selections from checkboxes and dropdown menus
- full test coverage and documentation for developers
subtractions:
- no command-line interface
- will only work with Python 3.4+
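The placeholders listed under additions are plain text, so they can be located with a regular expression. The sketch below assumes the placeholders look exactly like the two examples shown ('----image1.jpg----' and '----footnote1----'); the '----endnoteN----' form and the helper name find_placeholders are assumptions for illustration, not part of the package.

```python
import re

# Assumption: placeholders follow the forms shown above, e.g.
# '----image1.jpg----' and '----footnote1----'. The endnote form is a guess
# by analogy with the footnote form.
PLACEHOLDER = re.compile(r"----(image\d+\.\w+|footnote\d+|endnote\d+)----")


def find_placeholders(text):
    """Return the name inside every placeholder, in order of appearance."""
    return PLACEHOLDER.findall(text)


sample = "See the chart ----image1.jpg---- and the note ----footnote1----."
print(find_placeholders(sample))
```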
pip install docx2python
from docx2python import docx2python
# extract docx content
docx2python('path/to/file.docx')
# extract docx content, write images to image_directory
docx2python('path/to/file.docx', 'path/to/image_directory')
# extract docx content, ignore images
docx2python('path/to/file.docx', extract_image=False)
# extract docx content with basic font styles converted to html
docx2python('path/to/file.docx', html=True)
Note on html feature:
- font size, font color, bold, italics, and underline supported
- hyperlinks will always be exported as html (<a href="http:/...">link text</a>), even if export_font_style=False, because I couldn't think of a more canonical representation.
- every tag open in a paragraph will be closed in that paragraph (and, where appropriate, reopened in the next paragraph). If two subsequent paragraphs are bold, they will be returned as <b>paragraph 1</b>, <b>paragraph 2</b>. This is intentional to make each paragraph its own entity.
- if you specify export_font_style=True, > and < in your docx text will be encoded as &gt; and &lt;
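To make the last two notes concrete, here is a minimal sketch using only the standard library. It is not the package's implementation; bold_paragraphs is a hypothetical helper that escapes each paragraph and wraps it in its own tag pair, so no tag stays open across paragraphs. Note that html.escape also encodes & as &amp;, which may or may not match what the package does.

```python
from html import escape


def bold_paragraphs(paragraphs):
    """Escape each paragraph, then wrap it in its own <b>...</b> pair."""
    # quote=False leaves quotation marks alone; <, > and & are encoded.
    return ["<b>{}</b>".format(escape(p, quote=False)) for p in paragraphs]


print(bold_paragraphs(["paragraph 1", "1 < 2 > 0"]))
# ['<b>paragraph 1</b>', '<b>1 &lt; 2 &gt; 0</b>']
```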
Function docx2python returns an object with several attributes.
header - contents of the docx headers in the return format described herein
footer - contents of the docx footers in the return format described herein
body - contents of the docx in the return format described herein
footnotes - contents of the docx footnotes in the return format described herein
endnotes - contents of the docx endnotes in the return format described herein
document - header + body + footer (read only)
text - all docx text as one string, similar to what you'd get from python-docx2txt
properties - docx property names mapped to values (e.g., {"lastModifiedBy": "Shay Hill"})
images - image names mapped to images in binary format. Write to filesystem with
for name, image in result.images.items():
    with open(name, 'wb') as image_destination:
        image_destination.write(image)
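If you would rather collect the images in one directory, something like the following could work. This is a sketch under assumptions: write_images is a hypothetical helper, and fake_images stands in for result.images (a mapping of image names to bytes, as described above).

```python
import tempfile
from pathlib import Path


def write_images(images, directory):
    """Write each name -> bytes pair into directory; return the paths written."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in images.items():
        # Path(name).name drops any directory part of the image name.
        path = target / Path(name).name
        path.write_bytes(data)
        written.append(path)
    return written


# Stand-in for result.images (made-up bytes, not a real image).
fake_images = {"image1.jpg": b"\xff\xd8\xff"}
with tempfile.TemporaryDirectory() as tmp:
    paths = write_images(fake_images, tmp)
    print([p.name for p in paths])
```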
Some structure will be maintained. Text will be returned in a nested list, with paragraphs always at depth 4 (i.e., output.body[i][j][k][l] will be a paragraph).
If your docx has no tables, output.body will appear as one table with all contents in one cell:
[  # document
    [  # table
        [  # row
            [  # cell
                "Paragraph 1",
                "Paragraph 2",
                "-- bulleted list",
                "-- continuing bulleted list",
                "1) numbered list",
                "2) continuing numbered list",
                " a) sublist",
                " i) sublist of sublist",
                "3) keeps track of indention levels",
                " a) resets sublist counters"
            ]
        ]
    ]
]
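Because paragraphs always sit at depth 4, four nested loops are enough to walk every paragraph in order. A minimal sketch (iter_paragraphs is a hypothetical name, and body is a hand-written stand-in for output.body):

```python
def iter_paragraphs(tables):
    """Yield every paragraph in a depth-4 nested list, in document order."""
    for table in tables:
        for row in table:
            for cell in row:
                for paragraph in cell:
                    yield paragraph


# Hand-written stand-in for output.body: one table, one row, two cells.
body = [[[["Paragraph 1", "Paragraph 2"], ["-- bulleted list"]]]]
print(list(iter_paragraphs(body)))
# ['Paragraph 1', 'Paragraph 2', '-- bulleted list']
```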
Table cells will appear as table cells. Text outside tables will appear as table cells.
To preserve the even depth (text always at depth 4), nested tables will appear as new, top-level tables. This is clearer with an example:
# docx structure
[  # document
    [  # table A
        [  # table A row
            [  # table A cell 1
                "paragraph in table A cell 1"
            ],
            [  # nested table B
                [  # table B row
                    [  # table B cell
                        "paragraph in table B"
                    ]
                ]
            ],
            [  # table A cell 2
                "paragraph in table A cell 2"
            ]
        ]
    ]
]
becomes ...
[  # document
    [  # table A
        [  # row in table A
            [  # cell in table A
                "table A cell 1"
            ]
        ]
    ],
    [  # table B
        [  # row in table B
            [  # cell in table B
                "table B cell"
            ]
        ]
    ],
    [  # table C
        [  # row in table C
            [  # cell in table C
                "table A cell 2"
            ]
        ]
    ]
]
This ensures text appears
- only once
- in the order it appears in the docx
- always at depth four (i.e., result.body[i][j][k][l] will be a string).
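The transformation illustrated above can be sketched on plain nested lists. This is only an illustration of the idea, not how the package does it (the package works from the docx xml). It assumes a cell is a list of strings and a nested table is a list of rows sitting next to the cells, as in the example; is_table, split_table and flatten_document are hypothetical names.

```python
def is_table(item):
    """A table is a list of rows (lists); a cell is a list of strings."""
    return bool(item) and isinstance(item[0], list)


def split_table(table):
    """Return the top-level tables that one possibly nested table becomes."""
    tables, rows = [], []
    for row in table:
        cells = []
        for item in row:
            if is_table(item):
                # Close the table collected so far, then emit the nested one.
                if cells:
                    rows.append(cells)
                    cells = []
                if rows:
                    tables.append(rows)
                    rows = []
                tables.extend(split_table(item))
            else:
                cells.append(item)
        if cells:
            rows.append(cells)
    if rows:
        tables.append(rows)
    return tables


def flatten_document(document):
    """Promote nested tables so every paragraph ends up at depth 4."""
    flat = []
    for table in document:
        flat.extend(split_table(table))
    return flat


nested = [[[["A cell 1"], [[["B cell"]]], ["A cell 2"]]]]
print(flatten_document(nested))
# [[[['A cell 1']]], [[['B cell']]], [[['A cell 2']]]]
```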
This package provides several documented helper functions in the docx2python.iterators module. Here are a few recipes possible with these functions:
from docx2python.iterators import enum_cells
def remove_empty_paragraphs(tables):
    # modifies tables in place; returns None
    for (i, j, k), cell in enum_cells(tables):
        tables[i][j][k] = [x for x in cell if x]
>>> tables = [[[['a', 'b'], ['a', '', 'd', '']]]]
>>> remove_empty_paragraphs(tables)
>>> tables
[[[['a', 'b'], ['a', 'd']]]]
from docx2python.iterators import enum_at_depth
def html_map(tables) -> str:
    """Create an HTML map of document contents.
    Render this in a browser to visually search for data.
    :tables: value could come from, e.g.,
        * docx_to_text_output.document
        * docx_to_text_output.body
    """
    # prepend index tuple to each paragraph
    for (i, j, k, l), paragraph in enum_at_depth(tables, 4):
        tables[i][j][k][l] = " ".join([str((i, j, k, l)), paragraph])
    # wrap each paragraph in <pre> tags
    for (i, j, k), cell in enum_at_depth(tables, 3):
        tables[i][j][k] = "".join(["<pre>{x}</pre>".format(x=x) for x in cell])
    # wrap each cell in <td> tags
    for (i, j), row in enum_at_depth(tables, 2):
        tables[i][j] = "".join(["<td>{x}</td>".format(x=x) for x in row])
    # wrap each row in <tr> tags
    for (i,), table in enum_at_depth(tables, 1):
        tables[i] = "".join(["<tr>{x}</tr>".format(x=x) for x in table])
    # wrap each table in <table> tags
    tables = "".join(['<table border="1">{x}</table>'.format(x=x) for x in tables])
    return "<html><body>" + tables + "</body></html>"
>>> tables = [[[['a', 'b'], ['a', 'd']]]]
>>> html_map(tables)
<html>
<body>
<table border="1">
<tr>
<td>
'(0, 0, 0, 0) a'
'(0, 0, 0, 1) b'
</td>
<td>
'(0, 0, 1, 0) a'
'(0, 0, 1, 1) d'
</td>
</tr>
</table>
</body>
</html>
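The recipes above rely on an iterator that yields an index tuple together with each item at a given depth. For readers who want to see the idea without installing anything, here is a standard-library sketch of such an iterator. It is an assumption about the behavior, written from the way the recipes use it, and not the code in docx2python.iterators; the name enum_at_depth_sketch is hypothetical.

```python
def enum_at_depth_sketch(nested, depth):
    """Yield (index_tuple, item) for every item `depth` levels down."""
    if depth < 1:
        raise ValueError("depth must be at least 1")

    def walk(items, prefix, remaining):
        for i, item in enumerate(items):
            if remaining == 1:
                yield prefix + (i,), item
            else:
                yield from walk(item, prefix + (i,), remaining - 1)

    yield from walk(nested, (), depth)


tables = [[[["a", "b"], ["a", "d"]]]]
for index, paragraph in enum_at_depth_sketch(tables, 4):
    print(index, paragraph)
# (0, 0, 0, 0) a
# (0, 0, 0, 1) b
# (0, 0, 1, 0) a
# (0, 0, 1, 1) d
```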
Some fine print about checkboxes:
MS Word has checkboxes that can be checked any time, and others that can only be checked when the form is locked.
The former print as \u2610 (open checkbox) or \u2612 (crossed checkbox). With this module, the latter will too. I gave checkboxes a bailout value of ----checkbox failed---- if the xml doesn't look like I expect it to, because I don't have several thousand test files with checkboxes (as I did with most of the other form elements).
Checkboxes should work, but please let me know if you encounter any that do not.
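One way to spot checkboxes that hit the bailout value is to count the three markers in the extracted text. A small sketch, assuming the markers appear in the text exactly as written above (checkbox_summary is a hypothetical helper):

```python
OPEN_BOX = "\u2610"      # open checkbox
CROSSED_BOX = "\u2612"   # crossed checkbox
FAILED = "----checkbox failed----"  # bailout value described above


def checkbox_summary(text):
    """Count open, crossed, and failed checkbox markers in extracted text."""
    return {
        "open": text.count(OPEN_BOX),
        "crossed": text.count(CROSSED_BOX),
        "failed": text.count(FAILED),
    }


sample = "\u2612 agree\n\u2610 disagree\n----checkbox failed---- unsure"
print(checkbox_summary(sample))
# {'open': 1, 'crossed': 1, 'failed': 1}
```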