Parse all contents of a docx file with python-docx
python3 -m pip install docx-parser
- paragraph: text paragraph, with style_id
- multipart: paragraph with image or hyperlink
- table: table data with merged_cells
docx_parser --help
# parse image as file
docx_parser tests/demo.docx -D tests/media -o tests/out.file.jl
# parse image as base64 string
docx_parser tests/demo.docx -A base64 -o tests/out.base64.jl
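The `.jl` extension on the output files suggests JSON Lines (one JSON value per line); that is an assumption here, not something the text above states. Under that assumption, a minimal sketch for loading the output back into Python might look like this. `read_jsonl` is a hypothetical helper, not part of docx_parser.

```python
import json


def read_jsonl(path):
    """Load a JSON Lines file into a list.

    Hypothetical helper, not part of docx_parser.
    Assumes each non-empty line holds one JSON value.
    """
    records = []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `records = read_jsonl('tests/out.file.jl')` would return one Python object per parsed item, if the file really is JSON Lines.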
from docx_parser import DocumentParser
infile = 'tests/demo.docx'
doc = DocumentParser(infile)
for _type, item in doc.parse():
    print(_type, item)
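Because `doc.parse()` yields `(type, item)` pairs, as the loop above shows, the items can be grouped by type with the standard library alone. This is only a sketch: `group_by_type` is a hypothetical helper, not part of docx_parser, and it makes no assumption about what each item contains.

```python
from collections import defaultdict


def group_by_type(pairs):
    """Group (type, item) pairs into a dict mapping type -> list of items."""
    groups = defaultdict(list)
    for _type, item in pairs:
        groups[_type].append(item)
    return dict(groups)


# Usage with docx_parser (requires the package and a docx file):
# from docx_parser import DocumentParser
# groups = group_by_type(DocumentParser('tests/demo.docx').parse())
# print({name: len(items) for name, items in groups.items()})
```

Grouping first makes it easy to handle, say, only the `table` items without re-parsing the document.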
- parse text style: color, bgcolor, font, bold, italic ...
- parse paragraph format