Parse PDF file with PyMuPDF and generate docx with python-docx

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

pdf2docx

  • Parse layout (text, image and table) from PDF file with PyMuPDF
  • Generate docx with python-docx
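
The two steps above can be sketched as a single conversion call. This is a minimal sketch, assuming the package's `Converter` class with its `convert()` and `close()` methods; `default_docx_path` and `convert_pdf` are hypothetical helper names introduced only for this example, and the file names in the usage note are placeholders.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Return the PDF path with its suffix swapped to .docx (hypothetical helper)."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf(pdf_path, docx_path=None, start=0, end=None):
    """Convert pages [start, end) of a PDF to docx; page indexes are zero-based."""
    # Imported lazily so default_docx_path() works without pdf2docx installed.
    from pdf2docx import Converter

    docx_path = Path(docx_path) if docx_path else default_docx_path(pdf_path)
    cv = Converter(str(pdf_path))
    try:
        # end=None is assumed to mean "up to the last page".
        cv.convert(str(docx_path), start=start, end=end)
    finally:
        cv.close()  # release the underlying PDF document
    return docx_path
```

Calling `convert_pdf("input.pdf")` would then write `input.docx` next to the source file.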

Features

  • Parse and re-create paragraph

    • text in horizontal/vertical direction: from left to right, from bottom to top
    • font style, e.g. font name, size, weight, italic and color
    • text format, e.g. highlight, underline, strike-through
    • text alignment, e.g. left/right/center/justify
    • external hyperlink
    • paragraph layout: horizontal alignment and vertical spacing
    • list style
  • Parse and re-create image

    • in-line image
    • image in Gray/RGB/CMYK mode
    • transparent image
    • floating image, i.e. picture behind text
  • Parse and re-create table

    • border style, e.g. width, color
    • shading style, i.e. background color
    • merged cells
    • vertical direction cell
    • table with partly hidden borders
    • nested tables
  • Parse pages with multi-processing
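
Multi-processing might be enabled as sketched below. This assumes `convert()` accepts `multi_processing` and `cpu_count` keyword arguments; `pick_cpu_count` and `convert_in_parallel` are hypothetical helpers written for this example.

```python
import os


def pick_cpu_count(requested=None):
    """Clamp a requested worker count to the range [1, available CPUs]."""
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


def convert_in_parallel(pdf_path, docx_path, start=0, end=None, cpu_count=None):
    """Convert a continuous page range, parsing pages in several processes."""
    # Imported lazily so pick_cpu_count() works without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        cv.convert(
            docx_path,
            start=start,
            end=end,
            multi_processing=True,  # assumed keyword, see lead-in
            cpu_count=pick_cpu_count(cpu_count),
        )
    finally:
        cv.close()
```

On platforms that spawn new processes (for example Windows), call `convert_in_parallel(...)` from inside an `if __name__ == "__main__":` guard, as with any use of Python's `multiprocessing`.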

It can also be used as a tool to extract table contents, since both the content and the format/style of each table are parsed.
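
A minimal sketch of table extraction follows. It assumes `Converter.extract_tables()` returns a list of tables, each a list of rows of cell strings, with `None` standing in for cells swallowed by a merge. `table_to_csv` and `extract_tables_as_csv` are hypothetical helpers, not part of the package.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows of cell values) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Merged cells are assumed to arrive as None; write them as empty fields.
        writer.writerow(["" if cell is None else str(cell).strip() for cell in row])
    return buffer.getvalue()


def extract_tables_as_csv(pdf_path, start=0, end=None):
    """Return one CSV string per table found on pages [start, end)."""
    # Imported lazily so table_to_csv() works without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        tables = cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
    return [table_to_csv(table) for table in tables]
```

Only cell text survives the CSV step; border, shading and merge information is dropped, so this suits data extraction rather than faithful re-creation.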

Limitations

  • Text-based PDF file only
  • Normal reading direction only
    • horizontal/vertical paragraph/line/word
    • no word transformation, e.g. rotation
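
Because only text-based PDFs are supported, it can help to check for extractable text before converting. The sketch below is a rough heuristic, not something the package provides: it assumes PyMuPDF's `fitz.open()` and `Page.get_text()`, and the names `has_text_layer`, `is_text_based` and the `min_chars` threshold are made up for this example.

```python
def has_text_layer(page_texts, min_chars=20):
    """True if the pages together hold at least min_chars non-whitespace characters."""
    total = sum(len("".join(text.split())) for text in page_texts)
    return total >= min_chars


def is_text_based(pdf_path, min_chars=20):
    """Heuristic check: does the PDF contain enough extractable text?"""
    # Imported lazily so has_text_layer() works without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        texts = [page.get_text() for page in doc]
    return has_text_layer(texts, min_chars)
```

A scanned PDF that carries an OCR text layer would pass this check, so a positive result does not guarantee a good conversion.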

Documentation

Sample

sample_compare.png