page layout analysis, ready to use, a wrapper for PaddleOCR

PaddlePulverizer

A wrapper for PaddleOCR and k2pdfopt

Introduction

  1. Page layout analysis (based on PaddlePaddle) of a pdf document
  2. Text reflow (using k2pdfopt) of the pdf document for reading on a Kindle Paperwhite 3

Installation

First, install Python 3.7.x ~ 3.8.x, poppler, and tesseract. For details, see Other dependencies below.

Then install the Python packages (without the Telegram bot function) as follows; a virtual environment is recommended:

py -m pip install -r requirements.txt
py -m pip install paddlepaddle==2.1.1 -i https://mirror.baidu.com/pypi/simple
py -m pip install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl

Dependencies

Python 3.7.x ~ 3.8.x, as required by the paddle dependency

Python packages

  • pdf processing
    • PyPDF4
    • pdf2image
    • pdf-annotate
  • image processing
    • opencv_contrib_python==4.4.0.46
    • opencv-python-headless==4.1.2.30
    • Pillow
  • Paddle series - It seems that these packages need to be installed individually.
    • PaddlePaddle
    • Layout-Parser
    • PaddleOCR
  • others

Other dependencies

Optional functions

  • k2pdfopt - reflows the text of a pdf file
    • Ubuntu - sudo apt-get install k2pdfopt -y

Usage

Command line

Help

See all options in detail:

py pulverizer.py -h

Page layout analysis

py pulverizer.py yourfile1.pdf [yourfile2.pdf ...] [-c 1] [-p 1 20]

The first run takes a while because the model data has to be downloaded. Page layout analysis starts once the download has finished.

Pdf files in the example folder show the result.

You can then edit the .md file based on the annotated pdf file (*_box.pdf or *_annotated.pdf).

line template of the .md file

1	x	61.87	697.18	104.68	712.64
pageNumber pageType left bottom right top
  • pageType
    • x for text
    • b for table
    • f for figure
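
A line in this format can be read with a few lines of Python. The sketch below is only an illustration, not the parser used by pulverizer.py: the names `LayoutBox`, `PAGE_TYPES` and `parse_layout_line` are hypothetical, and it assumes the fields are separated by whitespace (tabs in the sample line above).

```python
from typing import NamedTuple

# Mapping of the pageType codes listed above.
PAGE_TYPES = {"x": "text", "b": "table", "f": "figure"}


class LayoutBox(NamedTuple):
    """One line of the .md file (hypothetical container, for illustration)."""
    page_number: int
    page_type: str
    left: float
    bottom: float
    right: float
    top: float


def parse_layout_line(line: str) -> LayoutBox:
    """Parse 'pageNumber pageType left bottom right top' into a LayoutBox."""
    fields = line.split()  # assumption: fields are whitespace (tab) separated
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    page, page_type = fields[0], fields[1]
    if page_type not in PAGE_TYPES:
        raise ValueError(f"unknown pageType {page_type!r}")
    left, bottom, right, top = (float(value) for value in fields[2:])
    return LayoutBox(int(page), page_type, left, bottom, right, top)


# The sample line from the template above.
box = parse_layout_line("1\tx\t61.87\t697.18\t104.68\t712.64")
print(box.page_number, PAGE_TYPES[box.page_type], box.left, box.top)
```

Rejecting unknown pageType codes early makes a typo in a hand-edited .md file visible before any cropping is attempted.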

Crop pdf(s) based on .md file and reflow the text

python pulverizer.py yourfile.pdf [yourfile2.pdf ...] -md [-k 300]

The same arguments are applied to every pdf file listed.

Telegram bot

You can set one up yourself.

Settings

Windows
setx PULVERIZER_BOT_TOKEN "your bot token"
macOS
export ...
Linux
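
On any of these platforms, a Python process can read the variable from its environment once it is set. The following is a minimal sketch, not the bot's actual startup code: the helper name `load_bot_token` is hypothetical, and only the variable name `PULVERIZER_BOT_TOKEN` comes from the settings above.

```python
import os


def load_bot_token(env=None) -> str:
    """Return the token stored in PULVERIZER_BOT_TOKEN, or fail with a clear message."""
    # Allow a plain dict to be passed in for testing; default to the real environment.
    env = os.environ if env is None else env
    token = env.get("PULVERIZER_BOT_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "PULVERIZER_BOT_TOKEN is not set; define it before starting the bot"
        )
    return token
```

Note that on Windows, `setx` only affects consoles opened afterwards, so the bot has to be started from a new terminal window.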

Functions

Basics of Telegram Bot
/start
/help
Core
/pl    # page layout analysis
/pp    # get the .md and box pdf file
/md    # reflow
file manipulations
/gp    # get current pdf file name
/sp    # set current pdf file name
/ls    # list current files in your folder
/xk    # send the final reflowed pdf file
/rm    # clear your folder
/sn yourfilepath    # send the file with the given file name
/rn    # rename?

Problem

Packaging the source code with PyInstaller is very difficult because of the complex structure of the paddle(ocr) package(s).

Issues

  • the bottom of rectangle shapes should be lower
  • pdf-annotate - rectangle shapes drift slightly, but the pdf cropping is correct
  • multiprocess loses function (change to concurrent.futures) - 2023-11-23 - 2023-11-23
  • delete the last line of .md file - 2022-04-11 - 2022-11-20
  • opencv-python-headless==4.1.2.30 stackoverflow discussion
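
For the multiprocess item in the list above, the standard-library `concurrent.futures` module offers a compact way to run per-page work in parallel. The sketch below is not the project's code: `analyze_page` is a hypothetical stand-in for the real per-page layout analysis, and `run_pages` is an illustrative helper. It defaults to `ProcessPoolExecutor` but accepts any executor class, for example `ThreadPoolExecutor`.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List, Type


def analyze_page(page_number: int) -> str:
    """Hypothetical stand-in for the real per-page layout analysis."""
    return f"page {page_number} analyzed"


def run_pages(
    pages: Iterable[int],
    worker: Callable[[int], str] = analyze_page,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
    max_workers: int = 2,
) -> List[str]:
    """Run worker on every page in parallel; results keep the input order."""
    # Both ProcessPoolExecutor and ThreadPoolExecutor accept max_workers
    # and provide map(), which yields results in the order of the inputs.
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(worker, pages))
```

With the default process pool, the worker must be a module-level function so that it can be pickled, and on Windows the call to `run_pages` should sit under an `if __name__ == "__main__":` guard because worker processes are spawned rather than forked.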
