a wrapper for PaddleOCR and k2pdfopt
- page layout analysis (based on
PaddlePaddle
) of a pdf document - text reflow (using
k2pdfopt
) of the pdf document for reading on a kindle paperwhite 3
Firstly, Python 3.7.x ~ 3.8.x
, poppler
and tesseract
should be installed. For details, refer to Other dependencies
below.
Then the installation of all python packages without telegram bot function is as follows (a virtual environment is recommended):
py -m pip install -r requirements.txt
py -m pip install paddlepaddle==2.1.1 -i https://mirror.baidu.com/pypi/simple
py -m pip install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl
Python 3.7.x ~ 3.8.x due to paddle dependency
- pdf processing
PyPDF4
pdf2image
pdf-annotate
- image processing
opencv_contrib_python==4.4.0.46
opencv-python-headless==4.1.2.30
Pillow
Paddle series
- It seems that these packages need to be installed individually.PaddlePaddle
Layout-Parser
PaddleOCR
- others
tqdm
loguru
numpy
pytesseract
python-telegram-bot
(optional, <= 13.15)
poppler
- the dependency ofpdf2image
packagetesseract
- OCR function for figure caption recognition
See options in details:
py pulverizer.py -h
py pulverizer.py yourfile1.pdf [yourfile2.pdf ...] [-c 1] [-p 1 20]
When you run the code for the first time, it will take a while to download model data. After that, page layout analysis will start to work.
Pdf files in the example folder show the result.
Then you could edit the .md
file based on the annotated pdf file (*_box.pdf
or *_annotated.pdf
).
line template of the .md
file
1 x 61.87 697.18 104.68 712.64
pageNumber pageType left bottom right top
pageType
x
for textb
for tablef
for figure
python pulverizer.py yourfile.pdf [yourfile2.pdf ...] -md [-k 300]
The same pattern (arguments) is applied to all yourfile.pdf
.
You can set up one by yourself.
setx PULVERIZER_BOT_TOKEN "your bot token"
export ...
/start
/help
/pl # page layout analysis
/pp # get the .md and box pdf file
/md # reflow
/gp # get current pdf file name
/sp # set current pdf file name
/ls # list current files in your folder
/xk # send the final reflowed pdf file
/rm # clear your folder
# send file with file name
/sn yourfilepath
# rename?
/rn
It is very difficult to pack the source code together via pyinstaller
due to the complex structures of paddle(ocr)
package(s).
- the bottom of rectangle shapes should be lower
-
pdf-annotate
- rectangle shapes have some drift but the pdf cropping is correct -
multiprocess
loses function (change toconcurrent.futures
) - 2023-11-23 - 2023-11-23 - delete the last line of
.md
file - 2022-04-11 - 2022-11-20 -
opencv-python-headless==4.1.2.30
stackoverflow discussion