Reflow a PDF file for e-readers like Kindle.
- Supports English documents only.
- Supports one-column or two-column documents.
- Designed for academic papers (see the example below).
- Linux-based OS
- Python 3.6+
- Install tesseract-related packages: install tesseract-ocr (I use Tesseract 5 and tessdata_best models) and pytesseract (pip install pytesseract)
- Install layout-parser along with dependency packages: pip install layoutparser, install detectron2, install paddledetection
- Install other smaller packages: pip install opencv-python Pillow pdf2image numpy
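Before a long run, it can help to confirm the dependencies are importable. The following is a minimal sketch, not part of reflower.py: the function name find_missing and the two lists are hypothetical. It assumes the usual import names (cv2 for opencv-python, PIL for Pillow), and paddledetection is left out because its import name depends on how it was installed.

```python
import importlib.util
import shutil

# Import names for the packages listed above (assumption: opencv-python
# imports as cv2 and Pillow as PIL; paddledetection is not checked here).
REQUIRED_MODULES = ["pytesseract", "layoutparser", "detectron2",
                    "cv2", "PIL", "pdf2image", "numpy"]
# Command-line tools expected on PATH.
REQUIRED_BINARIES = ["tesseract"]


def find_missing(modules=REQUIRED_MODULES, binaries=REQUIRED_BINARIES):
    """Return (missing_modules, missing_binaries) without importing anything."""
    missing_modules = [m for m in modules
                       if importlib.util.find_spec(m) is None]
    missing_binaries = [b for b in binaries if shutil.which(b) is None]
    return missing_modules, missing_binaries


if __name__ == "__main__":
    modules, binaries = find_missing()
    if modules:
        print("Missing Python packages:", ", ".join(modules))
    if binaries:
        print("Missing command-line tools:", ", ".join(binaries))
    if not modules and not binaries:
        print("All listed dependencies were found.")
```

The check only looks up whether each module and tool can be found; it does not verify versions or that the Tesseract language models are installed.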
# For a single file
python reflower.py --source ./input.pdf --target ./output.pdf --target_paper pw3
# Parallel processing for multiple files
sudo apt install parallel
mkdir -p output log
find input/ -name "*.pdf" | parallel -j 4 --bar --results log python reflower.py --source {} --target ./output/{/} --target_paper pw3
find log -type f -name stderr -not -empty -printf '\n==> %p <==\n' -exec cat {} \;
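If GNU parallel is not available, the same batch run can be approximated from Python. This is a rough sketch under stated assumptions: the helper names build_command, reflow_one, and reflow_all are hypothetical, it only uses the reflower.py flags shown above, and it expects to be run from the directory containing reflower.py.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(source, output_dir, target_paper="pw3"):
    """Build the reflower.py command line for one input PDF."""
    source = Path(source)
    target = Path(output_dir) / source.name  # same basename, like {/} in parallel
    return [sys.executable, "reflower.py",
            "--source", str(source),
            "--target", str(target),
            "--target_paper", target_paper]


def reflow_one(source, output_dir):
    """Run reflower.py on one file and capture its stderr."""
    result = subprocess.run(build_command(source, output_dir),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    return source, result.returncode, result.stderr


def reflow_all(input_dir="input", output_dir="output", jobs=4):
    """Process every PDF under input_dir with up to `jobs` concurrent runs."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    sources = sorted(Path(input_dir).rglob("*.pdf"))
    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(lambda s: reflow_one(s, output_dir), sources))
    # Print non-empty stderr per file, similar to the find command above.
    for source, _code, stderr in results:
        if stderr.strip():
            print("\n==> {} <==\n{}".format(source, stderr), file=sys.stderr)
    return results


if __name__ == "__main__":
    reflow_all()
```

As with the parallel command, PDFs with the same filename in different subfolders of input/ would overwrite each other in output/.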
Click the filename to download the PDF file; click the image to view it in a new tab.
- Copy vector text instead of rasterized text (the text in the PDF must first be converted to outlines). However, this may slow down the PDF reader, so it may not be suitable for e-readers like Kindle?
- Support scaling
- Don't run a second OCR pass with ocrmypdf; instead, reuse the first OCR results to create the invisible text layer. (Update: ocrmypdf has been removed, and currently no text layer is added. If you need one, simply use the ocrmypdf CLI.)
- Poor results on documents with many inline formulas (mainly because of poor OCR results)
- Too slow :(