Reflower

Reflow a PDF file for e-readers like Kindle.

Supports English documents only.
Supports one-column and two-column documents.
Designed for academic papers (see the example below).

Get Started

Requirements

Linux-based OS
Python 3.6+

Preparations

Install Tesseract and its Python wrapper: install tesseract-ocr (I use Tesseract 5 with the tessdata_best models), then pip install pytesseract
Install layout-parser and its dependencies: pip install layoutparser, then install detectron2 and paddledetection
Install the remaining packages: pip install opencv-python Pillow pdf2image numpy
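
A quick way to confirm the preparations worked is to check that each package can be found before running the tool. The snippet below is a minimal sketch, not part of reflower itself: the helper names are hypothetical, and the module list assumes the usual import names (cv2 for opencv-python, PIL for Pillow). paddledetection is left out because its import name depends on how it was installed.

```python
import importlib.util
import shutil
from typing import List

# Assumed import names for the packages listed above
# (cv2 comes from opencv-python, PIL comes from Pillow).
REQUIRED_MODULES = [
    "pytesseract",
    "layoutparser",
    "detectron2",
    "cv2",
    "PIL",
    "pdf2image",
    "numpy",
]


def missing_modules(modules: List[str]) -> List[str]:
    """Return the names in `modules` that Python cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def tesseract_on_path() -> bool:
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None
```

Calling missing_modules(REQUIRED_MODULES) should return an empty list once everything is installed, and tesseract_on_path() should return True.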

Run

# For a single file
python reflower.py --source ./input.pdf --target ./output.pdf --target_paper pw3

# Parallel processing for multiple files
sudo apt install parallel
mkdir -p output log
find input/ -name "*.pdf" | parallel -j 4 --bar --results log python reflower.py --source {} --target ./output/{/} --target_paper pw3
find log -type f -name stderr -not -empty -printf '\n==> %p <==\n' -exec cat {} \;
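
If GNU parallel is not available, the same batch run can be approximated from Python. This is only a sketch under assumptions: build_command, run_one and run_batch are hypothetical helpers, it expects reflower.py in the current directory, and it reuses only the flags shown in the commands above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def build_command(source: Path, output_dir: Path, paper: str = "pw3") -> List[str]:
    """Build the reflower.py command line for one PDF."""
    return [
        sys.executable, "reflower.py",
        "--source", str(source),
        "--target", str(output_dir / source.name),
        "--target_paper", paper,
    ]


def run_one(source: Path, output_dir: Path, log_dir: Path) -> int:
    """Run reflower.py on one file and save any stderr output to log_dir."""
    result = subprocess.run(
        build_command(source, output_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.stderr:
        (log_dir / (source.stem + ".stderr")).write_text(result.stderr)
    return result.returncode


def run_batch(input_dir: Path, output_dir: Path, log_dir: Path, jobs: int = 4) -> Dict[Path, int]:
    """Process every PDF under input_dir with up to `jobs` concurrent workers."""
    output_dir.mkdir(parents=True, exist_ok=True)
    log_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(input_dir.rglob("*.pdf"))
    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = list(pool.map(lambda s: run_one(s, output_dir, log_dir), sources))
    return dict(zip(sources, codes))
```

Calling run_batch(Path("input"), Path("output"), Path("log")) roughly mirrors the parallel command above; any file with a non-zero exit code in the returned mapping is worth checking in the log directory.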

Example

Click a filename to download the PDF file; click an image to view it in a new tab.

input.pdf

intermediate.pdf

output.pdf

TODO

Copy vector text instead of rasterized text (this requires first converting the text in the PDF to outlines). This may slow down a PDF reader, so it may not be suitable for e-readers like Kindle.
Support scaling.
Don't run a second OCR pass with ocrmypdf; instead, reuse the first OCR results to create the invisible text layer. (Update: ocrmypdf has been removed and no text layer is currently added. If you need one, run the ocrmypdf CLI on the output.)
Results are poor for documents with many inline formulas (mainly because of poor OCR results).
Too slow :(
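
For the text-layer item above, a reflowed file can be passed through the ocrmypdf CLI as a separate step. The sketch below assumes ocrmypdf is installed and on PATH; the helper names and file names are hypothetical, and only the --language option is used.

```python
import subprocess
from typing import List


def build_ocrmypdf_command(source: str, target: str, language: str = "eng") -> List[str]:
    """Build an ocrmypdf command that writes `target` with a text layer added."""
    return ["ocrmypdf", "--language", language, source, target]


def add_text_layer(source: str, target: str) -> None:
    """Run ocrmypdf on `source`; raises CalledProcessError if it fails."""
    subprocess.run(build_ocrmypdf_command(source, target), check=True)
```

For example, add_text_layer("output.pdf", "output_ocr.pdf") would produce a searchable copy of the reflowed file, at the cost of a second OCR pass.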
