Primary language: Python. License: GNU Affero General Public License v3.0 (AGPL-3.0).

Reflower

Reflow a PDF file for e-readers like Kindle.

  • Supports English documents only.
  • Supports one-column and two-column documents.
  • Designed for academic papers (see the example below).

Get Started

Requirements

  • Linux-based OS
  • Python 3.6+

Preparations

Run

# For a single file
python reflower.py --source ./input.pdf --target ./output.pdf --target_paper pw3

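# Parallel processing for multiple files (4 jobs at a time)
sudo apt install parallel
mkdir -p output log
# {} is the input path, {/} is its basename; per-job stdout/stderr go under log/
find input/ -name "*.pdf" | parallel -j 4 --bar --results log python reflower.py --source {} --target ./output/{/} --target_paper pw3
# Print every non-empty stderr file, preceded by its path
find log -type f -name stderr -not -empty -printf '\n==> %p <==\n' -exec cat {} \;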
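
For systems where GNU parallel is not available, the batch run above can be approximated from Python. This is a minimal sketch, not part of the project: it assumes `reflower.py` is in the current directory and accepts exactly the `--source`, `--target`, and `--target_paper` flags shown above. The helper names (`build_command`, `reflow_one`, `reflow_all`) are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Tuple


def build_command(source: Path, output_dir: Path, paper: str = "pw3") -> List[str]:
    """Build the reflower.py command line from the README for one PDF."""
    target = output_dir / source.name  # keep the basename, like {/} in GNU parallel
    return [
        sys.executable, "reflower.py",
        "--source", str(source),
        "--target", str(target),
        "--target_paper", paper,
    ]


def reflow_one(source: Path, output_dir: Path) -> Tuple[Path, int, str]:
    """Run reflower.py on one file; return (source, exit code, stderr text)."""
    result = subprocess.run(
        build_command(source, output_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    return source, result.returncode, result.stderr


def reflow_all(input_dir: Path, output_dir: Path, jobs: int = 4) -> List[Path]:
    """Reflow every *.pdf under input_dir; return the sources that failed."""
    output_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(input_dir.rglob("*.pdf"))
    failed = []
    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = pool.map(lambda s: reflow_one(s, output_dir), sources)
        for source, code, stderr in results:
            if stderr.strip():
                print("\n==> {} <==\n{}".format(source, stderr), file=sys.stderr)
            if code != 0:
                failed.append(source)
    return failed


if __name__ == "__main__":
    failures = reflow_all(Path("input"), Path("output"))
    sys.exit(1 if failures else 0)
```

As in the shell version, files that share a basename in different subdirectories of `input/` would overwrite each other in `output/`.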

Example

Click a filename to download the PDF file; click an image to view it in a new tab.

  • input.pdf
  • intermediate.pdf
  • output.pdf

TODO

  • Copy vector text instead of rasterized text (this requires first converting the text in the PDF to outlines). This may slow down PDF readers, so it might not be suitable for e-readers like Kindle.
  • Support scaling
  • Don't run a second OCR pass with ocrmypdf; instead, reuse the first OCR results to create the invisible text layer. (Update: ocrmypdf has been removed, and currently no text layer is added. If you need one, use the ocrmypdf CLI.)
  • Poor results on documents with many inline formulas (mainly because of poor OCR results)
  • Too slow :(
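
Regarding the text-layer item above: since Reflower currently adds no text layer, one option is to run ocrmypdf on the reflowed file afterwards. The following is a minimal sketch, assuming ocrmypdf is installed separately and on the PATH. It uses only the basic `ocrmypdf <input> <output>` form; the helper names and the file names `output.pdf` and `output_ocr.pdf` are placeholders.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def ocr_command(reflowed: Path, searchable: Path) -> List[str]:
    """Basic ocrmypdf invocation: input PDF first, then output PDF."""
    return ["ocrmypdf", str(reflowed), str(searchable)]


def add_text_layer(reflowed: Path, searchable: Path) -> None:
    """Write a copy of the reflowed PDF with an OCR text layer added."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError if ocrmypdf exits with an error
    subprocess.run(ocr_command(reflowed, searchable), check=True)


if __name__ == "__main__":
    add_text_layer(Path("output.pdf"), Path("output_ocr.pdf"))
```

This is the second OCR pass that the TODO item hopes to avoid, so it adds to the total processing time.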