PDFSPAM
A very basic re-implementation of pdfsandwich in Python, since
- The given implementation didn't work on my system
- All the tool invocations were nicely logged in verbose mode
- Python was the easiest way to reach my goal
What does it do? From Tobias Elze's website:
pdfsandwich is a command line tool which is supposed to be useful to OCR scanned books or journals. It is able to recognize the page layout even for multicolumn text.
Essentially, pdfsandwich is a wrapper script which calls the following binaries: convert, gs, hocr2pdf, and tesseract. It is known to run on Unix systems and has been tested on Linux and MacOS X. It supports parallel processing on multiprocessor systems.
Feel free to use & modify!