OCRmyFiles

Bash script for adding a text layer to PDF files and converting images to PDFs (with OCR).

Adds an OCR text layer to all PDF files in the given input directory and saves the new PDF files to the output directory.

When the input directory also contains image files (e.g. jpg, png), these are converted to (OCR'ed) PDFs.

All other file types are just copied from the input directory to the output directory.
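
The per-file behavior described above can be sketched as a dispatch on the file extension. This is only an illustration, not the script's actual code: the function name `classify_file` and the exact list of image extensions are assumptions.

```shell
# Hypothetical helper: decide what to do with a file based on its extension.
# Prints "ocr" for PDFs, "convert" for images, "copy" for everything else.
classify_file() {
    local name ext
    name="${1##*/}"                      # strip the directory part
    case "$name" in
        *.*) ext="${name##*.}" ;;        # text after the last dot
        *)   ext="" ;;                   # no extension at all
    esac
    # Lowercase the extension so that .PDF and .pdf are treated alike.
    ext="$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')"
    case "$ext" in
        pdf)          echo "ocr" ;;      # add an OCR text layer
        jpg|jpeg|png) echo "convert" ;;  # convert to an OCR'ed PDF
        *)            echo "copy" ;;     # copy unchanged
    esac
}

classify_file "scan.pdf"     # prints "ocr"
classify_file "photo.JPG"    # prints "convert"
classify_file "notes.txt"    # prints "copy"
```

Keeping the decision in one small function makes it easy to extend the list of image types without touching the OCR or copy steps.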

Requirements

OCRmyPDF
For Debian 9/Ubuntu 16.10: apt-get install ocrmypdf
For other distros: https://ocrmypdf.readthedocs.io/en/latest/installation.html
Tesseract
This is installed with OCRmyPDF automatically
Tesseract language files
e.g. apt-get install tesseract-ocr-deu for German
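
Before running the script, it may help to confirm that the requirements listed above are actually installed. The following is a minimal sketch; the helper name `require_cmd` is hypothetical and not part of the script.

```shell
# Hypothetical helper: report whether a command is available on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

require_cmd ocrmypdf || true
require_cmd tesseract || true

# List the language packs Tesseract can see (e.g. "deu" for German).
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs
fi
```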

Download the script or clone the repository
Make the script executable: sudo chmod +x OCRmyFiles.sh
Modify the script to fit your needs:
- Set default input/output directories
- Modify the OCRmyPDF command line arguments (see the OCRmyPDF documentation for an overview of the available arguments)
- Modify the Tesseract command line arguments (see the Tesseract documentation for an overview of the available arguments)
Call the script:
- OCRmyFiles.sh (no parameter): using default directories for input/output (as defined in the script itself)
- OCRmyFiles.sh <inputDir> <outputDir>: using specified directories for input/output
The script might print some warnings/errors from Tesseract. These can be ignored in most cases because the OCR text layer is still created.
You can also call this script with a cronjob for automated processing of PDFs/images:
- As the user who should run the cronjob, call crontab -e
- Add the following to run the script e.g. every 30 minutes: */30 * * * * /path/to/the/script/OCRmyFiles.sh > /dev/null 2>&1
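
The two call forms listed above (with and without directory parameters) could be handled with shell default-value expansion. This is a sketch under assumptions: the variable names, the placeholder paths, and the helper `resolve_dirs` are hypothetical, and the real defaults are whatever is set in the script itself.

```shell
# Assumed defaults for illustration only; the real values are set in the script.
DEFAULT_INPUT_DIR="/path/to/input"
DEFAULT_OUTPUT_DIR="/path/to/output"

# Hypothetical helper: print the directories that would be used.
# Falls back to the defaults when a parameter is missing or empty.
resolve_dirs() {
    local input_dir output_dir
    input_dir="${1:-$DEFAULT_INPUT_DIR}"
    output_dir="${2:-$DEFAULT_OUTPUT_DIR}"
    echo "input=$input_dir output=$output_dir"
}

resolve_dirs                             # no parameters: defaults are used
resolve_dirs "/scans/in" "/scans/out"    # explicit directories
```

With this pattern, the cronjob entry can call the script without parameters and still pick up the defaults.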