pdf2text-OCR

This simple script runs OCR (optical character recognition) on a raster PDF file via Tesseract and produces a plain text file.

Usage

$ pdf2text-OCR.py <input.pdf> <output.txt> <language>

where <language> is a 3-character ISO 639-2 code. Several languages can be combined with "+".

Examples:

$ pdf2text-OCR.py book.pdf book.txt eng
$ pdf2text-OCR.py book.pdf book.txt eng+fra

Remark: The language code must be the 3-letter form "eng", not "en"; otherwise Tesseract produces an error. The script does not verify this yet.
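A verification of the language argument could look roughly like the sketch below. It is not part of the script; the function names are hypothetical. It relies on `tesseract --list-langs`, which prints the installed language packs, and assumes the output is a header line followed by one language name per line.

```python
import subprocess


def parse_list_langs(output):
    """Extract language names from the text printed by `tesseract --list-langs`."""
    names = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        names.append(line)
    return names


def missing_languages(spec, installed):
    """Return the parts of a spec such as "eng+fra" that are not installed."""
    return [code for code in spec.split("+") if code not in installed]


def installed_languages():
    """Ask Tesseract which language packs are available on this machine."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

Calling `missing_languages("en", installed_languages())` before starting the OCR would report "en" as unknown, so the script could stop with a clear message instead of a Tesseract error.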

Installation for one user

   $ mkdir -pv ~/bin
   $ cd ~/bin/
   $ git clone  https://gitlab.com/maxim.leyenson/pdf2text-OCR

and then add the lines

PATH=$PATH:$HOME/bin/pdf2text-OCR
export PATH

to your .bashrc file.

Requirements

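The script needs poppler-utils, Ghostscript, and Tesseract, plus the Tesseract language packs for the languages you want to recognize. In Fedora Linux, for example, you can install them with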

$ sudo dnf install -y poppler-utils ghostscript tesseract
$ sudo dnf install -y tesseract-langpack-fra

(and whatever other languages you need)
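To check that the required tools are on your PATH, a small Python sketch such as the following may help. The executable names are assumptions (pdftotext from poppler-utils, gs from ghostscript, tesseract from tesseract); the script itself may call other programs from these packages.

```python
import shutil

# Assumed executable names, one per required package.
REQUIRED_TOOLS = ("pdftotext", "gs", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list from `missing_tools()` means every listed executable was found.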

Remark

You do not need this script if your PDF file already contains a text layer. In this case all you have to do is run

$ pdftotext -layout -nopgbrk book.pdf book.txt

where pdftotext is part of the poppler-utils package.
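To decide automatically whether a PDF already has a text layer, one option is to run pdftotext and see whether any real text comes out. The sketch below is a heuristic under that assumption; the function names and the 20-character threshold are hypothetical choices, not part of the script.

```python
import subprocess


def looks_like_text(extracted, min_chars=20):
    """Heuristic: enough non-whitespace characters suggests a text layer."""
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def has_text_layer(pdf_path, min_chars=20):
    """Run pdftotext on the file and apply the heuristic to its output."""
    result = subprocess.run(
        # "-" as the output file makes pdftotext write to standard output.
        ["pdftotext", "-layout", "-nopgbrk", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return looks_like_text(result.stdout, min_chars)
```

If `has_text_layer("book.pdf")` returns True, the pdftotext command above is probably enough; otherwise the file is likely a pure scan and needs OCR.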

MaximLeyenson/pdf2text-OCR