This simple script performs OCR (optical character recognition) on a raster PDF file via Tesseract and produces a plain text file.
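For orientation, here is a minimal sketch of how such a pipeline could be wired together from the three tools listed under Requirements: pdfinfo reports the page count, Ghostscript renders each page to an image, and Tesseract recognizes the text. This is an illustration under assumptions, not necessarily how pdf2text-OCR.py itself is written; the function names, the 300 dpi resolution and the pnggray device are all choices made for this example.

```python
import os
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the page count from the text printed by `pdfinfo`."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def render_command(pdf: str, page: int, image: str, dpi: int = 300) -> list:
    """Ghostscript command that rasterizes one page to a grayscale PNG."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pnggray",
            f"-r{dpi}", f"-dFirstPage={page}", f"-dLastPage={page}",
            f"-sOutputFile={image}", pdf]


def ocr_command(image: str, language: str) -> list:
    """Tesseract command that prints the recognized text to stdout."""
    return ["tesseract", image, "stdout", "-l", language]


def ocr_pdf(pdf: str, language: str, workdir: str) -> str:
    """Hypothetical driver: OCR every page and join the results."""
    info = subprocess.run(["pdfinfo", pdf], capture_output=True,
                          text=True, check=True).stdout
    texts = []
    for page in range(1, parse_page_count(info) + 1):
        image = os.path.join(workdir, f"page-{page}.png")
        subprocess.run(render_command(pdf, page, image), check=True)
        result = subprocess.run(ocr_command(image, language),
                                capture_output=True, text=True, check=True)
        texts.append(result.stdout)
    return "\n".join(texts)
```

Rendering one page at a time keeps the temporary images small, at the cost of starting Ghostscript once per page.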
Usage
$ pdf2text-OCR.py <input.pdf> <output.txt> <language>
where <language> is a 3-letter ISO 639-2 code, as used by Tesseract. Several languages can be combined with "+".
Examples:
$ pdf2text-OCR.py book.pdf book.txt eng
$ pdf2text-OCR.py book.pdf book.txt eng+fra
Remark: The code must be "eng", not "en"; otherwise Tesseract produces an error. (TODO: add a verification of this argument.)
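One way such a verification could look is sketched below. It is not part of the script: the function names are made up for this example, and it assumes that `tesseract --list-langs` prints one installed language per line after a header line starting with "List of". Each part of a specification like "eng+fra" is compared against that list.

```python
import subprocess


def installed_languages() -> set:
    """Ask Tesseract which language packs are installed."""
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=True)
    # Depending on the Tesseract version the list goes to stdout or stderr,
    # so both streams are read here.
    lines = (result.stdout + result.stderr).splitlines()
    return {line.strip() for line in lines
            if line.strip() and not line.startswith("List of")}


def unknown_languages(spec: str, installed: set) -> list:
    """Return the parts of a spec such as 'eng+fra' that are not installed."""
    return [code for code in spec.split("+") if code not in installed]
```

With this, `unknown_languages("en", installed_languages())` would return `["en"]` on a system without a language pack named "en", and the script could stop with a readable message before Tesseract is started.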
Installation for one user
$ mkdir -pv ~/bin
$ cd ~/bin/
$ git clone https://gitlab.com/maxim.leyenson/pdf2text-OCR
and then add the lines
PATH=$PATH:$HOME/bin/pdf2text-OCR
export PATH
to your .bashrc file.
Requirements
- poppler-utils (for pdfinfo),
- GhostScript (gs),
- tesseract
For example, on Fedora Linux you can install them with
$ sudo dnf install -y poppler-utils ghostscript tesseract
$ sudo dnf install -y tesseract-langpack-fra
(and whatever other languages you need)
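To confirm that the requirements are actually on the PATH, a small check along the following lines can help. It is a sketch, not part of the script; the executable names pdfinfo, gs and tesseract are taken from the list above, while `REQUIRED_TOOLS` and `missing_tools` are names invented for this example.

```python
import shutil

# Executables provided by poppler-utils, GhostScript and tesseract.
REQUIRED_TOOLS = ("pdfinfo", "gs", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS) -> list:
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```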
Remark
You do not need this script if your PDF file already contains a text layer. In this case all you have to do is run
$ pdftotext -layout -nopgbrk book.pdf book.txt
where pdftotext is part of the poppler-utils package.
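If you are not sure whether a file has a text layer, you can test it by running pdftotext and checking whether anything other than whitespace comes out. The sketch below does this; the function names and the threshold of 20 characters are arbitrary choices for this example, and the final "-" argument tells pdftotext to write to standard output.

```python
import subprocess


def extract_text_layer(pdf: str) -> str:
    """Run pdftotext and return whatever text the PDF already contains."""
    result = subprocess.run(["pdftotext", "-layout", "-nopgbrk", pdf, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Treat the PDF as searchable if enough non-whitespace text came out."""
    visible = "".join(extracted.split())
    return len(visible) >= min_chars
```

A scanned book typically yields an empty or nearly empty result, in which case OCR is needed; a small threshold avoids being fooled by a stray page number or stamp.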