PDF-to-Text

Converts a batch of PDF files to text using Tesseract OCR and the pdf2image package, with optional keyword matching that moves matching files into a separate directory.

pdf-to-text was originally built as an afternoon project to help a close friend quickly locate relevant information after receiving several thousand PDFs in an open records request.

How it works, and why you should use it

There is literally no good reason to use it. There are numerous packages that do these things better and faster; messing with unreadable and unpredictable data is just fun. With that being said, given a source directory containing PDFs, the script will:

Convert each PDF file into JPEG images using pdf2image, exporting all images into a temporary directory;
Convert each JPEG into text using pytesseract, writing the resulting text file into the output directory;
If keywords are provided, scan the text files and check whether any keywords are present within the extracted text. If any are, the file is moved to a matches directory within the output directory;
PDF file sizes are checked prior to processing, using a default maximum size unless one is explicitly provided. If a file exceeds the maximum size, it is moved to a skipped directory within the output directory;
Unless explicitly specified otherwise, all images converted from PDFs are deleted after the PDF processing stage.
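
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src/pdf_to_text.py: the function names, the 50 MB default limit, case-insensitive matching, and the choice to move the text file (rather than the PDF) into matches are all assumptions made for the example. The OCR step is passed in as a callable so the surrounding logic can be tried without Tesseract installed.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Optional

# Hypothetical default; the real script's limit may differ.
DEFAULT_MAX_BYTES = 50 * 1024 * 1024


def ocr_pdf(pdf_path: Path, image_dir: Path) -> str:
    """Render each page to a JPEG with pdf2image, then OCR it with pytesseract."""
    # Imported lazily so the helpers below work without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    texts = []
    for number, page in enumerate(convert_from_path(str(pdf_path)), start=1):
        image_path = image_dir / f"{pdf_path.stem}_{number}.jpg"
        page.save(image_path, "JPEG")
        texts.append(pytesseract.image_to_string(str(image_path)))
    return "\n".join(texts)


def contains_keyword(text: str, keywords: Iterable[str]) -> bool:
    """Case-insensitive substring match (an assumption about matching rules)."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def process_directory(
    source: Path,
    output: Path,
    keywords: Iterable[str] = (),
    max_bytes: Optional[int] = DEFAULT_MAX_BYTES,
    keep_images: bool = False,
    ocr: Callable[[Path, Path], str] = ocr_pdf,
) -> None:
    keywords = list(keywords)
    output.mkdir(parents=True, exist_ok=True)
    image_dir = Path(tempfile.mkdtemp(prefix="pdf_images_"))
    try:
        for pdf_path in sorted(source.glob("*.pdf")):
            # Size check happens before any conversion work.
            if max_bytes is not None and pdf_path.stat().st_size > max_bytes:
                skipped = output / "skipped"
                skipped.mkdir(exist_ok=True)
                shutil.move(str(pdf_path), str(skipped / pdf_path.name))
                continue

            # PDF -> JPEG -> text, written into the output directory.
            text_path = output / f"{pdf_path.stem}.txt"
            text_path.write_text(ocr(pdf_path, image_dir), encoding="utf-8")

            # Optional keyword scan; matching text files go to output/matches.
            if keywords and contains_keyword(text_path.read_text(encoding="utf-8"), keywords):
                matches = output / "matches"
                matches.mkdir(exist_ok=True)
                shutil.move(str(text_path), str(matches / text_path.name))
    finally:
        # Temporary images are removed unless the caller asks to keep them.
        if not keep_images:
            shutil.rmtree(image_dir, ignore_errors=True)
```

Calling process_directory(Path("pdfs"), Path("out"), keywords=["budget"]) would, under these assumptions, leave plain text files in out, matching ones in out/matches, and oversized PDFs in out/skipped.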

Installation

This package requires Tesseract OCR as well as the Python dependencies listed in requirements.txt:

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
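
Since pip does not install the Tesseract binary itself, a quick check that it is reachable can save a confusing failure later. The snippet below is a small sketch, not part of this package; the helper name tesseract_available is made up for illustration, and it assumes the executable is named tesseract and lives on PATH.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable with the given name is found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found")
    else:
        print("Tesseract not found on PATH; install it before running the script")
```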

Usage

After installing dependencies, you can run the pdf_to_text.py script with the -h flag to see all available options:

python src/pdf_to_text.py -h

Local Development

A separate requirements-dev.txt file is included for linting, pre-commit checks, testing, etc. To start, create a virtualenv and install all dependencies:

python -m venv .venv && source .venv/bin/activate
pip install -r requirements-dev.txt
pre-commit install

romansorin/pdf-to-text
