hampelm/irs-990-ocr

Tools for processing scanned 990s

Thoughts / approaches to processing 990s

Included in this toybox:

Several sample PDFs!
A set of commands to convert PDFs to plain text!
A compiled version of pdf-splitter for Mac!

You'll need:

pdftk
ImageMagick and Tesseract from Homebrew (make sure to run brew update first):
- brew install imagemagick
- brew install tesseract
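
Before starting, it can help to check that the tools are actually on your PATH. The snippet below is a minimal sketch; `missing_tools` is a made-up helper name, not something shipped in this repo. It assumes ImageMagick provides the `convert` command, which is the one used in the steps that follow.

```shell
# Hypothetical helper: print a line for each named tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    command -v "$tool" >/dev/null 2>&1 || printf 'missing: %s\n' "$tool"
  done
}
```

Running `missing_tools pdftk convert tesseract` prints nothing if everything is installed.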

OCR Ahoy!

Get the first page of a PDF:
pdftk pdfs/52-6078041_990PF_200706.pdf shuffle 1 output 52-607.pdf

Turn that first page into an image (img/1.jpg):
./pdf-splitter 52-607.pdf 'img/%.d.jpg' 1200px

Get a section of the page to process (the crop geometry is WIDTHxHEIGHT+X+Y, in pixels):
convert img/1.jpg -crop 490x20+185+225 img/1.crop.jpg

OCR the image (you'll get a file named 1.txt):
tesseract img/1.crop.jpg 1
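
The four steps above can be chained into one function. This is a rough sketch, not part of the toybox: `run`, `ocr_first_page_field`, the `DRY_RUN` switch, and the intermediate name `first-page.pdf` are all made up for illustration, and the `mkdir -p img` line assumes pdf-splitter will not create the output directory itself. The tool arguments are copied from the commands above.

```shell
# Hypothetical wrapper: with DRY_RUN=1, print each command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: OCR one cropped region of the first page of a PDF.
#   $1  input PDF
#   $2  crop geometry, WIDTHxHEIGHT+X+Y (defaults to the example above)
#   $3  tesseract output base name (tesseract writes <base>.txt)
ocr_first_page_field() {
  pdf=$1
  crop=${2:-490x20+185+225}
  out=${3:-1}

  run mkdir -p img &&                                      # assumption: img/ may not exist yet
  run pdftk "$pdf" shuffle 1 output first-page.pdf &&      # first page only
  run ./pdf-splitter first-page.pdf 'img/%.d.jpg' 1200px && # page to image
  run convert img/1.jpg -crop "$crop" img/1.crop.jpg &&    # crop the region
  run tesseract img/1.crop.jpg "$out"                      # OCR the crop
}
```

Set DRY_RUN=1 and call `ocr_first_page_field pdfs/52-6078041_990PF_200706.pdf` to see the commands without running them. Pass a different geometry as the second argument to pull another field off the same page.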