OCRServer: A Ruby repository from Transparency Toolkit

This is the software for the Transparency Toolkit OCR server. It receives data from the document upload form, OCRs the documents, and saves the results.

Install the following packages (these are Debian/Ubuntu package names):

graphicsmagick
poppler-data
poppler-utils
ghostscript
tesseract-ocr
pdftk
libreoffice
openjdk-8-jdk
openjdk-8-jre
libcurl3
libcurl3-gnutls
libcurl4-openssl-dev
libmagic-dev
unoconv
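
After installing, it can help to confirm that the main command-line tools are actually on the PATH. The following is a minimal sketch, not part of OCRServer. The mapping from command name to package is an assumption based on what these packages usually install on Debian/Ubuntu, and library-only packages (such as libmagic-dev and the libcurl packages) are not checked.

```ruby
# Hypothetical preflight helper; not part of OCRServer.
# Maps each command this sketch looks for to the package assumed to provide it.
COMMANDS = {
  "gm"        => "graphicsmagick",
  "pdftotext" => "poppler-utils",
  "gs"        => "ghostscript",
  "tesseract" => "tesseract-ocr",
  "pdftk"     => "pdftk",
  "soffice"   => "libreoffice",
  "java"      => "openjdk-8-jre",
  "unoconv"   => "unoconv"
}.freeze

# Returns true if an executable file named `cmd` exists in any directory of `path`.
def command_available?(cmd, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, cmd)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Returns the package names whose command could not be found.
def missing_packages(commands = COMMANDS, path = ENV.fetch("PATH", ""))
  commands.reject { |cmd, _pkg| command_available?(cmd, path) }.values
end

if __FILE__ == $PROGRAM_NAME
  missing = missing_packages
  if missing.empty?
    puts "All checked commands were found."
  else
    puts "Possibly missing packages: #{missing.join(', ')}"
  end
end
```

A command that is reported missing may simply be installed under a different name on your system, so treat the output as a hint rather than a verdict.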

Install the following gems:

mimemagic
docsplit
curb
ruby-filemagic
pry
mail
listen
rubyzip
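
To see which of these gems still need to be installed, a short check along the following lines may be useful. This is a sketch and not part of OCRServer. It uses the gem names listed above (with ruby-filemagic written as it appears on RubyGems) and only asks RubyGems whether a gem of that name is installed; it does not check versions.

```ruby
# Hypothetical helper; not part of OCRServer.
GEMS = %w[mimemagic docsplit curb ruby-filemagic pry mail listen rubyzip].freeze

# Returns true if RubyGems knows of at least one installed version of `name`.
def gem_installed?(name)
  !Gem::Specification.find_all_by_name(name).empty?
end

# Returns the subset of `names` that is not installed.
def missing_gems(names = GEMS)
  names.reject { |name| gem_installed?(name) }
end

if __FILE__ == $PROGRAM_NAME
  missing = missing_gems
  if missing.empty?
    puts "All listed gems are installed."
  else
    puts "Try: gem install #{missing.join(' ')}"
  end
end
```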

Install Apache Tika server by downloading the .jar from https://tika.apache.org/download.html
Start Tika by running: java -jar tika-server-1.18.jar (adjust the file name to match the version you downloaded)
Optional: Install ABBYY. This is not free software, but it produces higher-quality OCR for some file types. Images and image-style PDFs, as well as office documents that fail OCR with Tika, are processed with ABBYY when it is installed; otherwise the OCR server falls back to Tesseract. A license for the command-line version can be purchased at https://www.ocr4linux.com/en:pricing:start.
Set up and start DocUpload (https://github.com/TransparencyToolkit/DocUpload). Documents need to be uploaded through it for the rest to work.
Set the following environment variables:

OCR_IN_PATH: The path where input documents and metadata are read from
OCR_OUT_PATH: The path where output documents are written
PROJECT_INDEX: The name of the index in Elasticsearch
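
Before starting the server, a quick check of the setup steps above can save some debugging. The sketch below is not part of OCRServer. It verifies that the three environment variables are set and tries to contact Tika. It assumes Tika server is running on localhost with its default port 9998 and answers GET /version; change the host and port if you started it differently.

```ruby
require "net/http"

# Hypothetical preflight helper; not part of OCRServer.
REQUIRED_ENV = %w[OCR_IN_PATH OCR_OUT_PATH PROJECT_INDEX].freeze

# Returns the names of required variables that are unset or blank in `env`.
def missing_env_vars(env = ENV, required = REQUIRED_ENV)
  required.select { |name| env[name].to_s.strip.empty? }
end

# Returns true if an HTTP server answers GET /version on host:port with a 2xx status.
# Host and port are assumptions (Tika server's default port is 9998).
def tika_reachable?(host = "localhost", port = 9998)
  response = Net::HTTP.start(host, port, open_timeout: 2, read_timeout: 2) do |http|
    http.get("/version")
  end
  response.is_a?(Net::HTTPSuccess)
rescue SystemCallError, SocketError, Timeout::Error, IOError
  # Connection refused, unknown host, timeout, or dropped connection.
  false
end

if __FILE__ == $PROGRAM_NAME
  missing = missing_env_vars
  puts "Unset environment variables: #{missing.join(', ')}" unless missing.empty?
  puts "Tika does not appear to be running on localhost:9998" unless tika_reachable?
  puts "Preflight checks passed." if missing.empty? && tika_reachable?
end
```

Passing a plain Hash instead of ENV, for example missing_env_vars({ "OCR_IN_PATH" => "/data/in" }), makes the check easy to try without touching the real environment; the path shown is only a placeholder.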

From the OCRServer directory, run: ruby run_ocr.rb
The server then listens for new documents in OCR_IN_PATH, so it must be started BEFORE documents are uploaded.

Note: To run on a directory that already contains documents, set inotify_works = false in input_output/load_files.rb.
