Awesome OCR

This list contains links to great software tools, libraries, and literature related to Optical Character Recognition (OCR).

Contributions are welcome, as is feedback.

Software
- OCR engines
- Older and possibly abandoned OCR engines
- OCR file formats
  - hOCR
  - ALTO XML
  - TEI
- OCR CLI
- OCR GUI
- OCR Preprocessing
- OCR as a Service
- OCR evaluation
- OCR libraries by programming language
  - Go
  - Java
  - .Net
  - Object Pascal
  - PHP
  - Python
  - Javascript
  - Ruby
  - Rust
  - R
- OCR training tools
Datasets
- Ground Truth
Literature

Software

OCR engines

tesseract - The definitive Open Source OCR engine Apache 2.0
ocropus - OCR engine based on LSTM, Apache 2.0
ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
kraken - Ocropus fork with sane defaults
Ocrad - The GNU OCR. GPL
digit - OCR for numbers in meter displays, such as a power meter, using caffe
ocular - Machine-learning OCR for historic documents
SwiftOCR - fast and simple OCR library written in Swift
attention-ocr - OCR engine using visual attention mechanisms
RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
simple-ocr-opencv and its fork - A simple pythonic OCR engine using opencv and numpy
Calamari - OCR Engine based on OCRopy and Kraken

Older and possibly abandoned OCR engines

Clara OCR - Open source OCR in C GPL
Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
Eye - an experimental Java OCR (image-to-text) application
kognition - An omnifont OCR software for KDE
OCRchie - Modular Optical Character Recognition Software
ocre - o.c.r. easy
xplab - A GTK 2 tool for pattern matching
hebOCR - Hebrew character recognition library (previously named hocr, see Wikipedia article) GPL

OCR file formats

hOCR

hocr-tools - Tools for doing various useful things with hOCR files, Apache 2.0
hocr-spec - hOCR 1.1 specification
ocr-transform - CLI tool to convert between hOCR and ALTO, MIT
hocr-parser - hOCR Specification Python Parser
hOCRTools - hOCR to ALTO conversion XSLT

ALTO XML

ALTO XML Schema - XML Schema and development of the ALTO XML format
ALTO XML Documentation - Documentation and use cases for ALTO
alto-tools - Various tools to work with ALTO files, Python
AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML

TEI

TEI-OCR - TEI customization for OCR generated layout and content information
TEI SIG on Libraries - Best Practices for TEI in Libraries
GDZ - METS/TEI-based GDZ document format

OCR CLI

OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
Ocrocis - Project manager interface for Ocropy, see also external project homepage

OCR GUI

moz-hocr-editor - Firefox Addon for editing hOCR files Discontinued
qt-box-editor - Qt4 editor of tesseract-ocr box files.
ocr-gt-tools - Client-Server application for editing OCR ground truth.
Paperwork - Using scanners and OCR to grep paper documents the easy way.
Paperless - Scan, index, and archive all of your paper documents.
gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
VietOCR - A Java/.NET GUI frontend for the Tesseract OCR engine, including jTessBoxEditor, a graphical Tesseract box data editor
PoCoTo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
PRImA PAGE Viewer - Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and hOCR.
LAREX - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
archiscribe - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in @jbaiter/archiscribe-corpus.

OCR Preprocessing

NoiseRemove.java in MathOCR - Java implementation of "Adaptive degraded document image binarization" by B. Gatos, I. Pratikakis, S.J. Perantonis
binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
typeface-corpus - A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities.
binarizewolfjolion - Comparison of binarization algorithms. Blog post
crop_morphology.py in oldnyc - Cropping a page to just the text block
Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
Fred's ImageMagick script textcleaner - Processes a scanned document of text to clean the text background
localcontrast - Fast O(1) local contrast optimization

OCR as a Service

Open OCR - Run Tesseract in Docker containers
tesseract-web-service - An implementation of RESTful web service for tesseract-OCR using tornado.
docker-ocropy - A Docker container for running the ocropy OCR system.
ABBYY Cloud OCR SDK Code samples - Code samples for using the proprietary commercial ABBYY OCR API.
nidaba - An expandable and scalable OCR pipeline
gamera - A meta-framework for building document processing applications, e.g. OCR
ocr-tools - Project to provide CLI and web service interfaces to common OCR engines
ocrad-docker - Run the ocrad OCR engine in a docker container
kraken-docker - Run the kraken OCR engine in a docker container
ocr.space - Free Online OCR and OCR API by @a9t9 based on Tesseract (code is not open)

OCR evaluation

ISRI OCR Evaluation Tools with a User Guide from 1996
- isri-ocr-evaluation-tools - further development by @eddieantonio (2015, 2016)
- ancientgreekocr-evaluation-tools - further development by @nickjwhite (2013, 2014)
ocrevalUAtion - Cross-format evaluation, CLI and GUI
ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
quack - Quality-Assurance-tool for scans with corresponding ALTO-files

OCR libraries by programming language

Go

gosseract - Golang OCR library, wrapping Tesseract-ocr.

Java

Tess4J - Java Native Access bindings to Tesseract.
tess-two - Tools for compiling Tesseract on Android and Java API.

.Net

tesseract for .net - A .Net wrapper for tesseract-ocr.

Object Pascal

TTesseractOCR4 - Object Pascal binding for tesseract-ocr 4.x.

PHP

Tesseract OCR for PHP - Tesseract PHP bindings.

Python

pytesseract - A Python wrapper for Google Tesseract.
pyocr - A Python wrapper for Tesseract and Cuneiform.
ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
tesserocr - A Python wrapper for the tesseract-ocr API

Javascript

ocracy - Pure JavaScript LSTM RNN implementation based on ocropus
gocr.js - Javascript port (emscripten) of gocr
ocrad.js - Javascript port (emscripten) of ocrad
tesseract.js - Javascript port (emscripten) of Tesseract
node-tesseract - A simple wrapper for the Tesseract OCR package.
node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.

Ruby

rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby
ocr_space - API wrapper for the free OCR service ocr.space. Includes a CLI

Rust

tesseract.rs - Rust bindings for tesseract OCR.

R

tesseract - R bindings for tesseract OCR.

OCR training tools

glyph-miner - A system for extracting glyphs from early typeset prints
ocrodeg - Document image degradation for OCR data augmentation

Datasets

Ground Truth

archiscribe-corpus - >4,200 lines transcribed from 19th Century German prints via archiscribe CC-BY 4.0
CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for PoCoTo

Rescribe - Transcriptions of Caroline Minuscule Manuscripts PDM 1.0

CLTK - Corpora from Classical Language Toolkit PDM 1.0
DIVA-HisDB - 150 pages (PAGE-XML) of three medieval manuscripts CC-BY-NC 3.0
EarlyPrintedBooks - ~8,800 lines from several early printed books CC-BY-NC-SA 4.0
EEBO-TCP - 25,363 EEBO documents transcribed by TCP PDM 1.0
ECCO-TCP - 2,188 ECCO documents transcribed by TCP PDM 1.0
eMOP-TCP - 2,188 ECCO-TCP documents, cleaned up by eMOP PDM 1.0
Evans-TCP - 4,977 Evans documents transcribed by TCP
FDHN - Finnish Digitised Historical Newspapers, Paper, (free) registration required, Terms of Use
FROC-MSS - 4 Old French Medieval Manuscripts CC-BY 4.0
GERMANA - 764 Spanish manuscript pages, (free) registration required non-commercial use only
GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin CC-BY 4.0
imagessan - Sanskrit images & ground truth (Devanagari script)
IMPACT-BHL - 2,418 pages (PAGE-XML) from the Biodiversity Heritage Library, XML@GitHub CC-BY 3.0
IMPACT-BL - 294 pages (PAGE-XML) from the British Library, (free) registration required PDM 1.0
IMPACT-BNE - 215 pages (PAGE-XML) from the National Library of Spain, (free) registration required, XML@GitHub CC-BY-NC-SA 4.0
IMPACT-BNF - 151 pages (PAGE-XML) from the National Library of France, (free) registration required CC-BY-NC-SA 4.0
IMPACT-KB - 142 pages (PAGE-XML) from the National Library of the Netherlands CC-BY 4.0
IMPACT-NKC - 187 pages (PAGE-XML) from the Czech National Library, (free) registration required CC-BY-NC-SA 4.0
IMPACT-NLB - 19 pages (PAGE-XML) from the National Library of Bulgaria, (free) registration required CC-BY-NC-ND 4.0
IMPACT-NUK - 209 pages (PAGE-XML) from the National Library of Slovenia, (free) registration required CC-BY-NC-SA 4.0
IMPACT-PSNC - 478 pages (PAGE-XML) from four Polish digital libraries, XML@GitHub CC-BY 3.0
LascivaRoma/lexical - Transcription of 19th century lexical resources for Latin learning
MJSynth - 9 million synthetic images covering 90k English words
OCR19thSAC - 19,000 pages Swiss Alpine Club yearbooks transcribed via Text+Berg digital CC-BY 4.0
OCR-D - 180 pages (PAGE-XML) of German historical prints from OCR-D CC-BY-SA 4.0
OCR_GS_Data - Double-checked Arabic Gold Standard from OpenITI
old-books - 322 old books from Project Gutenberg GPL 3.0
PRImA-ENP - 528 pages (PAGE-XML) of historic newspapers from Europeana Newspapers, (free) registration required PDM 1.0
RODRIGO - 853 Spanish manuscript pages, (free) registration required non-commercial use only
Toebler-OCR - (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch

Literature

OCR-related publication and link lists

IMPACT: Tools for text digitisation - List of software tools and projects for text digitisation, some related to OCR
OCR-D - List of OCR-related academic articles in the context of the OCR-D project. 🇩🇪
Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
Wikipedia: Comparison of optical character recognition software
OCR [and Deep Learning] by @handong1587
Ocropus Wiki: Publications

Blog Posts and Tutorials

Tesseract Blends Old and New OCR Technology (2016) @theraysmith
- Tutorial@DAS2016, Updated "What You Always Wanted to Know" slides
What You Always Wanted To Know About Tesseract (2014) @theraysmith
- Tutorial@DAS2014, includes demos
Extracting text from an image using Ocropus (2015)
Training an Ocropus OCR model (2015) @danvk
Ocropus Wiki: Compute errors and confusions (2016) @zuphilip
Ocropus Wiki: Working with Ground Truth (2016) @zuphilip
OCRopus (2016) @jze
- mostly on column separation in ocropus
10 Tips for making your OCR project succeed (2013) @cneud
- general things to consider for OCR projects
Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
- feature list for a commercial image pre-processing library; has nice before-after samples for pre-processing steps related to OCR
Extracting Text from PDFs; Doing OCR; all within R @shawngraham
- How to work with OCR from PDFs in the R programming environment
Tutorial: Command-line OCR on a Mac @bmschmidt
- Tutorial on how to run tesseract on Mac OS X
Practical Experience with OCRopus Model Training (2016) @jze
Homemade Manuscript OCR (1): OCRopy (2017) @Jean-Baptiste-Camps
- Tutorial on applying OCR to medieval manuscripts with OCRopy
Optimizing Binarization for OCRopus (2017) @jze
Prototype demo for OCR postfix in Danish Newspapers (2016) @thomasegense
How Can I OCR My Dictionary? (2016) @JessedeDoes
"Needlessly complex" blog (2016) @mzucker. Several image processing how-tos (Python based), particularly:
- Page dewarping (code)
- Compressing and enhancing hand-written notes (code)
- Unprojecting text with ellipses (code)
(Open-Source-)OCR-Workflows (2017) @wrznr 🇩🇪 overview of the state of the art in open source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), lots of example images and information on the @OCR-D project.
A gentle introduction to OCR (2018) @shgidi
Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR (2019) @eliaskreyenbuehl 🇩🇪 A reflection/criticism on OCR quality, OCR pitfalls in Fraktur fonts.

OCR Showcases

abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
cvOCR - An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
MathOCR - A printed scientific document recognition system, pre-alpha

Academic articles

mikegerber/awesome-ocr

2011 and before

2012

2013

2014

2015

2016

2017

2018