OCR

Java implementation of Optical Character Recognition

How It Works

At the character level, the core concept is image matching with automatic correction for position and aspect ratio, using a least-squares error matching algorithm.
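
The matching idea can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `LeastSquaresMatcher`, its method names, the nearest-neighbor resampling, and the use of `double[][]` grayscale arrays are all assumptions made for this example. Each glyph is assumed to be already cropped to its bounding box (which handles position); resampling both the glyph and each template to the same square grid then compensates for aspect ratio before the squared error is computed.

```java
import java.util.Map;

class LeastSquaresMatcher {

    /** Nearest-neighbor resample of a grayscale glyph to rows x cols (hypothetical helper). */
    static double[][] resample(double[][] src, int rows, int cols) {
        double[][] out = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            int sr = r * src.length / rows;            // source row for this output row
            for (int c = 0; c < cols; c++) {
                int sc = c * src[0].length / cols;     // source column for this output column
                out[r][c] = src[sr][sc];
            }
        }
        return out;
    }

    /** Mean of the squared pixel differences between two equally sized images. */
    static double meanSquaredError(double[][] a, double[][] b) {
        double sum = 0;
        for (int r = 0; r < a.length; r++) {
            for (int c = 0; c < a[0].length; c++) {
                double d = a[r][c] - b[r][c];
                sum += d * d;
            }
        }
        return sum / (a.length * a[0].length);
    }

    /** Returns the template character with the smallest error, or '?' if there are no templates. */
    static char bestMatch(double[][] glyph, Map<Character, double[][]> templates, int size) {
        double[][] g = resample(glyph, size, size);
        char best = '?';
        double bestError = Double.POSITIVE_INFINITY;
        for (Map.Entry<Character, double[][]> e : templates.entrySet()) {
            double error = meanSquaredError(g, resample(e.getValue(), size, size));
            if (error < bestError) {
                bestError = error;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With a vertical-bar template stored under `'I'` and a horizontal-bar template stored under `'-'`, a vertical bar that is twice as tall as the template still resolves to `'I'`, because both images are normalized to the same grid before comparison.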

Phases

Training Phase

1. Print the characters that the OCR engine is expected to recognize
2. Scan those characters into an image
3. Crop the image so that it includes only the training characters
4. Tell the OCR engine to use the resulting training image, and specify which characters the image contains
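
The last training step pairs a cropped training image with the characters it contains. The sketch below shows one way this pairing could work; it is not the engine's real API. The class `TrainingImage`, the method `label`, the fixed darkness threshold of 128, and the assumption that glyphs are separated by fully blank pixel columns are all hypothetical choices made for illustration.

```java
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;

class TrainingImage {

    /** True if the pixel's average channel value is below 128 (assumed threshold). */
    static boolean isDark(BufferedImage img, int x, int y) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (r + g + b) / 3 < 128;
    }

    /** True if any pixel in column x is dark. */
    static boolean columnHasInk(BufferedImage img, int x) {
        for (int y = 0; y < img.getHeight(); y++) {
            if (isDark(img, x, y)) return true;
        }
        return false;
    }

    /**
     * Splits the training image at blank columns and assigns the glyphs,
     * left to right, to the given characters.
     */
    static Map<Character, BufferedImage> label(BufferedImage training, String characters) {
        Map<Character, BufferedImage> glyphs = new LinkedHashMap<>();
        int index = 0;
        int start = -1;                                 // first column of the current glyph, or -1
        for (int x = 0; x <= training.getWidth(); x++) {
            boolean ink = x < training.getWidth() && columnHasInk(training, x);
            if (ink && start < 0) {
                start = x;                              // a glyph begins here
            } else if (!ink && start >= 0) {            // the glyph ended at column x - 1
                if (index >= characters.length()) {
                    throw new IllegalArgumentException("More glyphs than characters");
                }
                glyphs.put(characters.charAt(index++),
                        training.getSubimage(start, 0, x - start, training.getHeight()));
                start = -1;
            }
        }
        if (index != characters.length()) {
            throw new IllegalArgumentException("Fewer glyphs than characters");
        }
        return glyphs;
    }
}
```

A training image loaded with `javax.imageio.ImageIO.read` could be passed to `label` together with a string such as `"ABC"`. The method throws if the number of glyphs found does not match the number of characters specified, which catches a badly cropped scan early.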

Character Recognition

1. Load training images
2. Load the scanned image of the document to be converted to text
3. Convert the scanned image to grayscale
4. Filter the scanned image using a low-pass Finite Impulse Response (FIR) filter to remove dust
5. Break the document into lines of text, based on whitespace between the text lines
6. Break each line into characters, based on whitespace between the characters; using the average character width, determine where spaces occur within the line
7. For each character, determine the most closely matching character from the training images and append that to the output text; for each space, append a space character to the output text
8. Output the accumulated text
9. If there are any more scanned images to be converted to text, return to step 2
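
The grayscale, filtering, and segmentation steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: the class `PageSegmentation` and its methods are hypothetical, the low-pass FIR filter is assumed to be a simple 3x3 box blur, grayscale conversion is assumed to use equal channel weights, and a space is assumed wherever the gap between two characters exceeds a caller-chosen fraction of the average character width.

```java
import java.util.ArrayList;
import java.util.List;

class PageSegmentation {

    /** Gray level (0 = black, 255 = white) of a packed RGB pixel, using equal channel weights. */
    static double gray(int rgb) {
        return (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3.0;
    }

    /** 3x3 box blur: a low-pass FIR filter with equal taps, averaging only in-bounds neighbors. */
    static double[][] lowPass(double[][] img) {
        int h = img.length, w = img[0].length;
        double[][] out = new double[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                double sum = 0;
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int yy = y + dy, xx = x + dx;
                        if (yy >= 0 && yy < h && xx >= 0 && xx < w) {
                            sum += img[yy][xx];
                            n++;
                        }
                    }
                }
                out[y][x] = sum / n;
            }
        }
        return out;
    }

    /** Returns {firstRow, endRowExclusive} for each run of rows with a pixel darker than threshold. */
    static List<int[]> findLines(double[][] img, double threshold) {
        List<int[]> lines = new ArrayList<>();
        int start = -1;
        for (int y = 0; y <= img.length; y++) {
            boolean ink = false;
            if (y < img.length) {
                for (double v : img[y]) {
                    if (v < threshold) { ink = true; break; }
                }
            }
            if (ink && start < 0) {
                start = y;
            } else if (!ink && start >= 0) {
                lines.add(new int[]{start, y});
                start = -1;
            }
        }
        return lines;
    }

    /**
     * Given character column spans {start, endExclusive} in left-to-right order, returns each
     * index i where a space is assumed between character i and character i + 1.
     */
    static List<Integer> spacesAfter(List<int[]> chars, double gapFactor) {
        List<Integer> spaces = new ArrayList<>();
        if (chars.isEmpty()) return spaces;
        double average = 0;
        for (int[] c : chars) average += c[1] - c[0];
        average /= chars.size();
        for (int i = 0; i + 1 < chars.size(); i++) {
            int gap = chars.get(i + 1)[0] - chars.get(i)[1];
            if (gap > gapFactor * average) spaces.add(i);
        }
        return spaces;
    }
}
```

In this sketch an isolated dark pixel is averaged with its white neighbors and rises above the ink threshold, so it no longer registers as a line of text, while a solid band of text stays dark enough to be detected. Breaking a line into characters could reuse the `findLines` logic on columns instead of rows.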
