/tesseractMICR

Ready-to-use magnetic ink character recognition (MICR E-13B & CMC-7) datasets and *.traineddata files for Tesseract v4, plus an evaluation app.

Primary language: C++. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

!!New!!
An SDK to detect and recognize MICR lines has been released at https://github.com/DoubangoTelecom/ultimateMICR-SDK


Table of Contents

  1. The dataset
  2. The models
  3. The recognizer app
  4. The accuracy
  5. Getting help

The dataset

The dataset contains more than 11 thousand real-life images (.tif) with ground truth (.gt.txt), augmented with a small amount of synthetic data.

The dataset is ready to be used for training with Tesseract v4.
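Before training, it can be useful to check that every image has its ground truth. The sketch below is not part of this repo. It assumes that each `name.tif` sits next to a `name.gt.txt` in the same folder, which is the usual Tesseract training convention but is not confirmed here. The function name `pair_dataset` is hypothetical.

```python
from pathlib import Path


def pair_dataset(folder):
    """Pair each .tif image with its .gt.txt ground truth.

    Assumption: 'name.tif' and 'name.gt.txt' share a folder and a base name.
    Returns (pairs, orphans): pairs is a list of (image_path, text),
    orphans is a list of images that have no ground-truth file.
    """
    pairs, orphans = [], []
    for image in sorted(Path(folder).rglob("*.tif")):
        gt = image.with_name(image.stem + ".gt.txt")
        if gt.is_file():
            # Ground truth is a single text line; strip the trailing newline.
            pairs.append((image, gt.read_text(encoding="utf-8").strip()))
        else:
            orphans.append(image)
    return pairs, orphans
```

Call `pair_dataset("path/to/dataset")` and inspect the second return value: a non-empty list of orphans means some images would be skipped or rejected during training.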

The models

If you're lazy and don't want to train the model yourself, try the ones in the tessdata_best (float model) or tessdata_fast (int model) folders.

The recognizer app

When you develop an OCR app with Tesseract and get low accuracy, it is often hard to tell whether the problem lies in the model (traineddata) or in the image pre-processing. Of course, you can dump the pre-processed image to check that it is correctly binarized, but this takes time if you want to compute an accuracy score on thousands of images. To make your life easier, this repo contains a command-line application for Windows to test the accuracy.

This app is very easy to use:

  1. add your images in tesseractMICR/apps/images
  2. run tesseractMICR/apps/tesseract_recognizer.bat
  3. the predictions will be in tesseractMICR/apps/ocr.txt

This app will:

  1. detect MICR E-13B lines from anywhere on the image
  2. extract the lines, de-skew and de-slant them
  3. binarize the lines
  4. use Tesseract for recognition
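The binarization method used by the app is not documented here. As an illustration of what the binarization step does, here is a minimal sketch of Otsu's method, a common global thresholding technique. It is a swapped-in example and not necessarily what the app implements. It works on a flat list of 8-bit grayscale values, and the names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the threshold (0..255) that maximizes between-class variance.

    pixels: flat sequence of integer grayscale values in 0..255.
    Values <= threshold form one class, values > threshold the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # dark class still empty
        if w1 == 0:
            break     # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mean0 - mean1) ** 2  # unnormalized between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark) or 255 (bright) using the threshold."""
    return [0 if p <= threshold else 255 for p in pixels]
```

A global threshold like this is only a baseline. Unevenly lit check images usually need a local (adaptive) method, which is one reason to evaluate the pre-processing and the model separately.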

You can edit tesseractMICR/apps/tesseract_recognizer.bat to change the path to the images or tessdata folders.

REM Usage: tesseract_recognizer.exe path_to_images_folder path_to_tessdata_folder
REM path_to_images_folder -> relative or absolute path to folder containing the images to process
REM path_to_tessdata_folder -> relative or absolute path to folder containing *.traineddata files
REM example: tesseract_recognizer.exe ./images ../tessdata_fast
REM another example: tesseract_recognizer.exe ./images ../tessdata_best

tesseract_recognizer.exe ./images ../tessdata_fast

The charset used in tesseractMICR/apps/ocr.txt is:

[Image: E-13B charset]

This application is GPGPU-accelerated using OpenCL. Make sure your GPU drivers are up to date.

The accuracy

This was developed as an internal R&D project and never went to production, as we ended up using TensorFlow.

Even as a PoC (proof of concept), it's already more accurate than all the commercial products we've tested: LEADTOOLS, Accusoft, Recogniform and ABBYY. The repo contains a command-line application to compare the accuracy (see above).
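How the accuracy was scored is not stated here. One common way to score OCR output is character-level accuracy based on edit distance. The sketch below assumes that metric and takes plain (ground truth, prediction) string pairs, since the exact layout of ocr.txt is not described in this README. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(pairs):
    """Return 1 - (total edit errors / total ground-truth characters).

    pairs: iterable of (ground_truth, prediction) strings.
    The result is clamped at 0.0 and is 0.0 for an empty ground truth.
    """
    errors = total = 0
    for truth, predicted in pairs:
        errors += edit_distance(truth, predicted)
        total += len(truth)
    if total == 0:
        return 0.0
    return max(0.0, 1.0 - errors / total)
```

For MICR lines, a stricter line-level score (the fraction of lines recognized with zero errors) is often more meaningful, because a single wrong digit invalidates a routing or account number.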

You can check our state-of-the-art implementation based on TensorFlow at https://www.doubango.org/webapps/micr/

Getting help

To get help, please check our discussion group or Twitter account.