ocrevalUAtion

This set of classes provides basic support for comparing two text files: a reference file (a ground-truth document) and the output of an OCR engine (a text file).

Options for specific behavior include ignoring case, diacritics, punctuation, and stop-words, as well as Unicode and user-defined equivalences between characters.
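To illustrate what such options do to the texts before they are compared, here is a minimal Python sketch. It is not the tool's own code: the function name `normalize` and its parameters are hypothetical, and the rules it applies (lowercasing, stripping combining marks, dropping Unicode punctuation) are assumptions about how such options are commonly implemented.

```python
import unicodedata


def normalize(text, ignore_case=False, ignore_diacritics=False,
              ignore_punctuation=False):
    """Return a copy of text with the selected distinctions removed.

    Hypothetical helper for illustration only; ocrevalUAtion's own
    normalization rules may differ.
    """
    if ignore_case:
        text = text.lower()
    if ignore_diacritics:
        # Decompose characters, drop the combining marks, then recompose.
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(ch for ch in decomposed
                           if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", stripped)
    if ignore_punctuation:
        # Unicode general categories starting with "P" are punctuation.
        text = "".join(ch for ch in text
                       if not unicodedata.category(ch).startswith("P"))
    return text
```

Applying the same normalization to both the reference and the OCR output means that differences in the ignored features no longer count as errors.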

It can be used through the provided graphical user interface (GUI) or from the command line.

Supported input formats include: plain text, FineReader 10 XML, PAGE XML, ALTO XML and hOCR HTML.

The output is a report with statistics (including the character error rate, CER, and the word error rate, WER) and a table showing the input texts in parallel, with the differences highlighted.
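For readers unfamiliar with these rates, the following Python sketch shows one common way to compute them: the edit (Levenshtein) distance between reference and OCR output, divided by the length of the reference, measured in characters for CER and in whitespace-separated words for WER. This is an assumption about the usual definition, not a description of how ocrevalUAtion computes its figures; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, ocr_output):
    """Character error rate: units are single characters."""
    return error_rate(list(reference), list(ocr_output))


def wer(reference, ocr_output):
    """Word error rate: units are whitespace-separated words."""
    return error_rate(reference.split(), ocr_output.split())
```

Under this definition, a rate can exceed 1.0 when the OCR output contains many more units than the reference.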

A gentle introduction to OCR evaluation and to this tool can be found at https://sites.google.com/site/textdigitisation/

You can download the latest release from here.

Instructions on how to use ocrevalUAtion can be found in the wiki.
