Bernoulli trial generator tool for OCR result validation
This repo is part of the Aktienführer-Datenarchiv DFG project. The DFG recommends the Bernoulli trial to validate OCR results. To reduce the effort of performing the test, a "Bernoulli Trial HTML Generator" was designed. The generator works with Abbyy XML files (*.xml) or hOCR files (*.hocr) and their corresponding JPG images (*.jpg).
The installation has been tested on Ubuntu and is expected to work in similar environments.
- Python 2.7
$ git clone https://github.com/UB-Mannheim/BeTrial.git
$ cd BeTrial
$ virtualenv betrial_venv/
$ source betrial_venv/bin/activate
$ pip install -r requirements.txt
The whole project has four major steps:
Load the files from the web ("filegetter.py").
$ python ./filegetter.py (+ parameters)
Create a set of files for the Bernoulli-Trials ("betrialgen.py")
$ python ./betrialgen.py -p /input/dir/*.xml or *.hocr (+ parameters)
Create an interactive Bernoulli-Trial html ("betrial.py").
$ python ./betrial.py /betrial/input/dir/*.png (+ parameters)
The betrial_eval html page helps to evaluate our results.
> firefox betrial_eval.html
Creating a dataset
$ python ./betrialgen.py
Creating the html-page
$ python ./betrial.py ./test/BeTrial/input/*.png
The validation page can be opened with Firefox:
> firefox out.html
You see the images of all the text lines from the dataset, and below each line there is the recognized text.
One of the characters is marked with a red rectangle.
This character should be manually validated.
To do so, select one of the radio buttons below it.
For example, above you should check the letter "j" in the word "just", which is "Ok".
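The idea of marking one character per text line for manual validation can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code used by "betrialgen.py" or "betrial.py": the function name `pick_trial_positions`, the rule of skipping whitespace, and the optional seed are all hypothetical.

```python
import random


def pick_trial_positions(lines, seed=None):
    """Pick one random non-whitespace character position per text line.

    Hypothetical helper for illustration; it is not part of BeTrial.
    Returns a list of (line_number, character_index) tuples.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    picks = []
    for line_no, text in enumerate(lines):
        # Whitespace cannot be judged visually, so only other characters qualify.
        candidates = [i for i, ch in enumerate(text) if not ch.isspace()]
        if not candidates:
            continue  # skip empty or whitespace-only lines
        picks.append((line_no, rng.choice(candidates)))
    return picks


lines = ["this is just a test", "   ", "OCR result"]
for line_no, idx in pick_trial_positions(lines, seed=42):
    print(line_no, idx, repr(lines[line_no][idx]))
```

Each picked character then becomes one trial whose outcome is either correct or incorrect.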
The button "Counts" displays an overview of the current validation status.
The results can be stored in csv files with "Export Gesamtergebnis" (export overall result) and "Export Einzelergenisse" (export individual results).
To evaluate the results, open the betrial_eval.html file:
> firefox betrial_eval.html
This page helps to calculate the cumulative distribution function (cdf) and the number of successful events needed to prove the predicted accuracy at a given error probability.
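The two quantities can be sketched in Python as follows. This is a minimal sketch and not the implementation used in betrial_eval.html: the function names `binom_pmf`, `binom_cdf` and `min_successes` are hypothetical, and reading the required number of successes as a one-sided binomial test (the smallest k for which observing k or more correct characters out of n would have probability at most alpha if the true accuracy were only p0) is an assumption.

```python
from __future__ import print_function

from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), with 0 < p < 1.

    Computed in log space so that large n does not overflow.
    """
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1.0 - p))


def binom_cdf(k, n, p):
    """P(X <= k): probability of at most k successes in n trials."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


def min_successes(n, p0, alpha):
    """Smallest k with P(X >= k | accuracy p0) <= alpha, or None.

    Assumed interpretation: if at least k of n sampled characters are
    correct, an accuracy of only p0 is rejected at error probability alpha.
    """
    tail = 0.0  # running value of P(X >= k)
    best = None
    for k in range(n, -1, -1):
        tail += binom_pmf(k, n, p0)
        if tail > alpha:
            break
        best = k
    return best


if __name__ == "__main__":
    n, p0, alpha = 500, 0.98, 0.05  # illustrative values only
    print("P(X <= 490) =", binom_cdf(490, n, p0))
    print("required successes:", min_successes(n, p0, alpha))
```

`min_successes` returns None when even n out of n successes would not be convincing, which indicates that the sample is too small for the chosen accuracy and error probability.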
Copyright (c) 2019 Universitätsbibliothek Mannheim
Author:
BeTrial is Free Software. You may use it under the terms of the Apache 2.0 License. See LICENSE for details.
The tools depend on some third-party libraries:
- ocropy is a collection of document analysis programs. One of them is ocropus-gtedit, which forms the basis of the "betrial.py" source code. ocropus-gtedit produces an editable html page, where you can see the images of all the text lines and below each line the recognized text. The recognized text can be updated to produce ground truth data.
- Export2CSV exports the data to csv.
- Calculating Binom: the calculations in the evaluation page are based on the implementation by Terry Ritter.