CIS OCR-D command line tools for the automatic post-correction of OCR-results.
ocrd_cis
contains different tools for the automatic post correction
of OCR-results. It contains tools for the training, evaluation and
execution of the post correction. Most of the tools are following the
OCR-D cli conventions.
There is a helper tool to align multiple OCR results as well as a version of ocropy that works with python3.
There are multiple ways to install the ocrd_cis
tools:
make install
usespip
to installocrd_cis
(see below).make install-devel
usespip -e
to installocrd_cis
(see below).pip install --upgrade pip ocrd_cis_dir
pip install -e --upgrade pip ocrd_cis_dir
It is possible to install ocrd_cis
in a custom directory using
virtualenv
:
python3 -m venv venv-dir
source venv-dir/bin/activate
make install # or any other command to install ocrd_cis (see above)
# use ocrd_cis
deactivate
Most tools follow the OCR-D cli
conventions. They accept the
--input-file-grp
, --output-file-grp
, --parameter
, --mets
,
--log-level
command line arguments (short and long). For some tools
(most notably the alignment tool) expect a comma seperated list of
multiple input file groups.
The ocrd-tool.json contains a schema
description of the parameter config file for the different tools that
accept the --parameter
argument.
This bash script runs the post correction using a pre-trained model. If additional support OCRs should be used, models for these OCR steps are required and must be configured in an according configuration file (see ocrd-tool.json).
Arguments:
--parameter
path to configuration file--input-file-grp
name of the master-OCR file group--output-file-grp
name of the post-correction file group--log-level
set log level--mets
path to METS file in workspace
Aligns tokens of multiple input file groups to one output file group. This tool is used to align the master OCR with any additional support OCRs. It accepts a comma-separated list of input file groups, which it aligns in order.
Arguments:
--parameter
path to configuration file--input-file-grp
comma seperated list of the input file groups; first input file group is the master OCR--output-file-grp
name of the file group for the aligned result--log-level
set log level--mets
path to METS file in workspace
Script to train a model from a list of ground-truth archives (see ocrd-tool.json) for the post correction. The tool somewhat mimics the behaviour of other ocrd tools:
--mets
for the workspace--log-level
is passed to other tools--parameter
is used as configuration--output-file-grp
defines the output file group for the model
Helper tool to get the path of the installed data files. Usage:
ocrd-cis-data [-jar|-3gs]
to get the path of the jar library or the
path to th default 3-grams language model file.
Helper tool to calculate the word error rate aligned ocr files. It writes a simple JSON-formated stats file to the given output file group.
Arguments:
--input-file-grp
input file group of aligned ocr results with their respective ground truth.--output-file-grp
name of the file group for the stats file--log-level
set log level--mets
path to METS file in workspace
Run the profiler over the given files of the according the given input file grp and adds a gzipped JSON-formatted profile to the output file group of the workspace. This tools requires an installed language profiler.
Arguments:
--parameter
path to configuration file--input-file-grp
name of the input file group to profile--output-file-grp
name of the output file group where the profile is stored--log-level
set log level--mets
path to METS file in the workspace
The ocropy-train tool can be used to train LSTM models. It takes ground truth from the workspace and saves (image+text) snippets from the corresponding pages. Then a model is trained on all snippets for 1 million (or the given number of) randomized iterations from the parameter file.
ocrd-cis-ocropy-train \
--input-file-grp OCR-D-GT-SEG-LINE \
--mets mets.xml
--parameter file:///path/to/config.json
The ocropy-clip tool can be used to remove intrusions of neighbouring segments in regions / lines of a workspace. It runs a (ad-hoc binarization and) connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to white. It references the resulting segment image files in the output PAGE (as AlternativeImage).
ocrd-cis-ocropy-clip \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-CLIP \
--mets mets.xml
--parameter file:///path/to/config.json
The ocropy-resegment tool can be used to remove overlap between lines of a workspace. It runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
ocrd-cis-ocropy-resegment \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-RES \
--mets mets.xml
--parameter file:///path/to/config.json
The ocropy-segment tool can be used to segment regions into lines. It runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and adds a TextLine element with the resulting polygon outline to the annotation of the output PAGE.
ocrd-cis-ocropy-segment \
--input-file-grp OCR-D-SEG-BLOCK \
--output-file-grp OCR-D-SEG-LINE \
--mets mets.xml
--parameter file:///path/to/config.json
The ocropy-deskew tool can be used to deskew pages / regions of a workspace. It runs the Ocropy thresholding and deskewing estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
ocrd-cis-ocropy-deskew \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-DES \
--mets mets.xml
--parameter file:///path/to/config.json
The ocropy-denoise tool can be used to despeckle pages / regions / lines of a workspace. It runs the Ocropy "nlbin" denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).
ocrd-cis-ocropy-denoise \
--input-file-grp OCR-D-SEG-LINE-DES \
--output-file-grp OCR-D-SEG-LINE-DEN \
--mets mets.xml
--parameter file:///path/to/config.json
The ocropy-binarize tool can be used to binarize, denoise and deskew pages / regions / lines of a workspace. It runs the Ocropy "nlbin" adaptive thresholding, deskewing estimation and denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
ocrd-cis-ocropy-binarize \
--input-file-grp OCR-D-SEG-LINE-DES \
--output-file-grp OCR-D-SEG-LINE-BIN \
--mets mets.xml
--parameter file:///path/to/config.json
The ocropy-dewarp tool can be used to dewarp text lines of a workspace. It runs the Ocropy baseline estimation and dewarping on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
ocrd-cis-ocropy-dewarp \
--input-file-grp OCR-D-SEG-LINE-BIN \
--output-file-grp OCR-D-SEG-LINE-DEW \
--mets mets.xml
--parameter file:///path/to/config.json
The ocropy-recognize tool can be used to recognize lines / words / glyphs from pages of a workspace. It runs the Ocropy optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
ocrd-cis-ocropy-recognize \
--input-file-grp OCR-D-SEG-LINE-DEW \
--output-file-grp OCR-D-OCR-OCRO \
--mets mets.xml
--parameter file:///path/to/config.json
Install essential system packages for Tesserocr
sudo apt-get install python3-tk \
tesseract-ocr libtesseract-dev libleptonica-dev \
libimage-exiftool-perl libxml2-utils
Then install Tesserocr from: https://github.com/OCR-D/ocrd_tesserocr
pip install -r requirements.txt
pip install .
Download and move tesseract models from: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files or use your own models and place them into: /usr/share/tesseract-ocr/4.00/tessdata
A decent pipeline might look like this:
- page-level cropping
- page-level binarization
- page-level deskewing
- page-level dewarping
- region segmentation
- region-level clipping
- region-level deskewing
- line segmentation
- line-level clipping or resegmentation
- line-level dewarping
- line-level recognition
- line-level alignment
If GT is used, steps 1, 5 and 8 can be omitted. Else if a segmentation is used in 5 and 8 which does not produce overlapping sections, steps 6 and 9 can be omitted.
To run a few basic tests type make test
(ocrd_cis
has to be
installed in order to run any tests).
- Create a new (empty) workspace:
ocrd workspace init workspace-dir
- cd into
workspace-dir
- Add new file to workspace:
ocrd workspace add file -G group -i id -m mimetype