Training workflow for Tesseract 4, implemented as a Makefile that tracks dependencies and builds the required software from source.
You will need a recent version (>= 4.0.0beta1) of tesseract built with the training tools and matching leptonica bindings. Build instructions and more can be found in the Tesseract project wiki.
Alternatively, you can build leptonica and tesseract within this project and install them to a subdirectory ./usr in the repo:
make leptonica tesseract
Tesseract will be built from the git repository, which requires CMake, autotools (including autoconf-archive) and some additional libraries for the training tools. See the installation notes in the tesseract repository.
Place ground truth consisting of line images and transcriptions in the folder data/ground-truth. This list of files will be split into training and evaluation data; the ratio is defined by the RATIO_TRAIN variable.
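The split described above can be sketched as follows. This is only an illustration under stated assumptions, not necessarily how the Makefile does it: split_list is a hypothetical helper, the input list is assumed to be shuffled already, and the number of training lines is assumed to be the total multiplied by the ratio and rounded down.

```shell
#!/bin/sh
# split_list LIST RATIO TRAIN_OUT EVAL_OUT
# Hypothetical helper (not part of this repository): writes the first
# floor(N * RATIO) lines of LIST to TRAIN_OUT and the rest to EVAL_OUT.
split_list() {
    list=$1
    ratio=$2
    train_out=$3
    eval_out=$4

    # Count lines, including a last line without a trailing newline.
    total=$(grep -c '' "$list")

    # awk does the fractional arithmetic; int() truncates toward zero.
    n_train=$(awk -v t="$total" -v r="$ratio" 'BEGIN { print int(t * r) }')

    head -n "$n_train" "$list" > "$train_out"
    tail -n "+$((n_train + 1))" "$list" > "$eval_out"
}
```

With 10 entries and a ratio of 0.90, this puts 9 entries into the training list and 1 into the evaluation list.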
Images must be TIFF and have the extension .tif.
Transcriptions must be single-line plain text and have the same name as the line image but with .tif replaced by .gt.txt.
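Before training, it may help to check that the ground truth follows these naming rules. The function below is a minimal sketch: check_ground_truth is a hypothetical helper that is not part of this repository, and it only checks that each .tif has a matching .gt.txt containing at most one line.

```shell
#!/bin/sh
# check_ground_truth DIR
# Hypothetical helper (not part of this repository): reports every line
# image in DIR without a single-line transcription and returns non-zero
# if any problem was found.
check_ground_truth() {
    dir=$1
    problems=0
    for img in "$dir"/*.tif; do
        # If the glob matched nothing, the pattern itself is returned.
        [ -e "$img" ] || continue

        # Same name as the image, with .tif replaced by .gt.txt.
        gt="${img%.tif}.gt.txt"

        if [ ! -f "$gt" ]; then
            echo "missing transcription: $gt"
            problems=1
        elif [ "$(grep -c '' "$gt")" -gt 1 ]; then
            echo "more than one line: $gt"
            problems=1
        fi
    done
    return "$problems"
}
```

It could be run as check_ground_truth data/ground-truth, assuming the default ground truth directory.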
The repository contains a ZIP archive with sample ground truth, see ocrd-testset.zip. Extract it to ./data/ground-truth and run make training.
NOTE: If you want to generate line images for transcription from a full page, see tips in issue 7 and in particular @Shreeshrii's shell script.
To train a model, run
make training MODEL_NAME=name-of-the-resulting-model
which is basically a shortcut for
make unicharset lists proto-model training
Run make help to see all the possible targets and variables:
Targets
    unicharset         Create unicharset
    lists              Create lists of lstmf filenames for training and eval
    training           Start training
    proto-model        Build the proto model
    leptonica          Build leptonica
    tesseract          Build tesseract
    tesseract-langs    Download tesseract-langs
    clean              Clean all generated files
Variables
    MODEL_NAME         Name of the model to be built. Default: foo
    START_MODEL        Name of the model to continue from. Default: ''
    PROTO_MODEL        Name of the proto model. Default: 'data/foo/foo.traineddata'
    CORES              No of cores to use for compiling leptonica/tesseract. Default: 4
    LEPTONICA_VERSION  Leptonica version. Default: 1.75.3
    TESSERACT_VERSION  Tesseract commit. Default: fd492062d08a2f55001a639f2015b8524c7e9ad4
    TESSDATA_REPO      Tesseract model repo to use. Default: _fast
    GROUND_TRUTH_DIR   Ground truth directory. Default: data/ground-truth
    NORM_MODE          Normalization Mode - see src/training/language_specific.sh for details. Default: 2
    PSM                Page segmentation mode. Default: 6
    RATIO_TRAIN        Ratio of train / eval training data. Default: 0.90
Software is provided under the terms of the Apache 2.0 license.
Sample training data provided by Deutsches Textarchiv is in the public domain.