Tesseract training "in-a-box". Just upload some fonts and run it!
- Put fonts (TTF only supported currently) into `/opt/ocrbox/fonts`
- Run `bin/train` from the `/opt/ocrbox` directory
- The new language file will be installed to `/opt/tessdata` and also left in `/opt/ocrbox`
- Use `bin/clean` to reset everything (recommended when changing the training set)
bin/train does the following:
- Reads the list of fonts
- Runs `text2image` on each to generate tif/box files
- Trains Tesseract on each tif/box pair
- Generates the unicharset file for all the boxes
- Runs the actual training
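
As a rough illustration of the per-font steps above, here is a minimal Python sketch that only builds the command lines and does not run them. It is not the contents of `bin/train`. The function name `plan_commands`, the `training_text.txt` file name, and the `lang.font.exp0` output naming are assumptions made for this example; the final training step is left out.

```python
def plan_commands(font_names, lang="eng", fonts_dir="/opt/ocrbox/fonts",
                  text_file="training_text.txt"):
    """Return argv lists for the per-font steps, without executing anything.

    Hypothetical helper: text_file and the base naming scheme are assumed.
    """
    commands = []
    box_files = []
    for font in font_names:
        # Assumed naming convention: <lang>.<font>.exp0
        base = f"{lang}.{font}.exp0"
        # Render the training text with this font into <base>.tif and <base>.box
        commands.append([
            "text2image",
            f"--text={text_file}",
            f"--outputbase={base}",
            f"--font={font}",
            f"--fonts_dir={fonts_dir}",
        ])
        # Train Tesseract on the tif/box pair
        commands.append(["tesseract", f"{base}.tif", base, "box.train"])
        box_files.append(f"{base}.box")
    if box_files:
        # Build one unicharset from all the box files
        commands.append(["unicharset_extractor", *box_files])
    return commands


if __name__ == "__main__":
    for argv in plan_commands(["DejaVuSans-Bold"]):
        print(" ".join(argv))
```

Returning the commands as lists keeps the sketch safe to run on a machine without Tesseract installed; each list could be passed to `subprocess.run` if you wanted to execute it.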
The `bin/train` script defaults to `eng` as the language; you can change this by editing the variable at the top of the file.
Most fonts seem to follow the naming format FontFamilyName-VariantBits, but some do not. We use a proper TTF library to extract the details, but the name cannot contain spaces. If any of your fonts have spaces in their names, run `python files/fix_fonts.py` before training.
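
To make the naming constraint concrete, here is a small sketch of how a space-free name in the FontFamilyName-VariantBits shape could be derived from a family and variant. This is not what `files/fix_fonts.py` does (its behaviour is not described here); the function `spaceless_font_name` and its inputs are hypothetical.

```python
def spaceless_font_name(family, variant=""):
    """Join a family and variant into a name without any whitespace.

    Hypothetical helper for illustration only.
    """
    # str.split() with no arguments drops all runs of whitespace
    family_part = "".join(family.split())
    variant_part = "".join(variant.split())
    if variant_part:
        return f"{family_part}-{variant_part}"
    return family_part


if __name__ == "__main__":
    print(spaceless_font_name("DejaVu Sans", "Bold Oblique"))
```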