tesseract-ocr/tessdata

ground truth for enm.traineddata

whisere opened this issue · 7 comments

Hello, I wonder how the ground truth for training enm.traineddata was created. Is it available for download?

Also, which font was used to produce the long s in historical text? And which datasets/texts (containing ſ etc.) were used to generate the ground truth? Were they generated automatically? Thanks.

See https://github.com/tesseract-ocr/langdata_lstm/tree/main/enm. That's the only information which is available.

Maybe you can try one of our models from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/ (see also https://github.com/tesseract-ocr/tesstrain/wiki). I expect that they work well for historical English texts, too, maybe even better than the standard models enm or script/Fraktur.

Many thanks, @stweil! I will check those models. I wonder if there are any links or contacts at Google where I can enquire about the ground truth data used for training on historical English texts? Thank you!

I tried GT4HistOCR, but it says:
Failed to load any lstm-specific dictionaries for lang GT4HistOCR!!
Is there a GT4HistOCR.traineddata that includes a dictionary? Thanks.

I guess we will need to fine-tune the model to include a dictionary? We can't just use it like eng.traineddata (which already includes a dictionary).

When a new model is trained, it never includes a dictionary. Older versions of Tesseract therefore show a message (which is only a hint, not an error). Either use such models without a dictionary (which usually works quite well), or add a dictionary yourself.
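
For anyone who wants to try the "add a dictionary yourself" route, here is a minimal sketch in Python. It assumes the Tesseract training tools combine_tessdata and wordlist2dawg are installed and on the PATH, that the model file sits in the working directory, and that you have your own one-word-per-line word list (words.txt is a hypothetical file name). The helper names dictionary_commands and add_dictionary are made up for illustration. It only builds a word dictionary and is not necessarily how the standard models were packaged.

```python
from pathlib import Path
import subprocess


def dictionary_commands(model, wordlist, workdir="."):
    """Build the three tool invocations that add a word dictionary to a model.

    model    -- model name without extension, e.g. "GT4HistOCR"
    wordlist -- path to a plain-text word list, one word per line (assumed)
    workdir  -- directory that contains <model>.traineddata
    """
    work = Path(workdir)
    traineddata = str(work / f"{model}.traineddata")
    prefix = str(work / model) + "."           # unpacked components share this prefix
    unicharset = prefix + "lstm-unicharset"    # character set of the LSTM model
    word_dawg = prefix + "lstm-word-dawg"      # dictionary component to create
    return [
        # 1. Unpack all components of the traineddata file.
        ["combine_tessdata", "-u", traineddata, prefix],
        # 2. Compile the word list into a DAWG using the model's unicharset.
        ["wordlist2dawg", wordlist, word_dawg, unicharset],
        # 3. Write the new dictionary component back into the traineddata file.
        ["combine_tessdata", "-o", traineddata, word_dawg],
    ]


def add_dictionary(model, wordlist, workdir="."):
    """Run the commands in order; stop at the first one that fails."""
    for cmd in dictionary_commands(model, wordlist, workdir):
        subprocess.run(cmd, check=True)
```

Calling add_dictionary("GT4HistOCR", "words.txt") modifies GT4HistOCR.traineddata in place, so keep a backup of the original file. Words containing characters that are missing from the model's unicharset will probably be skipped by wordlist2dawg.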