ground truth for enm.traineddata

Question

whisere opened this issue 3 years ago · 7 comments

Hello, I wonder how the ground truth for training enm.traineddata was created. Is it available to download?

Answer 1 · 2022-03-01T22:57:27.000Z

And which font was used to produce the long s in historical text? And what datasets/texts (containing ſ etc.) were used to generate the ground truth? Were they generated automatically? Thanks.

Answer 2 · 2022-03-12T07:26:24.000Z

See https://github.com/tesseract-ocr/langdata_lstm/tree/main/enm. That's the only information which is available.

Answer 3 · 2022-03-12T07:44:08.000Z

Maybe you can try one of our models from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/ (see also https://github.com/tesseract-ocr/tesstrain/wiki). I expect that they work well for historical English texts, too, maybe even better than the standard models enm or script/Fraktur.

Answer 4 · 2022-03-14T23:55:40.000Z

Many thanks @stweil! I will check those models. I wonder if there are any links or contacts at Google where I can enquire about the ground truth data used for training on historical English texts? Thank you!

Answer 5 · 2022-03-27T23:09:32.000Z

I tried GT4HistOCR but it says
Failed to load any lstm-specific dictionaries for lang GT4HistOCR!!
Is there a GT4HistOCR.traineddata that includes a dictionary? Thanks.

Answer 6 · 2022-03-27T23:25:38.000Z

I guess we will need to produce a fine-tuned model to include a dictionary? We can't just use it like eng.traineddata (which already includes a dictionary).

Answer 7 · 2022-03-28T06:16:53.000Z

When a new model is trained, it never includes a dictionary. Older versions of Tesseract therefore show a message (which is only a hint, not an error). Either use such models without a dictionary (which usually works quite well), or add a dictionary yourself.
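
For anyone who wants to try adding a dictionary, here is a rough sketch of the usual steps with the Tesseract training tools `combine_tessdata` and `wordlist2dawg`: unpack the model, build a word DAWG from a plain word list against the model's LSTM unicharset, then write the DAWG back into the model. The helper name `dictionary_commands`, the file `words.txt`, and the `unpacked` directory are assumptions for illustration only, not something from this thread. The script only builds and prints the command lines, so check them against the tools installed on your system before running anything.

```python
import shlex


def dictionary_commands(model, wordlist, workdir="unpacked"):
    """Return the command lines for adding a word list to an LSTM model.

    Hypothetical helper: it only assembles argument lists, it runs nothing.
    `workdir` must already exist before the unpack step is executed.
    """
    # Language name is the file name without directory and extension.
    name = model.replace("\\", "/").split("/")[-1]
    lang = name[: -len(".traineddata")] if name.endswith(".traineddata") else name
    # combine_tessdata writes components as <prefix><component>.
    prefix = f"{workdir}/{lang}."
    word_dawg = prefix + "lstm-word-dawg"
    unicharset = prefix + "lstm-unicharset"
    return [
        # 1. Unpack all components of the model.
        ["combine_tessdata", "-u", model, prefix],
        # 2. Compile the word list into a DAWG using the model's unicharset.
        ["wordlist2dawg", wordlist, word_dawg, unicharset],
        # 3. Overwrite (or add) the word DAWG component in the model file.
        ["combine_tessdata", "-o", model, word_dawg],
    ]


if __name__ == "__main__":
    # Assumed file names; replace them with your own paths.
    for cmd in dictionary_commands("GT4HistOCR.traineddata", "words.txt"):
        print(shlex.join(cmd))
```

Work on a copy of the model, since the last step modifies the traineddata file in place. The word list should only contain characters that the model's unicharset knows.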