tesseract-ocr/langdata

Geresh and Gershayim are not included

yarons opened this issue · 11 comments

https://github.com/tesseract-ocr/langdata/blob/106c9b31bea9d30814fc116cbcb9c267dee7df70/heb/heb.training_text

I couldn't find the Hebrew punctuation marks Geresh or Gershayim in the training text linked above.

https://en.wikipedia.org/wiki/Geresh
https://en.wikipedia.org/wiki/Gershayim

These were not widely used until fairly recently, when a new keyboard layout was introduced.
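A quick way to check a text for these marks is to count their Unicode code points, U+05F3 (Geresh) and U+05F4 (Gershayim). This is only a minimal sketch: the function name `count_hebrew_punctuation` and the sample string are made up for illustration and are not part of langdata.

```python
GERESH = "\u05F3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"  # HEBREW PUNCTUATION GERSHAYIM


def count_hebrew_punctuation(text: str) -> dict:
    """Count Geresh and Gershayim code points in a string.

    ASCII apostrophes (') and double quotes ("), which are often typed
    as substitutes, are deliberately not counted.
    """
    return {
        "geresh": text.count(GERESH),
        "gershayim": text.count(GERSHAYIM),
    }


# Made-up sample: one word with Gershayim, one letter followed by Geresh.
sample = "צה\u05F4ל ג\u05F3"
print(count_hebrew_punctuation(sample))
```

To run it against a local copy of the training text, read the file with `open(path, encoding="utf-8").read()` and pass the result to the function; the path depends on where the repository is checked out.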

Duplicate of #82 (comment)

Anyway, *.training_text files have not been updated for years.
They are automatically generated from a web corpus.

Is there a way to affect the scanned webpages?

Yes, with some hints from other files.

I don't remember the fine details right now.

I think 'desired_words' and 'forbidden_words' can also be used.

True.

tesseract-ocr/tessdata#62 (comment)

theraysmith commented on Aug 3, 2017

FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training. To help prevent the ß/B confusion, the words that you want to lose from the wordlists need to go in langdata/lang/lang.bad_words.

So for undesired words, a 'lang.bad_words' file should be used.
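To illustrate the intended effect, here is a rough sketch of removing bad words from a wordlist. It is not the actual training code: it assumes both inputs hold one word per line, and `filter_wordlist` and the sample words are hypothetical.

```python
def filter_wordlist(words, bad_words):
    """Return the words that do not appear in bad_words.

    Both arguments are iterables of lines; surrounding whitespace is
    ignored and blank lines in bad_words are skipped (assumed format).
    """
    bad = {w.strip() for w in bad_words if w.strip()}
    return [w for w in words if w.strip() not in bad]


# Made-up example of the ß/B confusion: "groB" is an unwanted variant of "groß".
wordlist = ["groß", "groB", "Straße"]
bad = ["groB\n", ""]
print(filter_wordlist(wordlist, bad))
```

Keeping the unwanted entries in a separate file rather than editing the generated wordlist means they survive the next regeneration, which is the point made in the comment above.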