Geresh and Gershayim are not included
yarons opened this issue · 11 comments
I couldn't find the Hebrew punctuation marks Geresh or Gershayim in the following text.
https://en.wikipedia.org/wiki/Geresh
https://en.wikipedia.org/wiki/Gershayim
These were not widely used until fairly recently, when a new keyboard layout was introduced.
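For anyone who wants to verify this on their own copy, here is a minimal sketch that counts the two marks in a string. It assumes the dedicated Unicode code points U+05F3 (Geresh) and U+05F4 (Gershayim); the helper name `count_marks` is made up for illustration and is not part of Tesseract or langdata.

```python
# Dedicated Unicode code points for the two Hebrew punctuation marks.
GERESH = "\u05F3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"  # HEBREW PUNCTUATION GERSHAYIM


def count_marks(text: str) -> dict:
    """Count occurrences of Geresh and Gershayim in the given text.

    Hypothetical helper for illustration only.
    """
    return {
        "geresh": text.count(GERESH),
        "gershayim": text.count(GERSHAYIM),
    }


# Two short Hebrew tokens: gimel + Geresh, and an acronym containing Gershayim.
sample = "\u05D2\u05F3 \u05E6\u05D4\u05F4\u05DC"
print(count_marks(sample))  # {'geresh': 1, 'gershayim': 1}
```

To check a real file, read it with `encoding="utf-8"` and pass its contents to `count_marks`. Note that this only looks for the dedicated code points; an ASCII apostrophe or double quote typed as a substitute would not be counted.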
Duplicate of #82 (comment)
Anyway, *.training_text files have not been updated for years.
They are automatically generated from a web corpus.
Is there a way to influence which web pages are scanned?
Yes, with some hints from other files.
I don't remember the fine details right now.
I think 'desired_words' and 'forbidden_words' can also be used.
True.
tesseract-ocr/tessdata#62 (comment)
theraysmith commented on Aug 3, 2017
FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training. To help prevent the ß/B confusion, the words that you want to lose from the wordlists need to go in langdata/lang/lang.bad_words.
So for undesired words, a 'lang.bad_words' file should be used.
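To illustrate the intended effect of a bad_words file, here is a rough sketch that drops listed words from a wordlist. This is only an assumption about the outcome, not the actual logic of the Tesseract training tools; the function name `filter_wordlist` and the one-word-per-line handling are hypothetical.

```python
def filter_wordlist(words, bad_words):
    """Return the words that do not appear in bad_words.

    Hypothetical illustration: both inputs are iterables of strings,
    e.g. lines read from a wordlist and from a bad_words file.
    """
    # Normalise the bad words once, ignoring blank lines.
    bad = {w.strip() for w in bad_words if w.strip()}
    # Keep non-empty words, in their original order, that are not listed as bad.
    return [w.strip() for w in words if w.strip() and w.strip() not in bad]


# Example in the spirit of the ß/B confusion mentioned above.
wordlist = ["Straße", "StraBe", "Haus"]
bad = ["StraBe"]
print(filter_wordlist(wordlist, bad))  # ['Straße', 'Haus']
```

The point is that the generated wordlist stays untouched as a source, and the exclusions live in a separate file that survives regeneration.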
vie has an 'alphabet' file:
https://github.com/tesseract-ocr/langdata/blob/master/vie/alphabet