tesseract-ocr/langdata

Geresh and Gershayim are not included

yarons opened this issue 7 years ago · 11 comments

yarons commented 7 years ago

https://github.com/tesseract-ocr/langdata/blob/106c9b31bea9d30814fc116cbcb9c267dee7df70/heb/heb.training_text

I couldn't find the Hebrew punctuation marks Geresh or Gershayim in the training text linked above.

https://en.wikipedia.org/wiki/Geresh
https://en.wikipedia.org/wiki/Gershayim

These characters were not widely used until fairly recently, when a new keyboard layout was introduced.
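For anyone who wants to reproduce the check, here is a minimal sketch that counts the two characters in a string. Geresh is U+05F3 and Gershayim is U+05F4 in Unicode; the function name `count_hebrew_punctuation` and the sample string are made up for illustration.

```python
import unicodedata

GERESH = "\u05F3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"  # HEBREW PUNCTUATION GERSHAYIM


def count_hebrew_punctuation(text: str) -> dict:
    """Count occurrences of Geresh and Gershayim, keyed by Unicode name."""
    return {unicodedata.name(ch): text.count(ch) for ch in (GERESH, GERSHAYIM)}


# Hypothetical sample containing one Gershayim and one Geresh.
sample = "צה\u05F4ל ג\u05F3ירפה"
print(count_hebrew_punctuation(sample))
```

To run it against the real file, read `heb/heb.training_text` with `open(path, encoding="utf-8").read()` and pass the result to the function.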

amitdo commented 7 years ago

Duplicate of #82 (comment)

amitdo commented 7 years ago

Anyway, *.training_text files have not been updated for years.
They are automatically generated from a web corpus.

yarons commented 7 years ago

Is there a way to influence which web pages get scanned?

amitdo commented 7 years ago

Yes, with some hints from other files.

I don't remember the fine details right now.

amitdo commented 7 years ago

https://github.com/tesseract-ocr/langdata/blob/master/ces/desired_characters

amitdo commented 7 years ago

The opposite:
https://github.com/tesseract-ocr/langdata/blob/master/ara/forbidden_characters

amitdo commented 7 years ago

I think 'desired_words' and 'forbidden_words' can also be used.

Shreeshrii commented 7 years ago

These lists are used in Ray's synthetic training data creation pipeline. As far as I know, the tesstrain.sh training process does not use them.

On Thu 5 Jul, 2018, 9:26 PM Amit D. wrote: "I think 'desired_words' and 'forbidden_words' can also be used."

amitdo commented 7 years ago

True.

amitdo commented 7 years ago

tesseract-ocr/tessdata#62 (comment)

theraysmith commented on Aug 3, 2017

FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training. To help prevent the ß/B confusion, the words that you want to lose from the wordlists need to go in langdata/lang/lang.bad_words.

So for undesired words a 'lang.bad_words' file should be used.
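To illustrate the idea only: the sketch below drops every entry of a bad-words list from a generated wordlist. It is not the actual training pipeline, which is not shown in this thread. The function name `filter_wordlist`, the one-word-per-line format, and the sample words are all assumptions.

```python
def filter_wordlist(words, bad_words):
    """Return words with every entry of bad_words removed, preserving order.

    Assumes one word per entry; surrounding whitespace and blank entries
    are ignored.
    """
    banned = {w.strip() for w in bad_words if w.strip()}
    return [w.strip() for w in words if w.strip() and w.strip() not in banned]


# Hypothetical data echoing the ß/B confusion mentioned above.
wordlist = ["groß", "groB", "Straße"]
bad = ["groB"]
print(filter_wordlist(wordlist, bad))
```

Keeping the unwanted words in a separate file means they survive regeneration of the wordlist, which is the point Ray makes above.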

amitdo commented 7 years ago

vie has an 'alphabet' file:
https://github.com/tesseract-ocr/langdata/blob/master/vie/alphabet