tesseract-ocr/tesseract

Various Hebrew fonts that need better recognition


Various Hebrew fonts on this page have recognition issues with Tesseract...

https://opensiddur.org/help/fonts/

You can click a font name there to see the characters it contains. If the font includes "Nikkud" (vowel marks), those are shown too, so support for "Nikkud" could be improved as well.

They also have a PDF with sample text in those fonts.

I've tried it with Tesseract 4, using heb.traineddata from both https://github.com/tesseract-ocr/tessdata and https://github.com/tesseract-ocr/tessdata_fast.
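For anyone who wants to repeat that comparison, here is a minimal sketch in Python that runs the `tesseract` command-line tool once per model directory. It assumes `tesseract` is on the PATH and that each heb.traineddata was downloaded into its own local folder; the folder names and `sample.png` are hypothetical placeholders, not files from this issue.

```python
import subprocess


def build_command(image, tessdata_dir, lang="heb"):
    """Build a tesseract CLI call that prints recognized text to stdout."""
    return [
        "tesseract", str(image), "stdout",
        "-l", lang,
        "--tessdata-dir", str(tessdata_dir),
    ]


def ocr(image, tessdata_dir, lang="heb"):
    """Run tesseract and return the recognized text (requires tesseract installed)."""
    result = subprocess.run(
        build_command(image, tessdata_dir, lang),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


def compare_models(image, model_dirs):
    """Return {model name: recognized text} for each tessdata directory."""
    return {name: ocr(image, path) for name, path in model_dirs.items()}


# Hypothetical usage; the paths below are placeholders:
# texts = compare_models("sample.png", {
#     "tessdata": "models/tessdata",
#     "tessdata_fast": "models/tessdata_fast",
# })
```

Keeping the two models in separate directories avoids renaming heb.traineddata between runs.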

Historical typefaces need better recognition. It would also be good to add support for cursive Hebrew fonts: the letters aren't joined and there is no uppercase or lowercase to worry about, so this should be easier than supporting a Latin font that looks "handwritten". Text in those styles appears in some books and sometimes on web pages too.

Hi @MaxPower85!

> Various Hebrew fonts on this page have recognition issues with Tesseract...
>
> https://opensiddur.org/help/fonts/

I added a link to that page a while ago.
https://github.com/tesseract-ocr/tesseract/wiki/Fonts#hebrew-fonts

Related issue:
tesseract-ocr/langdata#82

BTW... I see you mentioned cantillation marks too...

I was about to suggest that it may be a good idea to test Hebrew OCR on texts that include cantillation marks, even if the marks are left out of the recognized text, so the OCR doesn't get "confused" by those additional marks.

Cantillation marks aren't usually used in non-Biblical texts, but many books and articles include quotes from the Bible, and all the words in those quotes should be recognized correctly. Also, with the growing popularity of mobile OCR apps that translate text for people who can't read a language, someone may well use an OCR app to look up what a word in a Biblical passage means.

I'm not sure there is a need to include cantillation marks in the recognized text, since people can easily find Biblical texts with cantillation marks online if they need them for some passage. But when someone quotes a passage from the Bible, the words in that quote would be recognized more accurately if the OCR had been tested on texts with cantillation marks.
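As an illustration of leaving the marks out of the recognized text, here is a minimal post-processing sketch in Python. It assumes the cantillation marks are the Unicode code points U+0591 to U+05AF (the Hebrew accents plus the masora circle) and that Nikkud (U+05B0 and above) should be kept. The function name is made up for this example, and this is not something Tesseract does itself.

```python
import re

# Assumption: U+0591..U+05AF are the Hebrew accents (cantillation marks)
# plus the masora circle; vowel points start at U+05B0 and are left alone.
CANTILLATION = re.compile("[\u0591-\u05AF]")


def strip_cantillation(text: str) -> str:
    """Remove cantillation marks but keep letters, Nikkud, and punctuation."""
    return CANTILLATION.sub("", text)


# The first word of Genesis with Nikkud and one cantillation mark (U+0596).
sample = "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05B4\u05C1\u0596\u05D9\u05EA"
cleaned = strip_cantillation(sample)
print(len(sample), len(cleaned))  # the cleaned string is one code point shorter
```

The same function could be applied to ground-truth text when evaluating OCR output, so that both sides of the comparison omit the marks.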

If someone wants to read more about cantillation marks, here's an article about that: https://en.wikipedia.org/wiki/Cantillation

Usually, if a quote from the Bible is used in a book or an article, the cantillation marks are dropped.