tesseract-ocr/langdata

Improve yor.traineddata for Yoruba

Shreeshrii opened this issue · 9 comments

@theraysmith

See https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/RF1rk3-z4uo/noQzBWbuCAAJ

Message from @Timilehin copied below

I am working on a side project in Yoruba that might be helpful. It predicts the right diacritics on unmarked Yoruba words. I imagine you could also run the OCR allowing only unmarked characters as output (maybe reduce the height of the scan window so it doesn't see the diacritics), then pipe the unmarked characters through the tool I'm building and use its output as a fallback for when the image recognition is not sure.

My project right now needs more training data to make the model more robust. It is very tough to find properly marked Yoruba text on the internet. I have physical books and some scanned PDFs on archive.org that I would like to transform into text, but yor.traineddata doesn't seem robust enough. It makes many mistakes, such as ọdọ instead of ẹdẹ.
Other times, it just spits out gibberish.
What can I provide to help make yor.traineddata much better, and in what quantity (e.g. 200 page images of Yoruba text plus the Yoruba text they contain)? I think both projects can reinforce each other. I look forward to hearing back.

Link to the project: https://github.com/Timilehin/Yoruba-Intonator

http://crubadan.org/languages/yo

An Crúbadán - Corpus Building for Minority Languages (Yoruba page)

Thanks @Shreeshrii for creating an issue for this. I looked at the Crúbadán corpus. Most of the URLs it scrapes from contain Yoruba that is not properly marked. Given the high noise-to-signal ratio, I don't think it would be good to train on that (or on most web-scraped data).
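As a rough first pass at separating marked from unmarked text in a scraped corpus, one could keep only the lines that contain at least one diacritic-bearing Yoruba letter. This is a sketch, not part of any existing tooling: the file names are hypothetical, and the character class covers only the common precomposed (NFC) forms, so text in NFD form would also need the combining marks U+0300, U+0301, and U+0323 added to the class.

```shell
# corpus.txt: scraped Yoruba text, one sentence per line (hypothetical file).
# Keep only lines containing at least one diacritic-bearing letter; lines
# with no marks at all are likely unmarked (noisy) Yoruba.
grep -E '[àáèéìíòóùúẹọṣÀÁÈÉÌÍÒÓÙÚẸỌṢ]' corpus.txt > marked.txt
```

A filter like this over-accepts partially marked sentences, so it is a recall-oriented pre-filter, not a substitute for manual review.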

I currently have two websites that reliably always have properly marked Yoruba. I am thinking of taking screenshots of the text and also supplying the text itself in plain-text form. I think this would be a good starting point for improving the model. Does this idea sound good?

Making screenshots is not very useful. You need the text itself. A web crawler is what you need to use.

Please list the URLs of those two sites.

Did you try extracting the wordlist from yor.traineddata and examining it?
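For reference, the shipped wordlist can be recovered with the Tesseract training tools. This is a sketch that assumes the training tools are installed and yor.traineddata is in the current directory; the output file names follow the conventional `<lang>.` prefix.

```shell
# Unpack the components of yor.traineddata (unicharset, DAWGs, etc.)
combine_tessdata -u yor.traineddata yor.

# Convert the word DAWG back into a plain-text wordlist for inspection
dawg2wordlist yor.unicharset yor.word-dawg yor.wordlist
```

Inspecting yor.wordlist shows which words the current model's language data actually knows, which helps judge whether the dictionary or the recognition is the weak point.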

@amitdo I meant my last message in the context of useful training data for Tesseract's yor.traineddata, not my project. Please confirm that this OCR system trains its models on text only, rather than on images paired with the text they contain.

The urls are:

  1. http://www.theyorubablog.com
  2. https://www.jw.org/yo/
  3. https://yo.m.wikipedia.org/wiki/Èdè_Yorùbá

Wikipedia (3) only has marked Yoruba on that first page. Every page it links to (and every other page on yo.wikipedia.org that I've seen) is not properly marked. This is not the case for 1 and 2.
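For the crawler suggested above, a hedged sketch with wget might look like the following. The depth, rate limit, and target site are illustrative only; check each site's robots.txt and terms of use before crawling, and the fetched HTML still needs to be converted to plain text afterwards.

```shell
# Recursively fetch HTML pages from one of the listed sites,
# politely (one second between requests), staying within the domain.
wget --recursive --level=2 --wait=1 \
     --accept html,htm --no-parent \
     --domains www.theyorubablog.com \
     http://www.theyorubablog.com/
```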

The images for training are created by the text2image tool. It renders images from text files using a variety of digital fonts.
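For example, rendering a training text with a single font might look like this. It is a sketch: the file names and the font are illustrative, and the exact flags and surrounding workflow differ between the Tesseract 3.x and 4.x training flows.

```shell
# Render a training text into a page image plus box file using one font.
text2image --text=yor.training_text \
           --outputbase=yor.NotoSans.exp0 \
           --font='Noto Sans' \
           --fonts_dir=/usr/share/fonts
```

The same training text is typically rendered once per font, so a plain-text corpus (like the one offered below) is exactly what the pipeline consumes.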

Ah, I see. I probably should have read the docs more carefully. But that's very interesting; I wouldn't have thought to do that.
In that case, you can have all of my hand-picked, fresh, and fully marked Yoruba corpus (harvested from those three sites) here:
https://github.com/Timilehin/Yoruba-Intonator/blob/master/yoruba_sentences.txt

The only thing to note is that I broke them down into one sentence per line. I hope that doesn't affect the model. I will keep adding more as I find them.

Any updates on this? Anything I can be doing on my end?

I am hoping that @theraysmith will include your resources for his next training.

Any updates on this?