tesseract-ocr/langdata

Add Latin Extended-A script for Polynesian languages

HURIMOZ opened this issue · 13 comments

Hi,
we work with Polynesian languages and we need to have the Latin Extended-A script installed.
Thanks in advance for your reply,
Tamatoa

Did you try Tesseract 4.0 with 'Latin' or 'lat' traineddata?

https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast

Hi,
thanks for your reply.
I'm running Tesseract 3.03 with Leptonica, not from source code, on Ubuntu 14.
Can I install the Latin traineddata with this?

@HURIMOZ You can install Tesseract 4.0 alpha for Ubuntu 14 from Alex's PPA - please see https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-ppa

The traineddata files Amit referred to will work with it.

In fact I don't need trained data for Latin. I just need the system to recognize the Latin Extended-A characters so it can output the macrons (diacritics) over the vowels: ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū.
Currently the system renders these vowels without the macrons, and my images are of very good quality.

I just need these ten characters: ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū.
Thanks
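
For reference, the ten characters listed above can be identified by code point. The following is a minimal Python sketch, not part of Tesseract; the helper names `in_latin_extended_a` and `describe` are made up for illustration. It assumes the characters are the precomposed forms, which all fall inside the Latin Extended-A block (U+0100 to U+017F).

```python
import unicodedata

# The ten macron vowels from this issue, written as escapes so the
# precomposed (NFC) forms are unambiguous: ā ē ī ō ū Ā Ē Ī Ō Ū
MACRON_VOWELS = "\u0101\u0113\u012b\u014d\u016b\u0100\u0112\u012a\u014c\u016a"


def in_latin_extended_a(ch: str) -> bool:
    """Return True if ch lies in the Latin Extended-A block (U+0100..U+017F)."""
    return 0x0100 <= ord(ch) <= 0x017F


def describe(chars: str = MACRON_VOWELS):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in chars]


if __name__ == "__main__":
    for ch, code_point, name in describe():
        print(ch, code_point, name, in_latin_extended_a(ch))
```

Text from other tools may carry the macron as a separate combining character (for example `a` followed by U+0304); `unicodedata.normalize("NFC", text)` folds that into the precomposed form before comparing.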

Hi, did you do something particular with these characters? Are they now included in a language pack?

You can try your own training. Otherwise you have to wait for @theraysmith to upload new langdata, traineddata etc.

@HURIMOZ Please try https://github.com/tesseract-ocr/tessdata_fast/raw/master/ton.traineddata for Tongan.

It has support for ā, ē and Ā, Ē.

@theraysmith Support is still needed for the following characters for Polynesian languages:

ī, ō, ū, Ī, Ō, Ū.

> In fact I don't need trained data for latin.

Latin.traineddata is for Latin script (not Latin language) and its unicharset has ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū.
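
One way to verify this for any traineddata file is to look at its unicharset. The following is a rough Python sketch under two assumptions: that the unicharset has been unpacked to a text file (for instance with `combine_tessdata -u` from the training tools, if they are installed), and that the file follows the usual layout of a count on the first line followed by one entry per line beginning with the symbol and a space. The function names and the sample text are made up for illustration; the sample is not the real content of any traineddata file.

```python
# The ten macron vowels from this issue: ā ē ī ō ū Ā Ē Ī Ō Ū
MACRON_VOWELS = "\u0101\u0113\u012b\u014d\u016b\u0100\u0112\u012a\u014c\u016a"


def unicharset_symbols(text: str) -> set:
    """Collect the first field of every entry line in unicharset text.

    Assumes line 1 is the entry count and each later line starts with
    the symbol followed by a space and its properties.
    """
    symbols = set()
    for line in text.splitlines()[1:]:
        if line.strip():
            symbols.add(line.split(" ", 1)[0])
    return symbols


def missing_macrons(text: str) -> list:
    """Return the macron vowels that do not appear in the unicharset text."""
    have = unicharset_symbols(text)
    return [ch for ch in MACRON_VOWELS if ch not in have]


# Hypothetical sample covering only ā, ē, Ā, Ē; the trailing "0" fields
# are placeholders, not real unicharset properties.
SAMPLE = "5\nNULL 0\n\u0101 0\n\u0113 0\n\u0100 0\n\u0112 0\n"

if __name__ == "__main__":
    print(missing_macrons(SAMPLE))
```

With a real file, reading it with `open(path, encoding="utf-8").read()` and passing the text to `missing_macrons` should give an empty list when all ten characters are covered.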

Please try with 4.00 version of tesseract.
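
A minimal sketch of such a run from Python, assuming the `tesseract` 4.0 binary is on the PATH and `Latin.traineddata` has been copied into the tessdata directory. The image name `page.png`, the helper names, and the optional `tessdata_dir` argument are placeholders for illustration.

```python
import subprocess


def build_command(image: str, lang: str = "Latin", tessdata_dir: str = None) -> list:
    """Build the tesseract command line; 'stdout' sends the text to standard output."""
    cmd = ["tesseract", image, "stdout", "-l", lang]
    if tessdata_dir:
        # Point tesseract at the directory that holds <lang>.traineddata.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def ocr(image: str, lang: str = "Latin", tessdata_dir: str = None) -> str:
    """Run tesseract and return the recognized text (raises if tesseract fails)."""
    result = subprocess.run(
        build_command(image, lang, tessdata_dir),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command only; call ocr("page.png") once tesseract is installed.
    print(" ".join(build_command("page.png")))
```

If the recognized text still lacks the macrons, comparing a run with `lang="ton"` against `lang="Latin"` on the same image may help narrow down whether the model or the image is the cause.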

@HURIMOZ

As mentioned earlier, you can install Tesseract 4.0 for Ubuntu 14 from Alex's PPA - please see https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-ppa

I do not think Google will make changes to the Tesseract 3.0x traineddata files. If you plan to stay on legacy Tesseract, you can try training for your particular requirements.