Add Latin Extended-A script for Polynesian languages

Question

HURIMOZ opened this issue 7 years ago · 13 comments

Hi,
we work with Polynesian languages and we need to have the Latin Extended-A script installed.
Thanks in advance for your reply,
Tamatoa

Answer 1 · 2017-10-15T11:07:26.000Z

Did you try Tesseract 4.0 with 'Latin' or 'lat' traineddata?

https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast
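
For reference, here is a minimal sketch of calling Tesseract with a script model from Python. It assumes a `tesseract` 4.x binary is on the PATH and that `Latin.traineddata` from one of the repositories above has been copied into the tessdata directory. The helper names and the image name `page.png` are hypothetical and only serve as illustration.

```python
import subprocess


def build_tesseract_command(image_path, lang="Latin", tessdata_dir=None):
    """Build the argument list for a Tesseract CLI run that writes text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir is not None:
        # Point Tesseract at a custom folder holding the .traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image_path, lang="Latin", tessdata_dir=None):
    """Run Tesseract and return the recognized text (needs tesseract installed)."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang, tessdata_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

A call such as `print(run_ocr("page.png"))` would then print the recognized text. If the script models are kept in a `script/` subfolder, as they are laid out in the repositories, the language argument would presumably be `script/Latin` instead.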

Answer 2 · 2017-10-15T12:16:02.000Z

Hi,
thanks for your reply.
I'm running Tesseract 3.03 with Leptonica on Ubuntu 14, installed as a package rather than built from source.
Can I install the Latin traineddata with this?

Answer 3 · 2017-10-15T13:03:01.000Z

See here:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-304305

Answer 4 · 2017-10-15T13:34:46.000Z

@HURIMOZ You can install Tesseract 4.0 alpha on Ubuntu 14 from Alex's PPA. Please see https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-ppa

The traineddata files Amit referred to will work with it.

Answer 5 · 2017-10-16T00:08:44.000Z

In fact, I don't need trained data for Latin. I just need the system to recognize the Latin Extended-A script so it can render the macrons (diacritics) over the vowels: ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū.
Currently the system renders these vowels without the macrons, even though my images are of very good quality.

Answer 6 · 2017-10-16T07:11:27.000Z

Please make a list of the additional characters needed if the whole Latin Extended-A range is not required.

Answer 7 · 2017-10-16T07:17:36.000Z

I just need these ten characters: ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū.
Thanks
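
For clarity, the ten characters can be listed with their Unicode code points. The sketch below uses only the Python standard library; the helper name `describe` is hypothetical. It checks that each character falls in the Latin Extended-A block (U+0100 to U+017F).

```python
import unicodedata

MACRON_VOWELS = "āēīōūĀĒĪŌŪ"
LATIN_EXTENDED_A = range(0x0100, 0x0180)  # U+0100..U+017F inclusive


def describe(chars):
    """Return (char, code point, Unicode name, in Latin Extended-A?) per character."""
    rows = []
    # NFC makes sure each vowel is a single precomposed code point.
    for ch in unicodedata.normalize("NFC", chars):
        rows.append(
            (ch, "U+%04X" % ord(ch), unicodedata.name(ch), ord(ch) in LATIN_EXTENDED_A)
        )
    return rows


for ch, code_point, name, in_block in describe(MACRON_VOWELS):
    print(ch, code_point, name, in_block)
```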

Answer 8 · 2017-10-25T10:20:21.000Z

Hi, did you do something particular with these characters? Are they now included in a language pack?

Answer 9 · 2017-10-25T11:01:41.000Z

You can try your own training. Otherwise you have to wait for @theraysmith to upload new langdata, traineddata etc.

Answer 10 · 2018-02-23T09:47:34.000Z

@HURIMOZ Please try https://github.com/tesseract-ocr/tessdata_fast/raw/master/ton.traineddata for Tongan.

It has support for ā, ē and Ā, Ē.

@theraysmith Support is still needed for the following characters for Polynesian languages:

ī, ō, ū, Ī, Ō, Ū.

Answer 11 · 2018-02-23T10:00:07.000Z

> In fact, I don't need trained data for Latin.

Latin.traineddata is for the Latin script (not the Latin language), and its unicharset includes ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū.

Please try it with version 4.00 of Tesseract.
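
To confirm which of the required characters a model's unicharset covers, something like the following sketch could be used. It assumes the unicharset has already been extracted to a text file (for example with `combine_tessdata -u`), and that the file follows the usual layout: an entry count on the first line, then one entry per line starting with the symbol. The function names and the file name in the usage note are hypothetical.

```python
import unicodedata

REQUIRED = "āēīōūĀĒĪŌŪ"


def unicharset_symbols(text):
    """Return the set of symbols listed in the content of a unicharset file."""
    symbols = set()
    for line in text.splitlines()[1:]:  # the first line holds the entry count
        fields = line.split()
        if fields:
            # Assumption: the symbol is the first whitespace-separated field.
            symbols.add(unicodedata.normalize("NFC", fields[0]))
    return symbols


def missing_characters(unicharset_text, required=REQUIRED):
    """List the required characters that the unicharset does not contain."""
    symbols = unicharset_symbols(unicharset_text)
    return [ch for ch in unicodedata.normalize("NFC", required) if ch not in symbols]
```

For example, `missing_characters(open("Latin.lstm-unicharset", encoding="utf-8").read())` should return an empty list if all ten vowels are present.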

Answer 12 · 2018-02-23T22:16:33.000Z

I'm using Ubuntu 14, so I can't use Tesseract 4.00.

Answer 13 · 2018-02-24T03:41:36.000Z

@HURIMOZ

As mentioned earlier, you can install Tesseract 4.0 on Ubuntu 14 from Alex's PPA. Please see https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-ppa

I do not think Google will make changes to the traineddata files for Tesseract 3.0x. If you plan to use legacy Tesseract, you can try training for your particular requirements.