Multilingual OCR Development Plan

Question

D-DanielYang opened this issue 4 years ago · 72 comments

model name	description	model size	download	Update Date
ch	Chinese and English	3.71M	inference model / trained model	2020.9.22
ch_tra	Chinese Traditional	5.63M	inference model / trained model	2021.1.21
en	English	2.56M	inference model / trained model	2020.9.22
fr	French	2.65M	inference model / trained model	2021.9.22
ar	Arabic	2.53M	inference model / trained model	2021.1.21
es	Spanish	2.53M	inference model / trained model	2021.1.21
pt	Portuguese	2.63M	inference model / trained model	2021.1.21
ru	Russian	2.63M	inference model / trained model	2021.1.21
ge	German	2.65M	inference model / trained model	2020.9.22
kr	Korean	3.9M	inference model / trained model	2020.9.22
jp	Japanese	4.23M	inference model / trained model	2020.9.22
it	Italian	2.53M	inference model / trained model	2021.1.21
hi	Hindi	2.63M	inference model / trained model	2021.1.21
ug	Uyghur	2.63M	inference model / trained model	2021.1.21
fa	Persian	2.63M	inference model / trained model	2021.1.21
ur	Urdu	2.63M	inference model / trained model	2021.1.21
oc	Occitan	2.53M	inference model / trained model	2021.1.21
mr	Marathi	2.63M	inference model / trained model	2021.1.21
ne	Nepali	2.63M	inference model / trained model	2021.1.21
rs_cyrillic	Serbian(cyrillic)	2.63M	inference model / trained model	2021.1.21
rs_latin	Serbian(latin)	2.53M	inference model / trained model	2021.1.21
bg	Bulgarian	2.63M	inference model / trained model	2021.1.21
uk	Ukrainian	2.63M	inference model / trained model	2021.1.21
be	Belarusian	2.63M	inference model / trained model	2021.1.21
te	Telugu	2.63M	inference model / trained model	2021.1.21
kn	Kannada	2.63M	inference model / trained model	2021.1.21
ta	Tamil	2.63M	inference model / trained model	2021.1.21
mg	Mongolian	--	Ongoing
bg	Bangla	--	Need dict and corpus
bm	Burmese	--	Need dict and corpus	call for contribution
ku_cent	Kurdish Central	--	PR8347	call for contribution
od	Odia	--	PR6348	call for contribution
th	Thai	--	PR6719 issue chat	call for contribution
	More		TBC

Guideline for new language requests

To request support for a new language, open a PR that adds the following two files:

1. In the folder ppocr/utils/dict, add a dictionary file named {language}_dict.txt that lists every character of the language. See the other files in that folder for the format.
2. In the folder ppocr/utils/corpus, add a corpus file named {language}_corpus.txt that contains a list of words in your language. At least 50,000 words per language are probably needed; the more, the better.

We are calling for contributions to add new language support to PaddleOCR. For anyone interested in training a model for a new language, guidance on training is provided.

If your language has unique characteristics, please let me know in advance in any way you like, for example with useful links or Wikipedia articles.
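
As an illustration of the two files requested above, here is a minimal sketch that derives a character dictionary from a corpus and counts the words in it. It assumes a one-character-per-line dict layout (check the existing files in ppocr/utils/dict before relying on this) and a corpus of whitespace-separated words; the function names and the default output folder are illustrative, not part of PaddleOCR.

```python
from pathlib import Path


def build_char_dict(corpus_lines):
    """Collect every distinct non-whitespace character, sorted by code point."""
    chars = set()
    for line in corpus_lines:
        chars.update(ch for ch in line if not ch.isspace())
    return sorted(chars)


def count_words(corpus_lines):
    """Return (total, unique) counts of whitespace-separated words."""
    words = [word for line in corpus_lines for word in line.split()]
    return len(words), len(set(words))


def write_dict(language, corpus_path, out_dir="ppocr/utils/dict"):
    """Write {language}_dict.txt, assuming one character per line."""
    lines = Path(corpus_path).read_text(encoding="utf-8").splitlines()
    out_path = Path(out_dir) / f"{language}_dict.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n".join(build_char_dict(lines)) + "\n", encoding="utf-8")
    return out_path
```

Comparing the total returned by `count_words` against the suggested 50,000-word minimum gives a quick sanity check before opening a PR.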

Answer 1 · 2020-11-02T03:41:31.000Z

Traditional Mongolian

Answer 2 · 2020-11-08T07:41:48.000Z

I would love to work on "Bangla"

Answer 3 · 2020-11-10T08:58:05.000Z

I would be very happy if you did this for Vietnamese.

Answer 4 · 2020-11-10T22:02:49.000Z

How about Arabic? That would be great.

Answer 5 · 2020-11-18T08:05:22.000Z

I've found that the PaddleOCR algorithm cannot recognize some special characters (such as the comma, semicolon, or period) when the language is English. Is there any way I can fix this problem?
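
One thing that may be worth checking (an assumption, not something confirmed in this thread) is whether the missing punctuation is present in the character dictionary the recognition model was trained with, since a recognizer can only output characters from its dictionary. The sketch below assumes a one-character-per-line dict file; the helper names are hypothetical.

```python
def load_dict_chars(dict_lines):
    """Treat each non-empty line of a dict file as one recognizable character."""
    return {line.rstrip("\n") for line in dict_lines if line.rstrip("\n")}


def missing_chars(text, dict_chars):
    """Return characters of `text` absent from the dict, in first-seen order.

    Whitespace is skipped because it is usually handled separately.
    """
    missing = []
    for ch in text:
        if ch.isspace() or ch in dict_chars or ch in missing:
            continue
        missing.append(ch)
    return missing
```

For example, `missing_chars("a, b; c.", load_dict_chars(open(path, encoding="utf-8")))` lists the punctuation marks a given dict file does not cover, where `path` points at the dict in question.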

Answer 6 · 2020-11-27T22:50:17.000Z

I would like to contribute by adding the Burmese language. Is submitting the two text files (dict and corpus) all that is needed? What further steps do we need to take?

Answer 7 · 2020-11-28T02:31:07.000Z

Adding "Bangla" would be great for people in South Asia.

Answer 8 · 2020-12-07T02:47:05.000Z

Adding "Traditional Chinese (zh-TW)" would be great support.

Answer 9 · 2020-12-07T10:50:52.000Z

Do you have a pretrained Russian recognition model?

Answer 10 · 2020-12-21T16:16:32.000Z

Hi, I would be very grateful if the Tamil language were added.

Tamil_dict.txt
Tamil_corpus.txt

If you need more help, please refer to this issue:
JaidedAI/EasyOCR#39

Answer 11 · 2020-12-24T07:19:49.000Z

I can help with Turkish language.

Answer 12 · 2021-01-03T20:26:02.000Z

I can help with the Polish language.

Answer 13 · 2021-01-26T05:29:26.000Z

@GmGniap Hello, can you provide the corpus file for the Burmese language?

Answer 14 · 2021-01-26T06:36:58.000Z

@shahidul56 Hello, can you provide the corpus file for the Bangla language?

Answer 15 · 2021-01-26T10:08:41.000Z

None of the models updated on 2021.1.21 can be downloaded; they fail with the following error:
{ code: "NoSuchKey", message: "The specified key does not exist.", requestId: "aa1bfeff-f572-40aa-8935-6129b1533ed1" }

Answer 16 · 2021-01-27T08:49:54.000Z

> None of the models updated on 2021.1.21 can be downloaded; they fail with the following error:
> { code: "NoSuchKey", message: "The specified key does not exist.", requestId: "aa1bfeff-f572-40aa-8935-6129b1533ed1" }

Sorry about the invalid links. All of them have been fixed now; please try again.

Answer 17 · 2021-01-27T19:44:52.000Z

> I would be very happy if you did this for Vietnamese.

This seems to be ongoing in #1847.

Answer 18 · 2021-01-28T06:36:42.000Z

@redcinelli Thank you very much. The Vietnamese model is in training and will be available soon~

Answer 19 · 2021-01-28T07:06:24.000Z

model name	description	model size	download	Update Date
ch	Chinese and English	3.71M	inference model / trained model	2020.9.22
cht	Chinese Traditional	5.63M	inference model / trained model	2021.1.21
en	English	2.56M	inference model / trained model	2020.9.22
fr	French	2.65M	inference model / trained model	2021.9.22
ar	Arabic	2.53M	inference model / trained model	2021.1.21
xi	Spanish	2.53M	inference model / trained model	2021.1.21
pu	Portuguese	2.63M	inference model / trained model	2021.1.21
ru	Russian	2.63M	inference model / trained model	2021.1.21
ge	German	2.65M	inference model / trained model	2020.9.22
kr	Korean	3.9M	inference model / trained model	2020.9.22
jp	Japanese	4.23M	inference model / trained model	2020.9.22
it	Italian	2.53M	inference model / trained model	2021.1.21
hi	Hindi	2.63M	inference model / trained model	2021.1.21
ug	Uyghur	2.63M	inference model / trained model	2021.1.21
fa	Persian	2.63M	inference model / trained model	2021.1.21
ur	Urdu	2.63M	inference model / trained model	2021.1.21
rs	Serbian(latin)	2.53M	inference model / trained model	2021.1.21
oc	Occitan	2.53M	inference model / trained model	2021.1.21
mr	Marathi	2.63M	inference model / trained model	2021.1.21
ne	Nepali	2.63M	inference model / trained model	2021.1.21
rsc	Serbian(cyrillic)	2.63M	inference model / trained model	2021.1.21
bg	Bulgarian	2.63M	inference model / trained model	2021.1.21
uk	Ukrainian	2.63M	inference model / trained model	2021.1.21
be	Belarusian	2.63M	inference model / trained model	2021.1.21
te	Telugu	2.63M	inference model / trained model	2021.1.21
ka	Kannada	2.63M	inference model / trained model	2021.1.21
ta	Tamil	2.63M	inference model / trained model	2021.1.21
mg	Mongolian	--	Ongoing
bg	Bangla	--	Need dict and corpus
vi	Vietnamese	--	Need dict and corpus
bm	Burmese	--	Need dict and corpus
tk	Turkish	--	Need dict and corpus
po	Polish	--	Need dict and corpus
	More		TBC

@grasswolfs The model name for Turkish should be "tr" instead of "tk"; "tr" is the widely used abbreviation for Turkish.

Answer 20 · 2021-01-28T07:07:24.000Z

I have also opened a PR with the Turkish dict and corpus: #1856

Answer 21 · 2021-02-02T03:42:00.000Z

Thanks @habout632 for adding Southeast Asian languages via #1896

Answer 22 · 2021-03-16T14:22:56.000Z

Here is a dictionary for Greek.
el_dict.txt

Answer 23 · 2021-03-16T17:10:20.000Z

Hi, do we have a model that detects all English characters along with special characters such as . , " ( )?

Answer 24 · 2021-05-12T08:44:52.000Z

Hi, thank you for the great work! Will you add Traditional Chinese to the general model? Right now, the general model supports Simplified Chinese, English, and numbers.

Answer 25 · 2021-05-15T06:29:59.000Z

Hi, can we train on line data longer than a max_char_length of 50?
After training the rec model on data with 25-character and 50-character lengths, I found that the 25-character data gives lower loss and good accuracy, while the 50-character data gives higher loss and lower accuracy. Please find sample Devanagari data below:

train_img/0022_BindiyaKiAathmakatha_Img_300_Org_Page_0001_crop_9.jpg बीत गया । असमय के इस बुढ़ापे की देहली पर बैठी, मौत की
train_img/0022_BindiyaKiAathmakatha_Img_300_Org_Page_0001_crop_10.jpg प्रतीक्षा कर रही हूँ । पर लगाता है उसने भी सबों के साथ-साथ
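
To see how many labels exceed a given length limit before training, a minimal sketch such as the following may help. It assumes each label line is an image path followed by a tab and the text (falling back to the first space when no tab is present, as in the sample above); the parameter name `max_text_length` and both helper names are illustrative. Note that `len` counts Unicode code points, so Devanagari vowel signs and other combining marks each count toward the limit.

```python
def parse_label_line(line):
    """Split 'image_path<sep>text'; prefer a tab, else the first space."""
    line = line.rstrip("\n")
    sep = "\t" if "\t" in line else " "
    path, _, text = line.partition(sep)
    return path, text


def split_by_length(lines, max_text_length=25):
    """Partition label lines into (kept, too_long) by code-point count."""
    kept, too_long = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        path, text = parse_label_line(line)
        bucket = kept if len(text) <= max_text_length else too_long
        bucket.append((path, text))
    return kept, too_long
```

Running `split_by_length` with limits of 25 and 50 on the same label file shows how much of the data falls into each regime, which can help when comparing the two training runs.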

Answer 26 · 2021-06-24T13:39:55.000Z

After downloading the inference and trained models, how can I use them?
Can anyone point me to resources or code for testing and evaluating with these models?

Thanks
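
A minimal sketch of one possible way to run a downloaded inference model from Python. It assumes the 2.x `paddleocr` pip package, whose `PaddleOCR` class accepts `lang`, `use_angle_cls`, `det_model_dir`, `rec_model_dir`, and `rec_char_dict_path`; newer releases may name these differently. The helper names and any paths passed in are hypothetical.

```python
def build_ocr_kwargs(lang, det_model_dir=None, rec_model_dir=None,
                     rec_char_dict_path=None):
    """Assemble constructor arguments, including only the paths that were given."""
    kwargs = {"lang": lang, "use_angle_cls": True}
    optional = {
        "det_model_dir": det_model_dir,
        "rec_model_dir": rec_model_dir,
        "rec_char_dict_path": rec_char_dict_path,
    }
    kwargs.update({key: value for key, value in optional.items() if value is not None})
    return kwargs


def run_ocr(image_path, **kwargs):
    """Run detection and recognition on one image (requires `pip install paddleocr`)."""
    from paddleocr import PaddleOCR  # imported lazily so the helper above works without it

    engine = PaddleOCR(**kwargs)
    return engine.ocr(image_path, cls=True)
```

For example, `run_ocr("sample.jpg", **build_ocr_kwargs("ta", rec_model_dir="./ta_rec_infer"))` would use a locally unpacked recognition model, where both the image and the directory name are placeholders.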

Answer 27 · 2021-06-29T08:43:20.000Z

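Are there plans to develop a unified model that supports recognizing images in which text in multiple languages is mixed together? Thank you.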

Answer 28 · 2021-07-23T16:15:43.000Z

Traditional Mongolian 👀

Answer 29 · 2021-08-11T13:14:44.000Z

model name	description	model size	download	Update Date
ch	Chinese and English	3.71M	inference model / trained model	2020.9.22
ch_tra	Chinese Traditional	5.63M	inference model / trained model	2021.1.21
en	English	2.56M	inference model / trained model	2020.9.22
fr	French	2.65M	inference model / trained model	2021.9.22
ar	Arabic	2.53M	inference model / trained model	2021.1.21
es	Spanish	2.53M	inference model / trained model	2021.1.21
pt	Portuguese	2.63M	inference model / trained model	2021.1.21
ru	Russian	2.63M	inference model / trained model	2021.1.21
ge	German	2.65M	inference model / trained model	2020.9.22
kr	Korean	3.9M	inference model / trained model	2020.9.22
jp	Japanese	4.23M	inference model / trained model	2020.9.22
it	Italian	2.53M	inference model / trained model	2021.1.21
hi	Hindi	2.63M	inference model / trained model	2021.1.21
ug	Uyghur	2.63M	inference model / trained model	2021.1.21
fa	Persian	2.63M	inference model / trained model	2021.1.21
ur	Urdu	2.63M	inference model / trained model	2021.1.21
oc	Occitan	2.53M	inference model / trained model	2021.1.21
mr	Marathi	2.63M	inference model / trained model	2021.1.21
ne	Nepali	2.63M	inference model / trained model	2021.1.21
rs_cyrillic	Serbian(cyrillic)	2.63M	inference model / trained model	2021.1.21
rs_latin	Serbian(latin)	2.53M	inference model / trained model	2021.1.21
bg	Bulgarian	2.63M	inference model / trained model	2021.1.21
uk	Ukrainian	2.63M	inference model / trained model	2021.1.21
be	Belarusian	2.63M	inference model / trained model	2021.1.21
te	Telugu	2.63M	inference model / trained model	2021.1.21
kn	Kannada	2.63M	inference model / trained model	2021.1.21
ta	Tamil	2.63M	inference model / trained model	2021.1.21
mg	Mongolian	--	Ongoing
bg	Bangla	--	Need dict and corpus
vi	Vietnamese	--	Ongoing
bm	Burmese	--	Need dict and corpus
tr	Turkish	--	Need corpus
po	Polish	--	Need dict and corpus
	More		TBC

Hi, thank you for the great work!
I am sending you a corpus for Vietnamese; the file is attached below.
vietnamese_dict.txt. This file comes from the following research:
Download: https://github.com/VinAIResearch/dict-guided
You can evaluate on the VinText dataset, a scene text detection dataset for Vietnamese, which can be downloaded from GitHub.
Thank you.

@inproceedings{m_Nguyen-etal-CVPR21,
      author = {Nguyen Nguyen and Thu Nguyen and Vinh Tran and Triet Tran and Thanh Ngo and Thien Nguyen and Minh Hoai},
      title = {Dictionary-guided Scene Text Recognition},
      year = {2021},
      booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition (CVPR)},
    }

Answer 30 · 2021-08-16T20:40:31.000Z

Please add the Bangla language. Here are the dict and corpus:

dict
corpus

Answer 31 · 2021-09-05T09:10:57.000Z

Hi team,
Please update Vietnamese,
I'm very excited about this project,
Thanks very much

Answer 32 · 2021-09-23T09:54:28.000Z

@grasswolfs & @xmy0916
I have already shared the dict and corpus files for Bangla; please check. I have also attached them here.

bg_dict.txt
bg_corpus.txt

Answer 33 · 2021-10-20T18:31:42.000Z

Is there Malay language support? Malay is the main language of Malaysia.

Answer 34 · 2021-10-31T03:32:48.000Z

Suppose we have an image with text in multiple languages. How do you approach this problem? One way is to run an ensemble of all the language models and take the most confident result, but this turns out to be very inaccurate because of confidence miscalibration. Can't we train a single recognition model for all languages, or at least for several of them? I think it would be a very helpful model for applications where we don't know the language beforehand or where an image contains multiple languages.
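
The ensemble approach described above can be sketched as follows. This is only an illustration of the idea, not PaddleOCR code: `recognizers` is a hypothetical mapping from a language code to any callable that returns a `(text, confidence)` pair for an image.

```python
def pick_most_confident(candidates):
    """Given {lang: (text, confidence)}, return (lang, text, confidence) of the top score."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    best_lang = max(candidates, key=lambda lang: candidates[lang][1])
    text, confidence = candidates[best_lang]
    return best_lang, text, confidence


def recognize_multilingual(image, recognizers):
    """Run every per-language recognizer and keep the most confident result."""
    candidates = {lang: recognize(image) for lang, recognize in recognizers.items()}
    return pick_most_confident(candidates)
```

Because the scores of separately trained models are not directly comparable, a model that is overconfident on text it cannot actually read will win this comparison, which is the miscalibration problem the comment points out.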

Answer 35 · 2021-10-31T12:52:46.000Z

> Suppose we have an image with text in multiple languages. How do you approach this problem? One way is to run an ensemble of all the language models and take the most confident result, but this turns out to be very inaccurate because of confidence miscalibration. Can't we train a single recognition model for all languages, or at least for several of them? I think it would be a very helpful model for applications where we don't know the language beforehand or where an image contains multiple languages.

Strong upvote for this opinion.

Answer 36 · 2021-12-09T10:01:09.000Z

Hi @grasswolfs, thanks so much for all the work you've put in. I've included a PR for the Amharic language, which is spoken by over 60 million people.
#4882

One potential issue is that Amharic words take a number of prefixes and suffixes to indicate the object, number of items, tense, gender, negation, and so on. Thus, a single verb may change form in a number of ways that are not all included in the dictionary.

Answer 37 · 2021-12-09T10:02:14.000Z

Hi @grasswolfs, I also submitted a PR for the Tigrinya language, which is similar to Amharic and spoken by over 10 million people.
#4881

It has the same word-form issue as Amharic. Also, Arabic numerals are commonly used even though Tigrinya has its own numeral system.

Answer 38 · 2022-01-05T08:35:48.000Z

Hi @grasswolfs, I've submitted a PR for the Dutch language here: #5161

Answer 39 · 2022-01-10T04:02:13.000Z

Hi. Apart from the links above, are there any newer language models? I saw that, officially, more than 80 languages are supported.

Answer 40 · 2022-03-07T10:52:27.000Z

ug_dict.txt
uyghur_corpus.txt

Uyghur recognition is very poor, or the text is not recognized at all.
I hope you can improve the model. Thank you very much 🙏

Answer 41 · 2022-03-07T12:23:21.000Z

Hello, can Norwegian be recognized?

Answer 42 · 2022-03-07T12:33:21.000Z

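Hello, I hope Norwegian can be supported. I only found the folder from step 1 (ppocr/utils/dict) and could not find the folder from step 2 (ppocr/utils/corpus). I'm an OCR beginner; how do I add these two files?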

Answer 43 · 2022-03-23T03:03:14.000Z

> Hi. Apart from the links above, are there any newer language models? I saw that, officially, more than 80 languages are supported.

All multilingual models can be found here.

Answer 44 · 2022-04-08T09:05:28.000Z

Is there any tutorial on how I can train my own model from my own corpus and sample images?

Answer 45 · 2022-04-08T10:43:25.000Z

> Is there any tutorial on how I can train my own model from my own corpus and sample images?

Thanks for your interest; the multilingual model training tutorial will be released next week!

Answer 46 · 2022-04-16T01:57:28.000Z

Hi team,
Thank you so much for the great work. I'm very excited about the Vietnamese dict, corpus, and models. Could you please update the Vietnamese language soon?
Again, thank you so much and congrats on the great work.

Answer 47 · 2022-05-14T13:56:24.000Z

I can help with these languages:
Turkish -- tr
Azerbaijani -- az
Farsi -- fa
Afghani -- af

Answer 48 · 2022-05-16T05:58:15.000Z

How can I implement multi-language support, for example English, Urdu, and Tamil, in one PaddleOCR instance with Python?

Answer 49 · 2022-06-13T07:53:06.000Z

Please add the Thai language; much appreciated!

thai_dict.txt

thai_corpus.txt

Answer 50 · 2022-06-26T09:59:42.000Z

How about Lao characters?
I can help.

Answer 51 · 2022-09-16T07:50:56.000Z

> Please add the Thai language; much appreciated!
>
> thai_dict.txt
>
> thai_corpus.txt

Here you are:
Thai dictionary file:
https://raw.githubusercontent.com/JaidedAI/EasyOCR/master/easyocr/dict/th.txt
Thai corpus:
http://web-corpora.net/ThaiCorpus/texts_tagged.zip

Answer 52 · 2022-10-09T11:12:54.000Z

Please add Greek Language (Modern Greek Language)

Greek dictionary txt file with 144000 words
greek_dict.txt

Greek corpus txt file
Greek_corpus.txt

Answer 53 · 2022-10-14T00:39:47.000Z

Dear @D-DanielYang, I have added the Mongolian characters together with the following special characters:

, + - * / \ ? % _ . : ₮ " - № =

Pull request #7930

https://upload.wikimedia.org/wikipedia/commons/5/54/Mongolian_keyboard_win.png

Answer 54 · 2022-10-14T07:46:04.000Z

I have added the Vietnamese dict and corpus in #7933.

Please help with training the Vietnamese model; if you need more information, please let me know.

Answer 55 · 2022-10-25T07:22:05.000Z

@Evezerest @tink2123 @D-DanielYang Please add the Lithuanian language; it is urgently needed in both PaddleOCR and PPStructure.

https://en.wikipedia.org/wiki/Lithuanian_language

Answer 56 · 2022-11-09T11:15:16.000Z

Hi, I tried to run PaddleOCR on an image containing ← → ↑ ↓.
Everything is recognized correctly except the arrows and the text in the red font.

Kindly advise how to handle this.

Answer 57 · 2023-01-25T10:27:37.000Z

Please add Tajik Language
tajik_corpus.txt
tajik_dict.txt

Answer 58 · 2023-03-13T16:42:57.000Z

@fcakyon @D-DanielYang @xmy0916

I would like to contribute a Bangla dictionary and corpus. Can I do that?

I also have a few questions:

1. I could not clearly understand this line: "If your language has unique elements, please tell me in advance within any way, such as useful links, wikipedia and so on."
2. In the corpus of at least 50000 words, I am guessing all of them should be unique. Am I right?
3. Is there any particular category of corpus words, for example excluding stop words or something similar?
4. In the ppocr/utils path I cannot see any corpus directory.

Thanks in advance

Answer 59 · 2023-04-12T03:14:02.000Z

Please add Indonesian (id) and English (en) together.

Answer 60 · 2023-04-30T00:44:46.000Z

Do you have any plans for a Vietnamese release?

Answer 61 · 2023-05-19T18:58:58.000Z

Is it sufficient to change the file german_dict.txt if one wants to detect Fraktur, a historic German script, instead of the current script form? Should the dictionary that was learnt for the German language be the same? For Tesseract there is a trained file for Fraktur for running OCR on historic documents.

Answer 62 · 2023-06-03T14:26:29.000Z

Indonesian language support is needed, please.

Answer 63 · 2023-07-23T21:55:34.000Z

Hi, please add Bangla and English support. I have attached both files for Bangla:
bangla_dict.txt

bangla_corpus.txt

Answer 64 · 2023-07-27T19:50:47.000Z

Hi team. Great work on Paddle, it's an amazing OCR engine! Can we please have Hebrew support in multilanguage models ?

Thanks !

Answer 65 · 2023-07-29T13:34:02.000Z

Dear Team, thanks for your reply. I am from Bangladesh. I have already submitted both files (dict and corpus) for Bangla. I would appreciate it if you could add Bangla support. Thank you. Zahir

Answer 66 · 2023-08-11T08:58:54.000Z

Can you provide models for any ancient scripts?

Answer 67 · 2023-08-31T08:50:19.000Z

Truong

I'm trying it with my private data, but the results are very poor.

Answer 68 · 2023-09-21T07:45:09.000Z

Sorry for my stupid question, I am a novice at DL: what is the difference between an inference model and a trained model?

Answer 69 · 2024-01-03T02:34:54.000Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

Answer 70 · 2024-01-11T20:28:00.000Z

I created a PR for Bangla

Answer 71 · 2024-01-22T19:57:58.000Z

Does this list contain the latest models? If I want to fine-tune, for example, the German model, do I use the link on this page to download the pretrained model? If so, which yml file should I use? How do I know the architecture of these models?

Answer 72 · 2024-03-01T06:00:32.000Z

Please add Tajik Language
tajik_corpus.txt
tajik_dict.txt