tesseract-ocr/langdata

Add Indic numerals and missing punctuation to Arabic

mustafa0x opened this issue · 4 comments

Previously: #71 and tesseract-ocr/tessdata_best#11 (also contains a pertinent discussion on how well the different traineddata deal with these characters).

• Indic numerals: (٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩)
• Punctuation: (؛, ،, ﴿﴾)
• Also, a ligature very commonly found in Arabic texts: ﷺ
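
For anyone who wants to check whether a given traineddata actually emits these characters, here is a minimal sketch that lists the requested code points and reports which ones are absent from a piece of OCR output. The helper names (`REQUIRED`, `missing_from`, `describe`) are made up for illustration and are not part of Tesseract or langdata.

```python
import unicodedata

# Characters requested in this issue, written as escapes to avoid
# right-to-left display problems in editors.
ARABIC_INDIC_DIGITS = [chr(cp) for cp in range(0x0660, 0x066A)]  # U+0660..U+0669
PUNCTUATION = [
    "\u061B",  # Arabic semicolon
    "\u060C",  # Arabic comma
    "\uFD3F",  # ornate parenthesis (one of the pair)
    "\uFD3E",  # ornate parenthesis (the other of the pair)
]
LIGATURES = ["\uFDFA"]  # the single-code-point ligature mentioned above

REQUIRED = ARABIC_INDIC_DIGITS + PUNCTUATION + LIGATURES


def missing_from(text):
    """Return the requested characters that never occur in `text`."""
    present = set(text)
    return [ch for ch in REQUIRED if ch not in present]


def describe(chars):
    """Format characters as 'U+XXXX NAME' for readable reports."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in chars]


if __name__ == "__main__":
    # Hypothetical OCR output; replace with the text Tesseract produced.
    sample = "\u0663 \u060C \uFDFA"
    for line in describe(missing_from(sample)):
        print("missing:", line)
```

Running OCR on a test page that contains every character and passing the result to `missing_from` gives a quick way to compare the different traineddata files.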

If I can do this myself, please point me the way.

CC @Shreeshrii

Please see tesseract-ocr/tesseract#2263 (comment)
and test if the traineddata files linked there add all the required characters.

Is this fixed? I've tried the latest version and it didn't detect any Indic numerals.

@wewark you have to use the Arabic.traineddata file. It recognizes Arabic and English letters, as well as both Arabic-Indic and Arabic numerals.
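
A rough sketch of how that file could be selected from Python by calling the `tesseract` command line. It assumes Arabic.traineddata has been placed directly in the tessdata directory so that `-l Arabic` resolves to it; the function names, `page.png`, and the tessdata path are placeholders, not something from this thread.

```python
import subprocess


def build_tesseract_cmd(image_path, lang="Arabic", tessdata_dir=None):
    """Build a tesseract command that prints recognized text to stdout.

    `lang` is the traineddata name without the extension, so "Arabic"
    selects Arabic.traineddata and "ara" selects ara.traineddata.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir is not None:
        # Point tesseract at the directory holding the traineddata file.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image_path, lang="Arabic", tessdata_dir=None):
    """Run tesseract and return its text output (requires tesseract installed)."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, tessdata_dir),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths; adjust to your own image and tessdata location.
    print(run_ocr("page.png", lang="Arabic", tessdata_dir="/path/to/tessdata"))
```

Swapping `lang="Arabic"` for `lang="ara"` on the same image is an easy way to compare the two models on numerals.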

@ShroukMansour I use ara.traineddata, and the text is not recognized accurately; the numbers are not accurate either. Is there a solution for this?