tesseract-ocr/langdata

Language Request: Kurdish Sorani (Central Kurdish)

makwanbarzan opened this issue · 1 comments

There's already a trained data file for the Latin-script variety of Kurdish. Sorani is the second most widely used dialect of the language, and it'd be amazing to have a trained data file for it in Tesseract.

The script is similar to Persian, apart from a few letters such as ژ، گ، ڤ، چ، ۆ, so it shouldn't take much effort to develop.

Thank you, and I'm looking forward to a response.

All those characters are included in the script/Arabic model. Maybe that already works for Sorani text?
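
One way to try this is to run the script/Arabic model on a scanned Sorani page and check whether the letters listed above show up in the output. The following is only a rough sketch, not a tested workflow: it assumes the `tesseract` executable is on the PATH, that `script/Arabic.traineddata` is installed in the tessdata directory, and that the image path passed in is a sample that actually contains those letters. The helper names are made up for illustration.

```python
import shutil
import subprocess

# Letters called out in the request above.
SORANI_LETTERS = ["ژ", "گ", "ڤ", "چ", "ۆ"]


def build_command(image_path, lang="script/Arabic"):
    """Build a tesseract call that prints the recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def missing_letters(text, letters=SORANI_LETTERS):
    """Return the letters from `letters` that never appear in `text`."""
    return [ch for ch in letters if ch not in text]


def try_sorani(image_path):
    """Run OCR on a sample image (hypothetical path) and report unseen letters."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image_path),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout, missing_letters(result.stdout)
```

Calling `try_sorani("sorani_sample.png")` (a placeholder file name) returns the recognized text and the letters that never appeared in it. An empty list only shows the model can emit those letters; it says nothing about overall accuracy on Sorani text.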