gxrxrdx/tesseract-ocr

Japanese characters recognition - incorrect output for some characters

Closed this issue · 2 comments

What steps will reproduce the problem?
1. Run tesseract on the attached files below.

What is the expected output? What do you see instead?
No errors in the OCR output file, i.e. every character in the image is read correctly.

What version of the product are you using? On what operating system?
Tesseract 3.02.02
OS: Windows 7


Please provide any additional information below.

A few hiragana characters are each read as two characters instead of one.
For instance:

1. ぽ read as ほま

2. ぷ read as ふて

3. ぶ read as ふご
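
One way to look at the three cases above: in each, the first output character is the unvoiced base of the expected character (ほ for ぽ, ふ for ぷ and ぶ). The sketch below uses Unicode NFD decomposition to split each expected kana into its base and its voicing mark and checks that against the OCR output. It only describes the pattern in these examples; reading it as "the dakuten/handakuten is being recognized as a separate glyph" is an assumption, not a confirmed diagnosis. The helper name `base_and_mark` is made up for illustration.

```python
import unicodedata

# (expected character, OCR output) pairs copied from the report above
misreads = [("ぽ", "ほま"), ("ぷ", "ふて"), ("ぶ", "ふご")]

def base_and_mark(ch):
    """Split a kana into (base character, combining voicing mark or '')."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0], decomposed[1:]

for expected, got in misreads:
    base, mark = base_and_mark(expected)
    mark_name = unicodedata.name(mark) if mark else "none"
    # True when the first OCR character is the unvoiced base of the expected kana
    base_matches = got[0] == base
    print(f"{expected}: base={base} mark={mark_name} ocr={got} base_matches={base_matches}")
```

If the base matches for every affected character, it may be worth checking how the mark and the base are boxed in the training data, although that is a guess on my part.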

I have created traineddata for hiragana characters only, just to begin with. I 
would like to find a solution to this problem before I start on kanji.
Thanks for your time and support.

Original issue reported on code.google.com by sivakuma...@gmail.com on 22 Jan 2015 at 5:11

Attachments:

Attaching the result file

Original comment by sivakuma...@gmail.com on 22 Jan 2015 at 5:13

Attachments:

I am sorry, but we provide support only for language data files released by 
this project (i.e. not for custom training). hir.traineddata was not 
created or released by tesseract-ocr.

Original comment by zde...@gmail.com on 7 Feb 2015 at 7:51

  • Changed state: WontFix