manisandro/gImageReader

Problem Recognizing Vertically Oriented Text

Closed this issue · 9 comments

Up to and including version 3.0.1, gImageReader could OCR vertically oriented text (e.g. text from old Chinese/Japanese novels, which were printed vertically). A bug seems to have been introduced in version 3.1, and I can no longer OCR such text: the program insists on recognizing it as horizontally oriented text. I would appreciate it if you could look into this problem, thanks.

Sorry, I mixed up the versions. The bug seems to have been introduced in version 3.0; the program was working properly in version 0.9.

Hello

There isn't anything I explicitly changed in this regard between 0.9 and 3.x. However, it should be possible to handle this correctly. Do I understand correctly that your issue is that recognizing vertical text returns one character per line, instead of the text being "flattened out" onto a single line?

I was trying to recognize an entire block of text in which the Chinese characters are arranged vertically, meaning the lines of text run vertically. I tried the Tesseract command line with the default "-psm 3" parameter and the text was recognized properly. Let's say the text looks like this (example 1):

A O
p r
p a
l n
e g
  e

Version 0.9 and the Tesseract command line return the correct result, as follows:
Apple
Orange

But with 3.x I get the vertical text back unchanged, as in (example 1) above.

Btw the vertical text in (example 1) should line up as 2 vertical lines...
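
To make the expected behaviour concrete, here is a minimal Python sketch of the "flattening" being discussed: it reads a block laid out like (example 1) column by column and returns one string per vertical line. The function name `flatten_vertical` and the fixed two-character column spacing are assumptions made for illustration only; this is not how gImageReader or Tesseract handle vertical text internally. It also reads columns left to right, as in the example, whereas traditional Chinese/Japanese vertical text is normally read right to left.

```python
def flatten_vertical(block: str, stride: int = 2) -> list[str]:
    """Read a block of vertically laid out text column by column.

    Each column is assumed to sit at a fixed character offset
    (0, stride, 2 * stride, ...), as in the example from this thread.
    """
    rows = [line for line in block.splitlines() if line.strip()]
    if not rows:
        return []
    # Pad every row to the same width so short rows do not raise IndexError.
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    # Join the characters found at each column offset, top to bottom.
    return [
        "".join(row[i] for row in padded).strip()
        for i in range(0, width, stride)
    ]


example = "A O\np r\np a\nl n\ne g\n  e"
print(flatten_vertical(example))  # ['Apple', 'Orange']
```

The last row is written with two leading spaces so that the final "e" stays in the second column.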

Ok I'll look into it.

Thanks. Attached is a sample text (in Chinese), in case it is of any help. The Tesseract command line to recognize it is:
tesseract text.jpg test -l chi_tra
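
For anyone who wants to reproduce the comparison from a script, the same call can be sketched in Python roughly as follows. The helper names `build_tesseract_cmd` and `run_ocr` are hypothetical, the file names are placeholders, and the sketch assumes the `tesseract` executable is on the PATH with the `chi_tra` language data installed. The page segmentation flag is spelled `-psm` as in this thread; newer Tesseract releases spell it `--psm`, so it is left configurable.

```python
import subprocess


def build_tesseract_cmd(image, output_base, lang="chi_tra",
                        psm=None, psm_flag="-psm"):
    """Build the argument list for a Tesseract command-line call."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if psm is not None:
        # Only pass a page segmentation mode when one was requested;
        # otherwise Tesseract falls back to its default.
        cmd += [psm_flag, str(psm)]
    return cmd


def run_ocr(image, output_base, **kwargs):
    """Run Tesseract; it writes the result to <output_base>.txt."""
    subprocess.run(build_tesseract_cmd(image, output_base, **kwargs), check=True)


# Only builds and prints the command; nothing is executed here.
print(build_tesseract_cmd("text.jpg", "test", psm=3))
```

Calling `run_ocr("text.jpg", "test")` would then execute the command and raise `subprocess.CalledProcessError` if Tesseract exits with an error.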

[Attached image: test1]

Hello

Is

他馬上精砷起來
_走到了廳中 。

果然不出所料'
很迷人。

the expected output for this?

Yes, the result is correct.

This will be fixed in the upcoming release (probably by the end of the month).