manisandro/gImageReader

Vertical writing systems are not handled correctly in gImageReader


Vertical writing systems can be OCRed (fairly) reliably with the tesseract command-line tool, but gImageReader produces garbled characters for them by default. Horizontal writing systems are not affected.

Here are some sample images (in chi_sim, jpn, chi_sim_vert, jpn_vert respectively):

chi_sim
jpn
chi_sim_vert
jpn_vert

Here are the results using tesseract:

tesseract

(縦組み is not OCRed correctly, but that is not a big problem.)
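For anyone who wants to reproduce the command-line result, here is a minimal sketch of one way to call the tesseract CLI from Python. The file name `jpn_vert.png` and the helper names `build_tesseract_cmd` and `run_tesseract` are placeholders made up for illustration, not the files or commands actually used for this report. The only CLI features assumed are the `stdout` output base and the `-l` language option.

```python
import subprocess


def build_tesseract_cmd(image_path, lang):
    """Build the argument list for one tesseract CLI run that prints to stdout."""
    # "stdout" as the output base makes tesseract print the recognized text
    # instead of writing <base>.txt; "-l" selects the traineddata to use.
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_tesseract(image_path, lang):
    """Run tesseract and return the recognized text (needs tesseract on PATH)."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout


# Hypothetical usage; the file name is an assumption, not taken from this report:
# print(run_tesseract("jpn_vert.png", "jpn_vert"))
```

Running the same helper with `jpn` and `jpn_vert` on the same image should make it easy to compare against what gImageReader shows for that language.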

Here is the result using gImageReader (taking jpn_vert as an example):

gimagereader

I noticed that after rotating the image 90° counterclockwise, the result is correct:

gimagereader_rot

(and 縦組み is OCRed correctly!)

This problem was already reported in Issue #552, but there it was mistakenly regarded as a bug in tessdata. Since the tesseract command-line tool handles these images correctly, it is definitely gImageReader's fault.

I'm using gImageReader 3.4.2 and tesseract 5.4.1 under Arch Linux, with the default tessdata provided by tesseract. I noticed that the "About" dialog in gImageReader reports tesseract 5.3.4 rather than 5.4.1, so this version mismatch might have something to do with the problem.
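To double-check the version mismatch, a small sketch like the following could compare what the tesseract CLI reports with the number shown in the "About" dialog. The helper names are made up for illustration, and the dialog value has to be typed in by hand. This only shows which version the CLI reports; it does not prove which libtesseract gImageReader is actually linked against.

```python
import re
import subprocess


def parse_tesseract_version(version_output):
    """Return (major, minor, patch) from a 'tesseract X.Y.Z' line, or None."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_cli_version():
    """Ask the tesseract CLI for its version (needs tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Depending on the release, the banner goes to stdout or stderr,
    # so both streams are searched.
    return parse_tesseract_version(result.stdout + result.stderr)


# Hypothetical usage; (5, 3, 4) is the value read off the "About" dialog:
# print(installed_cli_version() == (5, 3, 4))
```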