tesseract-ocr/tesseract

Incorrect character bounding boxes

Environment

  • Tesseract Version: 4.1.1 / 5.0.0 α
  • Commit Number: 5.0.0-alpha-781-gb19e3ee
  • Platform: Mac OS X 10.9.5 (not one of the 3 most recent versions, but I have no reason to believe that the issue is related to my OS)

tesseract --version for both builds:

tesseract 4.1.1
 leptonica-1.80.0
  libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
tesseract 5.0.0-alpha-773-gd33ed
 leptonica-1.80.0
  libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found OpenMP 201307
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.72.0 OpenSSL/1.1.1g zlib/1.2.11 libidn2/2.3.0 libpsl/0.21.1 (+libidn2/2.3.0)

Current Behavior:

Character bounding boxes are unreliable: they sometimes capture (parts of) the previous character(s) or even miss their associated character completely.

Expected Behavior:

The bounding box should at least contain its associated character and overlap only in cases where the characters themselves overlap.

Suggested Fix:

Adjust how much bounding boxes can overlap. Maybe implement an option to force the x_min value of a character box to be no less than the x_max value of the previous bbox.
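
As a rough illustration of the clamping idea above, here is a minimal post-processing sketch. It is not Tesseract code; the tuple layout and the function name `clamp_to_previous` are assumptions made for this example, and the sample values are the first four boxes from the .box output further down.

```python
from typing import List, Tuple

# Hypothetical layout for this sketch: (symbol, x_min, y_min, x_max, y_max)
Box = Tuple[str, int, int, int, int]


def clamp_to_previous(boxes: List[Box]) -> List[Box]:
    """Raise x_min of each box to the x_max of the box before it, if they overlap."""
    clamped: List[Box] = []
    prev_x_max = None
    for symbol, x_min, y_min, x_max, y_max in boxes:
        if prev_x_max is not None and x_min < prev_x_max:
            # Never move x_min past this box's own x_max.
            x_min = min(prev_x_max, x_max)
        clamped.append((symbol, x_min, y_min, x_max, y_max))
        prev_x_max = x_max
    return clamped


word = [
    ("n", 50, 151, 65, 167),
    ("ø", 68, 150, 84, 167),
    ("y", 70, 144, 103, 171),
    ("t", 87, 144, 113, 171),
]
for box in clamp_to_previous(word):
    print(box)
```

This only removes horizontal overlap within one word. It would not by itself repair a box that sits entirely on the wrong character.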


Details

I ran Tesseract 4.1.1 (on Mac OS X 10.9.5, installed through MacPorts) on a scanned page (grayscale JPEG) using the following command:

tesseract INPUT.jpg OUTPUT -c hocr_char_boxes=1 -c tessedit_create_hocr=1 -l nor --oem 1 makebox

My plan was to write a script that extracts individual characters and sorts them by symbol (for individual processing). I therefore hoped to make use of the new hOCR character bounding box support introduced in 4.1.0, but quickly ran into problems: while the OCR result itself was near perfect, Tesseract sometimes produced unexpected character bounding boxes.

To investigate the issue, I wrote a quick Python script that uses .box files produced by Tesseract to extract the individual characters and assemble an image strip with the OCRed character printed below the character bounding box.
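
The script itself is not shown here, but the parsing step could look roughly like the sketch below. It assumes the usual .box layout of `symbol left bottom right top page` with the origin at the bottom-left corner of the image; the names `CharBox`, `parse_box_lines` and `to_crop_rect`, and the image height of 200 in the example, are made up for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple


class CharBox(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: Iterable[str]) -> List[CharBox]:
    """Parse .box lines, skipping lines that do not have six fields."""
    boxes = []
    for line in lines:
        # Split from the right so a multi-character symbol stays in one piece.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue
        left, bottom, right, top, page = (int(p) for p in parts[1:])
        boxes.append(CharBox(parts[0], left, bottom, right, top, page))
    return boxes


def to_crop_rect(box: CharBox, image_height: int) -> Tuple[int, int, int, int]:
    """Convert to (left, upper, right, lower) with a top-left origin."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)


boxes = parse_box_lines(["n 50 151 65 167 0", "ø 68 150 84 167 0"])
for b in boxes:
    print(b.symbol, to_crop_rect(b, image_height=200))
```

The flipped rectangle is in the order that image libraries with a top-left origin typically expect for cropping, for example Pillow's `Image.crop`.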

Consider the following sample (A):

A

Tesseract 4.1.1 produces the following .box file (truncated to the first three words):

n 50 151 65 167 0
ø 68 150 84 167 0
y 70 144 103 171 0
t 87 144 113 171 0
r 116 151 127 167 0
u 129 151 145 167 0
m 150 151 175 167 0
- 191 151 200 172 0
e 191 158 199 161 0
t 201 151 226 172 0
e 239 151 251 175 0
l 239 151 253 167 0
. 256 151 270 175 0
n 288 151 304 167 0
ø 306 150 323 168 0
y 309 145 337 172 0
t 325 145 341 167 0
r 342 151 352 172 0
e 355 151 365 167 0
t 366 151 391 172 0

From this my script produced the following image:

strip_A

There are several overlapping bboxes, some including (parts of) other characters and a few even missing their associated character completely.
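
The overlaps can also be listed directly from the .box data instead of being read off the image. The following is a small sketch under the same assumption about the field order (`symbol left bottom right top page`); the function name `horizontal_overlaps` is hypothetical, and the sample is the first word from the listing above.

```python
BOX_TEXT = """\
n 50 151 65 167 0
ø 68 150 84 167 0
y 70 144 103 171 0
t 87 144 113 171 0
r 116 151 127 167 0
u 129 151 145 167 0
m 150 151 175 167 0
"""


def horizontal_overlaps(box_text):
    """Return (previous symbol, symbol, shared width) for overlapping neighbours."""
    rows = []
    for line in box_text.splitlines():
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue
        rows.append((parts[0], int(parts[1]), int(parts[3])))  # symbol, left, right

    overlaps = []
    for (prev_sym, prev_left, prev_right), (sym, left, right) in zip(rows, rows[1:]):
        shared = min(prev_right, right) - max(prev_left, left)
        if shared > 0:
            overlaps.append((prev_sym, sym, shared))
    return overlaps


for prev_sym, sym, shared in horizontal_overlaps(BOX_TEXT):
    print(f"{prev_sym!r} and {sym!r} share {shared} px horizontally")
```

For this word it reports two overlapping pairs: ø/y (14 px) and y/t (16 px).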

Reading through the similar issue reports that I could find, I learned that the LSTM engine does not actually output bounding boxes, but rather a simple x coordinate per character and that Tesseract then tries to create a bounding box from it.

I assume that this explains why the bboxes sometimes extend past the character they belong to and in some cases even overlap with other bboxes. However, I do not see how that can make a bbox completely miss its associated character, even though it was correctly OCRed, as in these five cases:

strip_A+prob

When I extracted these three words and added some white background to produce this image … (B):

B

… the results also changed slightly (once again I have drawn red rectangles around the cases where a bbox captures the wrong character):

strip_B+prob

I have no idea why – all images in this test are saved in .png format, so compression artifacts should not be an issue here.

When I tried the legacy engine the bboxes were correct, but the accuracy dropped. From what I read that is expected due to how the legacy engine works (I assume that it is based on matching individual characters).

Since the release of 4.1.1, some improvements seem to have been made, but I was unable to find anything specific in the commit history, so it might be random. I compiled the latest revision (version 5.0.0 alpha) and ran the same commands as above. This time the following two images were produced:

For A:
strip_A_n

For B:
strip_B_n

The bboxes are more accurate than with version 4.1.1, but there are still problems (and they are the same as with 4.1.1). Summary of the problems:

  • When a character is affected, the error “accumulates” and subsequent characters are usually affected too. The algorithm will almost never “recover” before the word ends once it has started producing incorrect bboxes (exception: compare the last 4-5 characters in the two last images).
  • The worst case is that a bbox captures the previous character, often perfectly – so far I have not seen any cases where the bbox of character n contains parts of character n-2. If it completely misses character n, it will capture all of character n-1 and only that.
  • A chain of errors always ends on the (detected) word boundary.
  • The last character bbox in a word always captures its associated character plus any leftovers from the previous character(s) if they were affected by the problems.

Based on this I assume that the engine is identifying words, not characters, and subsequently attempts to split each identified word into separate characters. It looks like Tesseract does not check if a calculated character bbox is overlapping with other bboxes, but perhaps it should (or at least have an option to)?

Same here:

Mac OS + tesseract 5.0.0-alpha-773-gd33ed
Debian + tesseract 4.0.0
Debian + tesseract 5.0.0-alpha

harpocrates_1213_0367 improved

harpocrates_1213_0367 improved bw

# .box
H 0 0 162 162 0 # larger without reason
A 171 6 294 171 0 # y2 is bottom of image
R 302 12 407 160 0 # OK
P 403 9 480 160 0 # OK
O 482 12 572 156 0 # OK
C 581 12 677 148 0 # shifted to top; y1 should be 22, y2 158; takes y1 from O and adds the height?
R 698 0 798 171 0 # crazy
A 691 9 909 151 0 # x1 is smaller than x1 of R, so the box covers the left edge of R
T 885 15 1034 160 0 # ~OK
E 1046 12 1154 159 0 # OK
S 1168 12 1270 157 0 # OK

The image annotated with bounding boxes (red) and baseline (green):

harpocrates_1213_0367 boxes

Smells like an off-by-one error, as it takes values from the previous character: H starts with 0,0; A takes x1, y1 from the previous R, etc.

IMO: if you need accurate bounding boxes (at character level), you need to use the legacy engine
(e.g. tesseract INPUT.jpg OUTPUT -l nor --oem 0 makebox).
eng train exp1_boxed
There will still be some issues (e.g. with italic text), but the boxes are more accurate.

3105b_boxed

@zdenop Thx for the tip. I only played with --psm (page segmentation mode).

I wouldn't expect a different result, as A and T overlap (negative "kerning").

@zdenop Your --oem tip works great for highlighting the bounding boxes, e.g. using OpenCV. There, the bounding boxes are very exact. Unfortunately, I see huge differences between this box output and the searchable PDF that Tesseract produces. The character bounding boxes in the searchable PDF are too small, regardless of --oem. In contrast, the whitespace between words is very large, since the start of each word is always positioned correctly.
I am wondering about the differences between the .box file and the PDF. Is there anything I am missing, or is this a bug?

tesseract tesseract_example.jpg OUTPUT -l deu --oem 0 pdf
Tesseract Open Source OCR Engine v5.0.0-alpha.20201127 with Leptonica 
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 364

tesseract tesseract_example.jpg OUTPUT -l deu --oem 0 makebox
Tesseract Open Source OCR Engine v5.0.0-alpha.20201127 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 364

files:
example input jpg
output pdf

previews:
Bounding boxes from tesseract, added using openCV
grafik

Wrong character bounding boxes in tesseract pdf output. Highlighted word: eine
grafik

I am afraid you are expecting something from Tesseract that was never promised. A PDF viewer selects text, not its bounding box.

Tesseract creates a "text layer" with a glyphless font, and this will never fit the image background exactly: the width of "eine" differs depending on the font used (Helvetica/Times/Garamond…). Simply put: Tesseract cannot use the same font as the one in the image.

That's why each character has the same size in the pdf (the selection box of 'n' is the same as of 'i'). Thank you very much for your clarification.

rmast commented

I have done the same with another example and measured the issues.
I used
pdftohtml -c -hidden -xml outputbaserobert.pdf output.xml
to export the bounding boxes of the PDF text layer and compare them with the visual issues.
The widths I calculate from these bounding boxes roughly match the bounding boxes in a DjVu file from gscan2pdf, which shows the selection boxes correctly in WinDjView.
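
For anyone who wants to repeat the measurement, a sketch along these lines could read the boxes back from the XML. It assumes the export contains `<text>` elements with `top`, `left`, `width` and `height` attributes inside `<page>` elements; the sample document and the function name `text_boxes` are invented for illustration and are not taken from the actual output.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the assumed export format.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">eine</text>
<text top="100" left="95" width="22" height="12" font="0"><b>Box</b></text>
</page>
</pdf2xml>"""


def text_boxes(xml_text):
    """Return (text, x_min, y_min, x_max, y_max) for every <text> element."""
    root = ET.fromstring(xml_text)
    boxes = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            left = int(el.get("left"))
            top = int(el.get("top"))
            width = int(el.get("width"))
            height = int(el.get("height"))
            # itertext() also collects text nested in <b> or <i> children.
            text = "".join(el.itertext())
            boxes.append((text, left, top, left + width, top + height))
    return boxes


for text, x_min, y_min, x_max, y_max in text_boxes(SAMPLE_XML):
    print(text, x_max - x_min)
```

The printed widths can then be compared with the widths from a .box file or from another viewer's selection boxes.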

However, there is still a significant visual difference when viewing the text selection in the PDF. I'm afraid the issue is just as explained above, and the only way to get a match would be to embed a reconstructed font in the PDF, as Acrobat Pro does.