Incorrect character bounding boxes
Environment
- Tesseract Version: 4.1.1 / 5.0.0 α
- Commit Number: 5.0.0-alpha-781-gb19e3ee
- Platform: Mac OS X 10.9.5 (not one of the 3 most recent versions, but I have no reason to believe that the issue is related to my OS)
Output of tesseract --version for both builds:
tesseract 4.1.1
leptonica-1.80.0
libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
tesseract 5.0.0-alpha-773-gd33ed
leptonica-1.80.0
libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found OpenMP 201307
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Found libcurl/7.72.0 OpenSSL/1.1.1g zlib/1.2.11 libidn2/2.3.0 libpsl/0.21.1 (+libidn2/2.3.0)
Current Behavior:
Character bounding boxes are unreliable, sometimes capturing (parts of) the previous character(s) or even missing its associated character completely.
Expected Behavior:
The bounding box should at least contain its associated character and overlap only in cases where the characters themselves overlap.
Suggested Fix:
Adjust how much bounding boxes can overlap. Maybe implement an option to force the x_min value of a character box to be no less than the x_max value of the previous bbox.
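To illustrate the suggested option, here is a minimal sketch of the clamping rule as post-processing on .box data. It is not an existing Tesseract option; the function name `clamp_left_edges` and the tuple layout are my own assumptions. Note that it cannot repair a box that lies entirely on the previous character, so such boxes are left unchanged.

```python
from typing import List, Optional, Tuple

# (symbol, x_min, y_min, x_max, y_max), i.e. a .box line without the page number
Box = Tuple[str, int, int, int, int]


def clamp_left_edges(boxes: List[Box]) -> List[Box]:
    """Force each box's x_min to be no less than the previous box's x_max.

    A box is only clamped if it stays non-empty afterwards; a box that ends
    before the previous box ends is returned unchanged.
    """
    result: List[Box] = []
    prev_x_max: Optional[int] = None
    for sym, x_min, y_min, x_max, y_max in boxes:
        if prev_x_max is not None and x_min < prev_x_max < x_max:
            x_min = prev_x_max
        result.append((sym, x_min, y_min, x_max, y_max))
        prev_x_max = x_max  # compare against the original right edge
    return result
```

This sketch treats the input as one word; applying it per word would avoid clamping across word boundaries.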
Details
I ran Tesseract 4.1.1 (on Mac OS X 10.9.5, installed through MacPorts) on a scanned page (grayscale JPEG) using the following command:
tesseract INPUT.jpg OUTPUT -c hocr_char_boxes=1 -c tessedit_create_hocr=1 -l nor --oem 1 makebox
My plan was to write a script that extracts individual characters and sorts them by symbol (for individual processing). I therefore hoped to make use of the new hOCR character bounding box support introduced in 4.1.0, but quickly ran into problems: while the OCR result itself was near perfect, Tesseract sometimes produced unexpected character bounding boxes.
To investigate the issue, I wrote a quick Python script that uses .box files produced by Tesseract to extract the individual characters and assemble an image strip with the OCRed character printed below the character bounding box.
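For reference, the parsing step of such a script can be sketched roughly as follows. This is not the original script; the names `CharBox`, `parse_box_lines` and `to_crop_rect` are made up for illustration. It relies on the .box convention of one `symbol left bottom right top page` entry per line with the origin at the bottom-left corner of the image, so the y values have to be flipped before cropping with a library that uses a top-left origin (for example Pillow's `Image.crop`).

```python
from typing import List, NamedTuple, Tuple


class CharBox(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> List[CharBox]:
    """Parse the contents of a .box file into CharBox records."""
    boxes = []
    for line in text.splitlines():
        parts = line.rstrip().rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        symbol, *numbers = parts
        left, bottom, right, top, page = map(int, numbers)
        boxes.append(CharBox(symbol, left, bottom, right, top, page))
    return boxes


def to_crop_rect(box: CharBox, image_height: int) -> Tuple[int, int, int, int]:
    """Convert to (left, upper, right, lower) with a top-left origin."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)
```

The image height has to come from the scanned page itself; it is not stored in the .box file.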
Consider the following sample (A):
Tesseract 4.1.1 produces the following .box file (truncated to the first three words):
n 50 151 65 167 0
ø 68 150 84 167 0
y 70 144 103 171 0
t 87 144 113 171 0
r 116 151 127 167 0
u 129 151 145 167 0
m 150 151 175 167 0
- 191 151 200 172 0
e 191 158 199 161 0
t 201 151 226 172 0
e 239 151 251 175 0
l 239 151 253 167 0
. 256 151 270 175 0
n 288 151 304 167 0
ø 306 150 323 168 0
y 309 145 337 172 0
t 325 145 341 167 0
r 342 151 352 172 0
e 355 151 365 167 0
t 366 151 391 172 0
From this my script produced the following image:
There are several overlapping bboxes, some including (parts of) other characters and a few even missing their associated character completely.
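The horizontal overlaps can also be read directly off the numbers. A small sketch (the helper name `find_overlaps` is my own) that checks consecutive boxes, applied to the first seven entries of the .box listing above:

```python
def find_overlaps(boxes):
    """Return (previous_symbol, symbol, overlap_px) for each pair of
    consecutive boxes whose x ranges overlap."""
    overlaps = []
    for (p_sym, p_left, _, p_right, _), (sym, left, _, right, _) in zip(boxes, boxes[1:]):
        overlap = min(p_right, right) - max(p_left, left)
        if overlap > 0:
            overlaps.append((p_sym, sym, overlap))
    return overlaps


# (symbol, left, bottom, right, top) copied from the 4.1.1 .box output
first_word = [
    ("n", 50, 151, 65, 167),
    ("ø", 68, 150, 84, 167),
    ("y", 70, 144, 103, 171),
    ("t", 87, 144, 113, 171),
    ("r", 116, 151, 127, 167),
    ("u", 129, 151, 145, 167),
    ("m", 150, 151, 175, 167),
]

print(find_overlaps(first_word))
# [('ø', 'y', 14), ('y', 't', 16)]
```

A check like this only finds overlapping boxes. It cannot tell whether a box actually contains its own character, which still requires looking at the image.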
Reading through the similar issue reports that I could find, I learned that the LSTM engine does not actually output bounding boxes, but rather a simple x coordinate per character and that Tesseract then tries to create a bounding box from it.
I assume that this explains why the bboxes sometimes extend past the character they belong to and in some cases even overlap with other bboxes. However, I do not see how that can make a bbox completely miss its associated character, even though it was correctly OCRed, as in these five cases:
When I extracted these three words and added some white background to produce this image … (B):
… the results also changed slightly (once again I have drawn red rectangles around the cases where a bbox captures the wrong character):
I have no idea why – all images in this test are saved in .png format, so compression artifacts should not be an issue here.
When I tried the legacy engine the bboxes were correct, but the accuracy dropped. From what I read that is expected due to how the legacy engine works (I assume that it is based on matching individual characters).
Since the release of 4.1.1, some improvements seem to have been made, but I was unable to find anything specific in the commit history, so it might be random. I compiled the latest revision (version 5.0.0 alpha) and ran the same commands as above. This time the following two images were produced:
The bboxes are more accurate than with version 4.1.1, but there are still problems (and they are the same as with 4.1.1). Summary of the problems:
- When a character is affected, the error “accumulates” and subsequent characters are usually affected too. The algorithm will almost never “recover” before the word ends once it has started producing incorrect bboxes (exception: compare the last 4-5 characters in the two last images).
- The worst case is that a bbox captures the previous character, often perfectly – so far I have not seen any cases where the bbox of character n contains parts of character n-2. If it completely misses character n, it will capture all of character n-1 and only that.
- A chain of errors always ends on the (detected) word boundary.
- The last character bbox in a word always captures its associated character plus any leftovers from the previous character(s) if they were affected by the problems.
Based on this I assume that the engine is identifying words, not characters, and subsequently attempts to split each identified word into separate characters. It looks like Tesseract does not check if a calculated character bbox is overlapping with other bboxes, but perhaps it should (or at least have an option to)?
Same here:
Mac OS + tesseract 5.0.0-alpha-773-gd33ed
Debian + tesseract 4.0.0
Debian + tesseract 5.0.0-alpha
# .box
H 0 0 162 162 0 # larger without reason
A 171 6 294 171 0 # y2 is bottom of image
R 302 12 407 160 0 # OK
P 403 9 480 160 0 # OK
O 482 12 572 156 0 # OK
C 581 12 677 148 0 # shifted to top; y1 should be 22, y2 158; takes y1 from O and adds the height?
R 698 0 798 171 0 # crazy
A 691 9 909 151 0 # x1 is smaller than x1 of R, and the left edge of R
T 885 15 1034 160 0 # ~OK
E 1046 12 1154 159 0 # OK
S 1168 12 1270 157 0 # OK
The image annotated with bounding boxes (red) and baseline (green):
Smells like an off-by-one error, as it takes values from the previous character: H starts with 0,0; A takes x1, y1 from the previous R, etc.
@zdenop Thx for the tip. I only played with --psm (page segmentation mode).
I wouldn't expect another result as A and T overlap (negative "kerning").
@zdenop Your --oem tip works great for highlighting the bounding boxes, e.g. using OpenCV. There, the bounding boxes are very exact. Unfortunately, I note huge differences between this box output and the searchable PDF that tesseract actually produces. The character bounding boxes in the searchable PDF are too small, regardless of --oem. In contrast, the whitespace is very large, since the beginnings of the words are always valid.
I am wondering about the differences between the .box file and the PDF. Is there anything I am missing, or is this a bug?
tesseract tesseract_example.jpg OUTPUT -l deu --oem 0 pdf
Tesseract Open Source OCR Engine v5.0.0-alpha.20201127 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 364
tesseract tesseract_example.jpg OUTPUT -l deu --oem 0 makebox
Tesseract Open Source OCR Engine v5.0.0-alpha.20201127 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 364
files:
example input jpg
output pdf
previews:
Bounding boxes from tesseract, added using OpenCV
Wrong character bounding boxes in tesseract pdf output. Highlighted word: eine
I am afraid you are expecting something from tesseract that was never promised/expected. A PDF viewer selects text, not its bounding box.
Tesseract creates a „text layer“ with a glyphless font – this will never fit the image background exactly: the width of “eine” differs depending on the font used (Helvetica/Times/Garamond…). Simply put: tesseract cannot use the same font as the one used in the image.
That's why each character has the same size in the pdf (the selection box of 'n' is the same as of 'i'). Thank you very much for your clarification.
I have done the same with some other example, and measured the issues.
I used
pdftohtml -c -hidden -xml outputbaserobert.pdf output.xml
to export the bounding boxes of the PDF-textlayer to compare them to the visual issues.
When I calculate the widths of the bounding boxes, they largely match the bounding boxes in a DjVu file from gscan2pdf, which does show the selection boxes correctly in WinDjView.
However, there is still a significant visual difference when viewing the text selection in the PDF. I'm afraid the issue is just as clarified above, and the only way to get it to match would be to embed a reconstructed font in the PDF, as Acrobat Pro does.
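For anyone who wants to repeat the width measurement, reading the pdftohtml XML could look roughly like this. It is only a sketch: it assumes the export uses `<text>` elements with `top`, `left`, `width` and `height` attributes inside `<page>` elements, and the inline sample below is made up for illustration, not taken from my actual output.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed layout of a `pdftohtml -xml` export
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1263" width="892">
    <text top="100" left="50" width="120" height="20" font="0">eine</text>
    <text top="100" left="180" width="90" height="20" font="0">Zeile</text>
  </page>
</pdf2xml>"""


def text_boxes(xml_string):
    """Return (text, left, top, width, height) for every <text> element."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            boxes.append((
                "".join(node.itertext()),  # also collects text inside <b>/<i>
                int(node.get("left")),
                int(node.get("top")),
                int(node.get("width")),
                int(node.get("height")),
            ))
    return boxes


for text, left, top, width, height in text_boxes(SAMPLE_XML):
    print(f"{text!r}: left={left} width={width}")
```

For a real export, `ET.parse("output.xml").getroot()` would take the place of `ET.fromstring`. The coordinates are in the units pdftohtml reports, so they need scaling before they can be compared with pixel coordinates from a .box file.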