LanguageMachines/PICCL

Tesseract produces garbage output without warning

martinreynaert opened this issue · 13 comments

We found out the hard way that new versions (since 3.05, also 4) of Tesseract produce garbage without warning. This is because the location and file names of helper files and the config file have changed.

In 3.04 the command line was:

export TESSDATA_PREFIX="/usr/share/tesseract/tessdata/"; /usr/local/bin/tesseract $doc $hocrdir/$last -l nld /usr/share/tesseract/tessdata/tools/config.hocr

For 3.05 and 4 this should be like:

export TESSDATA_PREFIX="/roaming/tesseract/local/share/tessdata/langfiles"; /roaming/tesseract/local/bin/tesseract $doc $hocrdir/$last -l nld /roaming/tesseract/local/share/tessdata/configs/hocr

Note that each Linux distribution may be installing a different Tesseract version by default. LaMachine currently relies on the Linux distro for installing Tesseract.
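
Since, according to this report, Tesseract gives no warning when the config file named as its last argument cannot be found, a wrapper script could check for the file before running OCR. A minimal sketch, assuming a POSIX shell; `require_readable` is a hypothetical helper name, and the path in the usage comment is simply the 3.05/4 path quoted above:

```shell
#!/bin/sh
# Hypothetical helper: fail with a clear message when a file that
# Tesseract depends on (e.g. the hocr config) is missing or unreadable.
require_readable() {
    if [ ! -r "$1" ]; then
        echo "ERROR: required file not readable: $1" >&2
        return 1
    fi
}

# Usage (path from the 3.05/4 command line above; adjust to your install):
# require_readable /roaming/tesseract/local/share/tessdata/configs/hocr || exit 1
```

Failing early turns a silently wrong run into a visible error, at the cost of hard-coding a path that differs per installation.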

MRE

The current PICCL implementation does not set a $TESSDATA_PREFIX at all (neither does LaMachine). As you say, it indeed relies on the globally installed tesseract from the underlying distribution, so it seems quite capable of finding its own files. The $TESSDATA_PREFIX seems only needed if installed in a non-standard location.

You found garbage output on our production system right? That one runs Tesseract 3.04.01 (so doesn't qualify under the 'new versions' you described) and it seems to find its data files just fine (tesseract --list-langs properly outputs the installed languages). It seems the cause remains unclear unless your investigation led/leads to more details?
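
The `tesseract --list-langs` check mentioned above can be scripted so that a missing language is caught before any OCR runs. A minimal sketch, assuming a POSIX shell; `lang_installed` is a hypothetical helper that only inspects text on stdin, so it makes no assumption about where Tesseract is installed:

```shell
#!/bin/sh
# Hypothetical helper: reads a language list (one code per line) on stdin
# and succeeds only if the requested language appears as a whole line.
lang_installed() {
    grep -qx -- "$1"
}

# Usage: some Tesseract 3.x builds print the list on stderr, hence 2>&1.
# tesseract --list-langs 2>&1 | lang_installed nld || echo "nld not installed" >&2
```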

I ran this: https://webservices-lst.science.ru.nl/piccl/MREZandvoorde/output/error.log

It produced nothing but garbage.

I cannot find a Tesseract version mentioned in this error.log nor in the output files. I think that should be remedied.

Please check again if Tesseract finds its *config file. On previous occasions where this went wrong in Tilburg, it never raised a warning or error, either. We have solved the problem there by checking and setting right the last element of its command line. This problem is still very real @ru.

I checked my set-up again:

For 3.05, the language files are not located, as they are for 4, in:

/roaming/tesseract/local/share/tessdata/langfiles

but:

/roaming/tesseract/local/share/tessdata/

The config file location for both versions is as above, and differs from what it was for 3.04.

  • Whatever is wrong in the web application @ru: it looks every bit like the results we got before we sorted out that the right Tesseract found its hocr config-file in the right spot in Tilburg.

What we got was utter garbage.

> I cannot find a Tesseract version mentioned in this error.log nor in the output files. I think that should be remedied.

I added tesseract version output to the error output now. (still, to check all dependencies you'd need to consult the metadata LaMachine compiles)
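
How PICCL actually records the version is not shown in this thread. As an illustration only, a minimal sketch of writing the Tesseract version to the error output, assuming a POSIX shell with `awk`; `parse_tesseract_version` is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical helper: reads `tesseract --version` output on stdin and
# prints the version number from its first line (e.g. "tesseract 3.04.01").
parse_tesseract_version() {
    awk 'NR == 1 { print $2 }'
}

# Usage: 2>&1 because some 3.x builds print the version on stderr.
# echo "tesseract version: $(tesseract --version 2>&1 | parse_tesseract_version)" >&2
```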

> Please check again if Tesseract finds its *config file.
> Whatever is wrong in the web application @ru: it looks every bit like the results we got before we sorted out that the right Tesseract found its hocr config-file in the right spot in Tilburg.

I'm passing the contents of hocr.config directly to tesseract with -c. The file contained only one line, which enables hOCR XML output instead of the default (text?), and that seems to work.

I have now directly tested Tesseract on Ponyland mlp01 and can confirm that it works and delivers readable output, on the basis of 1 proven serviceable tif-file:

[mreynaert@applejack:/vol/tensusers/mreynaert]$ ls -l *tif
-rw-r--r-- 1 mreynaert mreynaert 2176262 Dec  5 11:32 Zandvoorde.005.tif
[mreynaert@applejack:/vol/tensusers/mreynaert]$ tesseract Zandvoorde.005.tif Zandvoorde.005.hocr -l nld
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Page 1
[mreynaert@applejack:/vol/tensusers/mreynaert]$ ls -l *hocr
-rw-r--r-- 1 mreynaert mreynaert 58251 Dec  5 11:27 Zandvoorde.010.hocr
[mreynaert@applejack:/vol/tensusers/mreynaert]$ cat Zandvoorde.010.hocr 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.05.00dev' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "-010.tif"; bbox 0 0 2328 3068; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 3 918 28 947">
    <p class='ocr_par' dir='ltr' id='par_1_1' title="bbox 3 918 28 947">
     <span class='ocr_line' id='line_1_1' title="bbox 3 918 28 947; textangle 90"><span class='ocrx_word' id='word_1_1' title='bbox 3 918 28 947; x_wconf 73' lang='nld' dir='ltr'><strong>D</strong></span> 
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_2' title="bbox 6 1747 54 1777">
    <p class='ocr_par' dir='ltr' id='par_1_2' title="bbox 6 1747 54 1777">
     <span class='ocr_line' id='line_1_2' title="bbox 6 1747 54 1777; baseline 0.021 -1"><span class='ocrx_word' id='word_1_2' title='bbox 6 1747 54 1777; x_wconf 75' lang='nld' dir='ltr'>D.</span> 
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_3' title="bbox 89 147 2327 877">
    <p class='ocr_par' dir='ltr' id='par_1_3' title="bbox 89 147 2324 499">
     <span class='ocr_line' id='line_1_3' title="bbox 94 147 2236 197; baseline -0.005 -6"><span class='ocrx_word' id='word_1_3' title='bbox 94 157 385 197; x_wconf 89' lang='nld' dir='ltr'>anders,dat</span> <span class='ocrx_word' id='word_1_4' title='bbox 419 158 564 189; x_wconf 92' lang='nld' dir='ltr'>moest</span> <span class='ocrx_word' id='word_1_5' title='bbox 600 156 652 188; x_wconf 91' lang='nld' dir='ltr'>al</span> <span class='ocrx_word' id='word_1_6' title='bbox 688 153 891 196; x_wconf 85' lang='nld' dir='ltr'>gekuist</span> <span class='ocrx_word' id='word_1_7' title='bbox 922 154 1099 185; x_wconf 90' lang='nld' dir='ltr'>worden</span> <span class='ocrx_word' id='word_1_8' title='bbox 1134 154 1219 184; x_wconf 94' lang='nld' dir='ltr'>met</span> <span class='ocrx_word' id='word_1_9' title='bbox 1255 148 1637 189; x_wconf 87' lang='nld' dir='ltr'>d&#39;hand‚hé,die</span> <span class='ocrx_word' id='word_1_10' title='bbox 1669 147 1818 180; x_wconf 94' lang='nld' dir='ltr'>witte</span> <span class='ocrx_word' id='word_1_11' title='bbox 1855 148 2116 181; x_wconf 91' lang='nld' dir='ltr'>vruchten.</span> <span class='ocrx_word' id='word_1_12' title='bbox 2159 151 2236 187; x_wconf 95' lang='nld' dir='ltr'>Ho,</span> 
     </span>

Note that this was outside LaMachine.

Something else in the PICCL work flow must be wrong. There, we get garbage output only. Note we do get output.

Thanks, that's a useful test.

> Something else in the PICCL work flow must be wrong. There, we get garbage output only. Note we do get output.

You tested PICCL on the very same input TIF you mean right? Then we would indeed have proof something goes wrong in the workflow.

> Note that this was outside LaMachine.

That's ok, shouldn't be a factor as it's the same tesseract.

Output from PICCL on this image is indeed confirmed to be garbage:

$ ocr.nf --inputdir tif_input --inputtype tif --language nld
N E X T F L O W  ~  version 18.10.1
Launching `/vol/customopt/lamachine16.dev/bin/ocr.nf` [sick_bell] - revision: 4b4f1d0a65
--------------------------
OCR Pipeline
--------------------------
[warm up] executor > local
[8d/de6401] Submitted process > tesseract (1)
[58/b78b79] Submitted process > ocrpages_to_foliapages (1)
[32/b8999b] Submitted process > foliacat (1)
$ head ocr_output/Zand*xml

FoLiA excerpt:

    <p xml:id="FH-Zandvoorde.005.tif.text.par_1_1">
      <t class="OCR">‚'_-I'l'I-r'I-l'l'l'l'l-Ifl'lll'I-I'I'I'ln'l'ì—II-fi-I. h—wmu-fw-uu_hqh__.lu- - f-1-l.r.lrirlll-lr-.H_u.lnup 1.Jl. _. ‚. ""-"I</t>

hOCR excerpt:

    <p class='ocr_par' dir='ltr' id='par_1_4' title="bbox 80 392 1176 424">
     <span class='ocr_line' id='line_1_7' title="bbox 80 392 1176 424; baseline 0 -8; x_size 32; x_descenders 8; x_ascenders 8"><span class='ocrx_word' id='word_1_15' title='bbox 80 392 552 416; x_wconf 57' lang='nld' dir='ltr'><strong>&quot;_—I—rLIHJI|-r&#39;l</strong></span> <span class='ocrx_word' id='word_1_16' title='bbox 584 400 1176 424; x_wconf 60' lang='nld' dir='ltr'><strong>&#39;F&#39;fle-PIIII-‘I-úlr</strong></span> 
     </span>
    </p>

Invocation was:

tesseract "Zandvoorde.005.tif" "Zandvoorde.005" -c "tessedit_create_hocr=T" -l "nld"

@martinreynaert Wait, your test report made no sense and put me on the wrong path:

Your input file is Zandvoorde.005.tif and the output you showed is Zandvoorde.010.hocr (that's a different one which also states to be produced by a DIFFERENT tesseract version than the one you ran!). The output also has an older modification timestamp than the input! In addition, there is a file Zandvoorde.005.hocr.txt which is the true tesseract output and contains the same garbage.
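
Both mismatches described above (engine version and modification time) can be checked mechanically before trusting a test result. A minimal sketch, assuming an hOCR header like the one pasted earlier in this thread; both helper names are hypothetical, and `-nt` is a widely supported extension to `test` (bash, dash, ksh) rather than strict POSIX:

```shell
#!/bin/sh
# Hypothetical helper: print the engine version recorded in an hOCR file's
# <meta name='ocr-system' content='tesseract X.Y.Z' /> header line.
hocr_engine_version() {
    sed -n "s/.*name='ocr-system' content='tesseract \([^']*\)'.*/\1/p" "$1" | head -n 1
}

# Hypothetical helper: succeed only if the output ($2) is newer than the input ($1).
output_is_fresh() {
    [ "$2" -nt "$1" ]
}

# Usage:
# hocr_engine_version Zandvoorde.010.hocr
# output_is_fresh Zandvoorde.005.tif Zandvoorde.010.hocr || echo "stale output" >&2
```

Comparing the printed version with what `tesseract --version` reports shows whether the file was produced by the binary that was just run.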

Conclusion: your test confirms the exact opposite of what you stated. It's not PICCL after all, and garbage output also occurs with 3.04.01 (contrary to what you state in the title). The question remains what causes this and whether it can be solved.

Your tif image even appears to be an empty white page? I don't see anything on it at all. Do you?

(md5sum Zandvoorde.005.tif = 2921234e4d0b89ee027fc4398ce7a281)

Also, this seems a probable explanation for the gibberish in most circumstances: http://kiirani.com/2013/03/22/tesseract-pdf.html

Summary: Simple lack of resolution/sharpness

New and sane test input: https://download.anaproy.nl/philips1.pdf on PICCL prior to the above fix, confirmed garbage output (except for the header which is in a larger font, which supports the assessment in the blog post).

Unfortunately the above fix seems insufficient, there is still a fair amount of garbage with Tesseract 3.04 after the fix. With tesseract 4 things are much better though.

(Closing/expiring this issue. I think we have the benefit of time passing here, as most distros are probably on Tesseract 4 now and we can forget about 3 altogether.)