gxrxrdx/tesseract-ocr

Selected text not visible in some readers.

What steps will reproduce the problem?
1. Use tesseract to make a searchable PDF from a document image.
2. Open the PDF in evince, the default PDF reader in Ubuntu.
3. Click and drag on text to select it.
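
Step 1 can be sketched in Python roughly as follows. This assumes the `tesseract` binary is on the PATH and that the build ships the `pdf` output config; the helper names and file names are hypothetical, chosen only for illustration.

```python
import subprocess


def build_tesseract_pdf_command(image_path, output_base):
    """Return the argv that asks Tesseract to write <output_base>.pdf."""
    # The trailing "pdf" selects Tesseract's PDF output config.
    return ["tesseract", image_path, output_base, "pdf"]


def make_searchable_pdf(image_path, output_base):
    """Run Tesseract and return the path of the PDF it should have written.

    Raises subprocess.CalledProcessError if Tesseract exits non-zero.
    """
    subprocess.run(build_tesseract_pdf_command(image_path, output_base), check=True)
    return output_base + ".pdf"
```

Calling `make_searchable_pdf("scan.png", "scan")` (hypothetical file names) should leave a `scan.pdf` that can then be opened in evince for steps 2 and 3.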

What is the expected output? What do you see instead?

Highlighted text is not visible. In the case of evince, it simply appears as a 
black bar. Otherwise, the text can be searched, copied and pasted as usual. 

Apparently this does not affect all viewers, but it does affect both evince 
and the PDFview mode for Emacs. In both of these viewers, PDFs that were 
OCR-ed with Acrobat work fine. 

If the PDF is processed with Ghostscript, it also produces many warnings 
like this: "GPL Ghostscript 9.10: Missing glyph CID=0, glyph=0076 in the font 
GlyphLessFont . The output PDF may fail with some viewers."
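
To get a sense of how many of these warnings a file triggers, the Ghostscript output can be tallied with a small script. This is only a sketch: it assumes the warnings follow the wording quoted above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the warning text quoted in this report; other
# Ghostscript versions may word it differently.
MISSING_GLYPH = re.compile(
    r"Missing glyph CID=(\d+),\s+glyph=([0-9A-Fa-f]+)\s+in the font\s+(\S+)"
)


def count_missing_glyphs(log_text):
    """Count missing-glyph warnings per font name in captured Ghostscript output."""
    counts = Counter()
    for match in MISSING_GLYPH.finditer(log_text):
        counts[match.group(3)] += 1
    return counts
```

Feeding it the captured stderr of a Ghostscript run gives a per-font count, which makes it easy to compare PDFs produced by different Tesseract builds.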

What version of the product are you using? On what operating system?

tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

This is considered up to date on Ubuntu 14.04.

Other info:

I am attaching an example PDF and a screenshot of what it looks like when text 
is selected in Evince 13.10.3. This problem was previously reported as an issue 
in the wrapper script pdfsandwich: https://sourceforge.net/p/pdfsandwich/bugs/6/

Original issue reported on code.google.com by brian....@gmail.com on 16 Mar 2015 at 7:22

Please see Comment 6 over here for further information: 
http://bugs.ghostscript.com/show_bug.cgi?id=695869

Original comment by brian....@gmail.com on 16 Mar 2015 at 2:14

I am the author of this feature, and did much of my testing with evince. This 
is known behaviour. The font itself is doubly invisible; it contains no glyphs 
and it is drawn with "invisible ink". Evince is inverting the invisible font 
and drawing a solid bar. Depending on your setup this will either be solid 
black or perhaps solid orange. As you mentioned, other viewers including Adobe 
Reader and the PDF viewer built into Chrome give good results.
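
In PDF terms, "invisible ink" normally means text rendering mode 3 (the `3 Tr` operator), which neither fills nor strokes the glyphs. Assuming that is the mode used here, and assuming the page content stream has already been extracted and decompressed, a rough check for it might look like the sketch below. The function name is hypothetical and the regular expression is a heuristic, not a PDF parser.

```python
import re

# Matches the operand/operator pair "3 Tr" (text rendering mode 3).
# The lookbehind rejects operands such as "13"; the lookahead rejects
# longer tokens that merely start with "Tr".
INVISIBLE_TEXT_MODE = re.compile(rb"(?<![\d.])3\s+Tr(?![A-Za-z*'\"])")


def uses_invisible_text(content_stream):
    """Return True if a decoded content stream sets text rendering mode 3."""
    return INVISIBLE_TEXT_MODE.search(content_stream) is not None
```

A viewer that ignores this mode when drawing the selection highlight would paint over the glyph boxes, which is consistent with the solid bar described above.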

I view this as a deficiency in how evince handles invisible fonts overlaid on 
an image (which is a natural representation for OCR results). I do not believe 
this is a problem with the PDF itself, or with the program that generated it 
(Tesseract).

Recommend filing a feature request with evince.

Original comment by breidenb...@gmail.com on 20 Mar 2015 at 9:51

I now see a lot of complaints about the embedded font on the Ghostscript bug, so 
I am switching my attention over there.

Original comment by breidenb...@gmail.com on 20 Mar 2015 at 9:54

I've been reading along with the discussion over on the Ghostscript bug. While 
most of it is way over my head, I take it that it could be a while before this 
is resolved. 

I wonder, would it be trivial to fix this issue in a temporary fork of 
Tesseract without support for non-Latin characters? If so, I would definitely 
be interested in using such a version in the meantime.

Original comment by brian....@gmail.com on 26 Mar 2015 at 6:35

Ray committed some code yesterday that seems to deal with this.

Original comment by joregan on 13 May 2015 at 11:44

Okay, so this update has nothing to do with Evince and 
highlighting.

There was a compatibility problem with Ghostscript, though.
This is resolved in the current source tree. Credit 
goes to Ken Sharp. He designed a new invisible font that 
removes this compatibility problem. I lost my password to 
the ghostscript bug tracking system so I cannot report the 
problem resolved there. Read all about it here.

https://code.google.com/p/tesseract-ocr/source/browse/api/pdfrenderer.cpp#19

PS. I vaguely remember that Ken said Ghostscript still has some issues
with certain documents, but that the Tesseract PDF files are now 100% valid
as far as he is concerned. So if there is any work left, it is on his side.

Original comment by breidenb...@gmail.com on 10 Jun 2015 at 6:29