Selected text not visible in some readers.
GoogleCodeExporter commented
What steps will reproduce the problem?
1. Use tesseract to make a searchable PDF from a document image.
2. Open the PDF in evince, the default PDF reader in Ubuntu.
3. Click and drag on text to select it.
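Step 1 can be sketched as a small Python helper. It assumes the `tesseract` binary is on the PATH and that the `pdf` config name selects PDF output, as in `tesseract page.png output pdf`. The file names and the helper names `build_tesseract_pdf_command` and `make_searchable_pdf` are placeholders for illustration only.

```python
import shutil
import subprocess


def build_tesseract_pdf_command(image_path, output_base):
    """Return the argv for: tesseract <image> <output_base> pdf."""
    # Tesseract appends ".pdf" to output_base on its own.
    return ["tesseract", image_path, output_base, "pdf"]


def make_searchable_pdf(image_path, output_base):
    """Run tesseract and return the path of the PDF it is expected to write."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(build_tesseract_pdf_command(image_path, output_base), check=True)
    return output_base + ".pdf"
```

Calling `make_searchable_pdf("page.png", "output")` should leave an `output.pdf` that can then be opened in evince for steps 2 and 3.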
What is the expected output? What do you see instead?
Highlighted text is not visible. In the case of evince, it simply appears as a
black bar. Otherwise, the text can be searched, copied and pasted as usual.
Apparently this is not an issue with all viewers, but it affects both evince
and PDFview mode for Emacs. In both of these viewers, PDFs which were
OCR-ed with Acrobat work fine.
If the PDF is processed with Ghostscript, it also produces many warnings
like this: "GPL Ghostscript 9.10: Missing glyph CID=0, glyph=0076 in the font
GlyphLessFont . The output PDF may fail with some viewers."
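To get a feel for how many of these warnings a file triggers, the Ghostscript output can be tallied per font. The following is a minimal sketch that assumes the warning text matches the message quoted above. The pattern and the helper name `summarize_missing_glyphs` are illustrative, not part of Ghostscript or Tesseract.

```python
import re
from collections import Counter

# Pattern derived from the warning quoted in this report; other Ghostscript
# versions may word the message differently.
MISSING_GLYPH = re.compile(
    r"Missing glyph CID=(?P<cid>\d+), glyph=(?P<glyph>[0-9A-Fa-f]+) "
    r"in the font (?P<font>\S+)"
)


def summarize_missing_glyphs(log_text):
    """Count missing-glyph warnings per font name in captured Ghostscript output."""
    # Pasted logs may be wrapped across lines, so collapse all whitespace first.
    normalized = " ".join(log_text.split())
    return Counter(m.group("font") for m in MISSING_GLYPH.finditer(normalized))
```

Feeding it the captured stderr of a Ghostscript run gives a mapping from font name to warning count.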
What version of the product are you using? On what operating system?
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
This is considered up to date on Ubuntu 14.04.
Other info:
I am attaching an example PDF and a screenshot of what it looks like when text
is selected in Evince 13.10.3. This problem was previously reported as an issue
in the wrapper script pdfsandwich: https://sourceforge.net/p/pdfsandwich/bugs/6/
Original issue reported on code.google.com by brian....@gmail.com
on 16 Mar 2015 at 7:22
Attachments:
- [Screenshot from 2015-03-16 15:08:46.png](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-1434/comment-0/Screenshot from 2015-03-16 15:08:46.png)
- output.pdf
GoogleCodeExporter commented
Please see Comment 6 over here for further information:
http://bugs.ghostscript.com/show_bug.cgi?id=695869
Original comment by brian....@gmail.com
on 16 Mar 2015 at 2:14
GoogleCodeExporter commented
I am the author of this feature, and did much of my testing with evince. This
is known behaviour. The font itself is doubly invisible; it contains no glyphs
and it is drawn with "invisible ink". Evince is inverting the invisible font
and drawing a solid bar. Depending on your setup this will either be solid
black or perhaps solid orange. As you mentioned, other viewers including Adobe
Reader and the PDF viewer built into Chrome give good results.
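For readers unfamiliar with "invisible ink" in PDF terms: the PDF specification defines text rendering mode 3 (`3 Tr`), which neither fills nor strokes the glyphs. The sketch below builds a content-stream fragment in that mode. It is only an illustration, not Tesseract's actual output; the font resource name `F1`, the coordinates, and the helper names are assumptions.

```python
def escape_pdf_literal(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_fragment(text, x, y, font_size=12, font_name="F1"):
    """Build a text object that is selectable and searchable but never painted."""
    return "\n".join([
        "BT",                                  # begin text object
        "3 Tr",                                # rendering mode 3: no fill, no stroke
        f"/{font_name} {font_size} Tf",        # hypothetical font resource and size
        f"1 0 0 1 {x} {y} Tm",                 # place the text at (x, y)
        f"({escape_pdf_literal(text)}) Tj",    # show the string
        "ET",                                  # end text object
    ])
```

A viewer that inverts the glyph outlines when highlighting has nothing sensible to invert here, which is consistent with the solid bar described above.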
I view this as a deficiency in evince's handling of invisible fonts overlaid on
an image (which is a natural representation for OCR results). I do not believe
this is a problem with the PDF itself, or with the program that is generating it
(Tesseract).
Recommend filing a feature request with evince.
Original comment by breidenb...@gmail.com
on 20 Mar 2015 at 9:51
GoogleCodeExporter commented
I now see a lot of complaints about the embedded font on the Ghostscript bug, so
I am switching my attention over there.
Original comment by breidenb...@gmail.com
on 20 Mar 2015 at 9:54
GoogleCodeExporter commented
I've been reading along with the discussion over on the Ghostscript bug. While
most of it is way over my head, I take it that it could be a while before this
is resolved.
I wonder, would it be trivial to fix this issue in a temporary fork of
Tesseract without support for non-Latin characters? If so, I would definitely
be interested in using such a version in the meantime.
Original comment by brian....@gmail.com
on 26 Mar 2015 at 6:35
GoogleCodeExporter commented
Ray committed some code yesterday that seems to deal with this.
Original comment by joregan
on 13 May 2015 at 11:44
GoogleCodeExporter commented
Okay, so this update has nothing to do with Evince and
highlighting.
There was a compatibility problem with Ghostscript, though.
This is resolved in the current source tree. Credit
goes to Ken Sharp. He designed a new invisible font that
removes this compatibility problem. I lost my password to
the Ghostscript bug tracking system so I cannot report the
problem resolved there. Read all about it here:
https://code.google.com/p/tesseract-ocr/source/browse/api/pdfrenderer.cpp#19
PS. I vaguely remember that Ken said Ghostscript still has some issues
with certain documents, but that the Tesseract PDF files are now 100% valid
as far as he is concerned. So if there is any work left, it is on his side.
Original comment by breidenb...@gmail.com
on 10 Jun 2015 at 6:29