Bad insertion text on PDF

Question

FloLaco opened this issue 2 years ago · 3 comments

Hi

I'm trying to OCR the text in my PDF for personal use.
I've checked the generated TXT file, and it's correct (I see the proper text).
But when I open the OCRed PDF file and search for text, it does not work. If I copy and paste text from the PDF, I get garbled text:

22tropnosseccaevitartsinimdatimrepdluowspuorgeerhtllA.puorgrevresnoitacilppaehtotylnonepo
erucesylhgihyolpednacuoy,msinahcemsihthtiW.krowtenetaroprocs’remotsucehtmorfylnotub .snoitacilppa
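
The pasted text looks like the original sentences with the characters in reverse order and most spaces dropped. A quick way to check this, assuming the `rev` utility (util-linux) is available, is to reverse the first pasted line:

```shell
# First line of the text copied from the PDF, verbatim.
pasted='22tropnosseccaevitartsinimdatimrepdluowspuorgeerhtllA.puorgrevresnoitacilppaehtotylnonepo'

# rev reverses the characters of each input line.
restored=$(printf '%s\n' "$pasted" | rev)
printf '%s\n' "$restored"
```

With word breaks restored, the result reads "open only to the application server group. All three groups would permit administrative access on port 22", so the OCR text itself is correct and only its order in the PDF text layer is wrong.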

Answer 1 · 2023-02-17T11:15:14.000Z

Hello @FloLaco thank you for the issue. I could reproduce.

I'll look into it. For now, you can try the "-f -g smart" flags as a workaround.

pdf2pdfocr -i "./Module 3 - .v2.pdf" -f -g smart

Answer 2 · 2023-02-18T13:07:06.000Z

@FloLaco I opened an issue in the qpdf project, as this looks like a bug there.

I also used Ghostscript to try a "repair" of your source PDF file, as illustrated in https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file

Page 46

The following warnings were encountered at least once while processing this file:
	encountered more q than Q

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> PDFKit <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

After the repair, pdf2pdfocr (and qpdf) worked fine.
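
A repair of this kind can be sketched as follows, assuming Ghostscript is installed as `gs`. The helper name `repair_pdf` and the file names are hypothetical, and the flags follow the general approach from the linked superuser question, not necessarily the exact command used here:

```shell
# Rewrite a PDF through Ghostscript's pdfwrite device, which regenerates
# the file structure and repairs or ignores many structural errors.
# $1: source PDF, $2: path for the repaired copy.
repair_pdf() {
    gs -o "$2" -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress "$1"
}
```

For example, `repair_pdf source.pdf repaired.pdf`, then run pdf2pdfocr on `repaired.pdf`. Ghostscript prints warnings like the ones above while it processes the file.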

Please consider checking the structure of your source PDF file.

Answer 3 · 2023-02-26T13:31:58.000Z

The qpdf bug was confirmed and is fixed in version 11.3.0.
I'm closing this.
Thank you for reporting.
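
To check whether an installed qpdf already includes the fix, here is a minimal sketch. It assumes GNU `sort -V` is available and that the first line of `qpdf --version` has the form "qpdf version X.Y.Z"; `version_at_least` is a hypothetical helper name:

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# sort -V orders version strings; if the smaller of the two is $2,
# then $1 is at least $2.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (requires qpdf on PATH):
#   installed=$(qpdf --version | awk 'NR==1 {print $3}')
#   version_at_least "$installed" 11.3.0 && echo "qpdf includes the fix"
```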