LeoFCardoso/pdf2pdfocr

Bad insertion text on PDF

FloLaco opened this issue · 3 comments

Hi

PDF source :
Module 3 - .v2.pdf

I'm trying to OCR the text in my PDF for personal use.
I've checked the generated TXT file, and it works (I can see the proper text).
But when I open the OCR'd PDF file and search for text, it does not work. If I copy/paste text from the PDF, I get weird text:

22tropnosseccaevitartsinimdatimrepdluowspuorgeerhtllA.puorgrevresnoitacilppaehtotylnonepo
erucesylhgihyolpednacuoy,msinahcemsihthtiW.krowtenetaroprocs’remotsucehtmorfylnotub .snoitacilppa
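
For readers trying to make sense of the pasted sample: each line appears to be the original sentence with its characters in reverse order and the spaces dropped. A minimal sketch to check this, assuming the `rev` utility (util-linux) is available; the variable names are only illustrative:

```shell
# First fragment of the text copied from the OCR'd PDF (taken from the sample above).
garbled='22tropnosseccaevitartsinimdatimrepdluowspuorgeerhtllA'

# Reverse the character order; rev works line by line.
restored="$(printf '%s\n' "$garbled" | rev)"

echo "$restored"
# prints: Allthreegroupswouldpermitadministrativeaccessonport22
```

This suggests the OCR text itself is correct and only its placement or direction in the PDF text layer is wrong.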

Hello @FloLaco thank you for the issue. I could reproduce.

I'll check it out. For now, you can try the "-f -g smart" flags as a workaround.

pdf2pdfocr -i "./Module 3 - .v2.pdf" -f -g smart

@FloLaco I opened an issue in the qpdf project, as this looks like a bug in that project.

I also used Ghostscript to try a "repair" of your source PDF file, as illustrated in https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file

Page 46

The following warnings were encountered at least once while processing this file:
	encountered more q than Q

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> PDFKit <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

After the repair, pdf2pdfocr (and qpdf) worked fine.
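
The exact Ghostscript options used for the repair are not stated in this thread. A minimal sketch, assuming the `pdfwrite` approach from the linked superuser answer; the output file name is hypothetical, and `-dPDFSETTINGS=/prepress` is optional:

```shell
# Hypothetical file names; adjust to your own paths.
src="Module 3 - .v2.pdf"
out="Module 3 - .v2.repaired.pdf"

# Only attempt the repair if Ghostscript is installed and the source file exists.
if command -v gs >/dev/null 2>&1 && [ -f "$src" ]; then
  # Re-write the PDF through the pdfwrite device; Ghostscript reports any
  # structural problems it repaired or ignored while doing so.
  gs -o "$out" -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress "$src"
  status="repaired"
else
  status="skipped"
fi
echo "$status"
```

The repaired file can then be passed to pdf2pdfocr with `-i` in place of the original.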

Please consider checking the structure of your source PDF file.

The qpdf bug was confirmed and fixed in version 11.3.0.
I'm closing this.
Thank you for reporting.
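
To check whether an installed qpdf already includes the fix, something like the following may help. This is a sketch, assuming that `qpdf --version` prints a first line of the form `qpdf version X.Y.Z` and that `sort -V` (GNU coreutils) is available; `version_ge` is a hypothetical helper, not part of qpdf or pdf2pdfocr:

```shell
# Hypothetical helper: succeeds if version $1 is greater than or equal to $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v qpdf >/dev/null 2>&1; then
  # Assumed output format: "qpdf version X.Y.Z" on the first line.
  installed="$(qpdf --version | head -n 1 | awk '{print $3}')"
  if version_ge "$installed" "11.3.0"; then
    echo "qpdf $installed should include the fix"
  else
    echo "qpdf $installed predates 11.3.0; consider upgrading"
  fi
else
  echo "qpdf not found"
fi
```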