Getting mangled characters / mixed words

Question

MJCune1 opened this issue 3 years ago · 4 comments

Hi,

I’m trying to extract text with the latest version of the gem (2.4.2), but I’m getting mangled characters / mixed words. As a workaround I re-exported the file to PDF, and the new file is read correctly by the gem. It would be great to spare users this re-exporting step.

Result from the original file:

[2] pry(main)> reader.pages.first.text
=> "0     8.224          60.576    537.500     RODRIGUEZ CR. Z9.99NA -9L JHON\n\n"

If I export the file to PDF again, I get this result:

[4] pry(main)> reader.pages.first.text
=> "99.999.999-9 RODRIGUEZ PEREZ RONAL JHON      537.500        60.576        8.224"

I got a hint from the PDF’s general information in the OS: the encoding software for the first file is iText 4.2.0 by 1T3XT, and for the exported one it is macOS Version 11.3.1 (Build 20E241) Quartz PDFContext. I’m not sure yet whether that is related, as I’ve checked the encoding of both files and it’s UTF-8 after being processed.
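The producer can also be read from Ruby rather than from the OS file dialog. Below is a minimal sketch, assuming the value the OS shows as "encoding software" corresponds to the :Producer entry of the hash returned by reader.info. The helper names pdf_producer and inspect_pdf are made up for illustration and are not part of the gem.

```ruby
# Returns the Producer entry from a PDF info hash as a String,
# or "unknown" when the hash or the entry is missing.
# (Hypothetical helper, not part of pdf-reader.)
def pdf_producer(info)
  value = (info || {})[:Producer]
  value.nil? ? "unknown" : value.to_s
end

# Prints the producer and the first page's text for one PDF file.
# The pdf-reader gem is required lazily, so pdf_producer can be
# used on its own without the gem installed.
# (Hypothetical helper, not part of pdf-reader.)
def inspect_pdf(path)
  require "pdf-reader"
  reader = PDF::Reader.new(path)
  puts "#{path}: #{pdf_producer(reader.info)}"
  puts reader.pages.first.text
end

# Example call (the file names are placeholders):
#   inspect_pdf("original.pdf")
#   inspect_pdf("exported.pdf")
```

Running inspect_pdf on both files puts the producer next to the extracted text, which makes it easier to see whether the mangled output only shows up for files from one producer.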

MJCune1 commented 3 years ago

Thanks!

Answer 1 · 2021-05-31T23:21:23.000Z

Hi @MJCune1,

Unfortunately debugging the issue requires a copy of the PDF. Are you able to share it, possibly via email to me directly if it's sensitive? My personal email can be found on my website, via the URL on my GitHub profile.

Answer 2 · 2021-06-01T15:23:13.000Z

Thanks! I already sent you a copy to your email.

Answer 3 · 2021-06-02T11:34:00.000Z

Thanks @MJCune1. You're in luck - the sample PDF you provided looks like it's parsed correctly by the fix in #350 that I merged just a few days ago. Spooky timing 👻

I haven't published a release with that fix yet, but I hope to do so soon. Are you able to load the gem via git in the meantime with this in your Gemfile?

gem 'pdf-reader', github: 'yob/pdf-reader'