Getting unreadable data (UTF-8 squares 50% of the time)

Question

ScottOster opened this issue 4 years ago · 5 comments

Hi,

I am trying to use pdf-reader to extract text from PDFs and then perform some operations on it.

The problem is that my application is designed to work for anyone, and so far roughly 50% of the sample PDFs do not return any readable data at all, just squares.

My question is: Is this expected? Is there a fundamental reason why it is not possible to extract data from the majority of PDFs?

All samples used have been "openable" with Adobe and were generated with common methods (print to PDF, etc.).

Thanks in advance for any feedback. I don't mind contributing for the time :-)

Answer 1 · 2021-01-31T11:41:16.000Z

In my experience pdf-reader does a reasonable (but not perfect) text extraction from the majority of PDFs, but it does depend on the source files.

For the 50% where it doesn't work, are you able to copy and paste the text from another PDF tool (Acrobat, evince, Preview, Firefox, etc.) into Notepad? If you can, I'd consider the pdf-reader behaviour a bug, but if you can't then maybe it's an issue with the source PDFs.

As for what the bug is... I think I'd need to see a sample file. Are any of the files online and public?

Answer 2 · 2021-02-01T08:34:36.000Z

Hi James, thank you so much for the prompt response. I have just sampled 3 of the files that are returning squares, and the text in all three could be copied and pasted from the Adobe, Microsoft and evince readers. I have put this question to the developers. The code is not public, but I'd be more than happy to share it with you privately. Thanks again.

Answer 3 · 2021-02-03T00:43:02.000Z

659900.pdf
23781.pdf
4500067854.pdf

Here are some of the sample files used, the ultimate goal being to extract the delivery date.

Any insight greatly appreciated.

Answer 4 · 2021-02-04T12:42:02.000Z

Hi @scottybigo.

I downloaded all three files and tested text extraction with pdf-reader like this:

$ ruby -Ilib bin/pdf_text ~/downloads/4500067854.pdf
$ ruby -Ilib bin/pdf_text ~/downloads/23781.pdf
$ ruby -Ilib bin/pdf_text ~/downloads/659900.pdf

In all three cases text was printed to my terminal, so I don't think there's a fundamental incompatibility between these particular files and pdf-reader.

pdf_text looks something like this:

require 'pdf/reader'

pdf = PDF::Reader.new("file.pdf")
pdf.pages.each do |page|
  puts page.text
end

Does your code look similar? Could you post a reproduction script that results in little squares?
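
As a starting point for such a reproduction script, here is a minimal sketch that reports, per page, the string encoding and the share of characters that are placeholder squares. It assumes the squares are U+25AF or U+FFFD, which may not match what your output actually contains, so adjust PLACEHOLDERS as needed. The helper names (placeholder_ratio, placeholder_report) are made up for illustration and are not part of pdf-reader.

```ruby
# Sketch: measure how much of each page's extracted text is placeholder squares.
# Assumption: unmapped glyphs appear as U+25AF or U+FFFD; edit this list if your
# output uses a different character.
PLACEHOLDERS = ["\u25AF", "\uFFFD"].freeze

# Fraction of non-whitespace characters in +text+ that are placeholders.
def placeholder_ratio(text)
  chars = text.gsub(/\s+/, "").chars
  return 0.0 if chars.empty?

  chars.count { |c| PLACEHOLDERS.include?(c) }.fdiv(chars.size)
end

# +pages+ is any collection of objects responding to #text,
# for example PDF::Reader.new(path).pages.
def placeholder_report(pages)
  pages.each_with_index.map do |page, index|
    text = page.text
    { page: index + 1, encoding: text.encoding.name, ratio: placeholder_ratio(text) }
  end
end

# Usage with pdf-reader (requires the gem to be installed):
#   require 'pdf/reader'
#   placeholder_report(PDF::Reader.new("file.pdf").pages).each { |row| p row }
```

If the ratio is near zero here but squares still show up in your application, the problem is more likely in how the text is stored or displayed afterwards (terminal font, database encoding, and so on) than in the extraction itself.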

Answer 5 · 2021-02-07T11:11:25.000Z

Thanks again James, will have a look into it.

Much appreciated.