yob/pdf-reader

Getting unreadable data (UTF-8 squares 50% of the time)

ScottOster opened this issue · 5 comments

Hi,

I am trying to use PDF reader to extract text from PDFs and then of course perform some operations on it.

The problem is that my application is designed to work for anyone, and so far roughly 50% of the sample PDFs do not return any readable data at all, just squares.

My question is: Is this expected? Is there a fundamental reason why it is not possible to extract data from the majority of PDFs?

All samples used can be opened in Adobe and were generated with common tools such as print to PDF.

Thanks in advance for any feedback. I don't mind contributing for the time :-)

yob commented

In my experience pdf-reader does a reasonable (but not perfect) text extraction from the majority of PDFs, but it does depend on the source files.

For the 50% where it doesn't work, are you able to copy and paste the text from another PDF tool (Acrobat, Evince, Preview, Firefox, etc.) into Notepad? If you can, I'd consider the pdf-reader behaviour a bug, but if you can't, then maybe it's an issue with the source PDFs.

As for what the bug is... I think I'd need to see a sample file. Are any of the files online and public?

659900.pdf
23781.pdf
4500067854.pdf

Here are some of the sample files used; the ultimate goal is to extract the delivery date.

Any insight is greatly appreciated.

yob commented

Hi @scottybigo.

I downloaded all three files and tested text extraction with pdf-reader like this:

$ ruby -Ilib bin/pdf_text ~/downloads/4500067854.pdf
$ ruby -Ilib bin/pdf_text ~/downloads/23781.pdf
$ ruby -Ilib bin/pdf_text ~/downloads/659900.pdf

In all three cases text was printed to my terminal, so I don't think there's a fundamental incompatibility between these particular files and pdf-reader.

pdf_text looks something like this:

require 'pdf/reader'

pdf = PDF::Reader.new("file.pdf")
pdf.pages.each do |page|
  puts page.text
end

Does your code look similar? Could you post a reproduction script that results in little squares?
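If it helps, a reproduction script could be as small as the sketch below, which reports per page how much of the extracted text consists of "square" characters. It uses only `PDF::Reader.new`, `pages`, `page.number` and `page.text`. The names `SUSPECT_CHARS`, `suspect_ratio` and `report` are hypothetical helpers made up for this example, and treating U+25AF and U+FFFD as the squares is an assumption; the characters you actually see may differ.

```ruby
# Characters that commonly render as "squares".
# Assumption: the squares in the output are U+25AF (white vertical rectangle)
# or U+FFFD (replacement character). Adjust this list if yours differ.
SUSPECT_CHARS = ["\u25AF", "\uFFFD"].freeze

# Returns the fraction (0.0..1.0) of non-whitespace characters in +text+
# that are one of the suspect placeholder characters.
def suspect_ratio(text)
  visible = text.gsub(/\s+/, "")
  return 0.0 if visible.empty?

  visible.count(SUSPECT_CHARS.join).to_f / visible.length
end

# Prints one summary line per page of the PDF at +path+.
def report(path)
  # Required here so the helper above can be used without the gem installed.
  require "pdf/reader"

  PDF::Reader.new(path).pages.each do |page|
    text = page.text
    printf("page %d: %d chars, %.0f%% suspect\n",
           page.number, text.length, suspect_ratio(text) * 100)
  end
end

# Only runs when invoked directly with a file argument.
report(ARGV[0]) if __FILE__ == $PROGRAM_NAME && ARGV[0]
```

Saved under a name of your choice and run as `ruby your_script.rb file.pdf`, a high percentage on a page would suggest the characters could not be mapped during extraction, while a low percentage alongside squares on screen might instead point at how the text is being displayed or stored afterwards.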

Thanks again James, will have a look into it.

Much appreciated.