yob/pdf-reader

Page#text does not return all the text


For some reason PDF::Reader::Page#text does not return all the text in a PDF file I'm scanning, although I can get the missing text by looking at the runs directly. Here is the file: https://hacktivista.org/tmp/2700968.pdf

The text I'm unable to get through #text is: LECTURA ACTUAL 15-MAY-2023

For the time being I just monkey-patched the class to add an :unformatted option. I'll leave it here:

require 'pdf/reader'

module PDF
  class Reader
    # PDF::Reader::Page monkey patches.
    class Page
      alias_method :_text, :text
      remove_method :text

      # @param [Hash] opts Adds :unformatted option.
      def text(opts = {})
        return runs.map(&:text).join(' ') if opts[:unformatted]

        _text(opts)
      end
    end
  end
end
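For anyone who would rather not alias and remove the original method, the same :unformatted option could probably be added with Module#prepend instead. The sketch below is only an illustration of that pattern: FakeRun, FakePage, and UnformattedText are hypothetical names, and FakePage is a stand-in so the example runs without the gem or a PDF. It assumes, as the patch above does, that a page responds to #runs and that each run responds to #text.

```ruby
# Hypothetical stand-ins for PDF::Reader::Page and its text runs, so this
# sketch runs without the gem. Only the #runs / #text shape mirrors the patch.
FakeRun = Struct.new(:text)

class FakePage
  def runs
    [FakeRun.new('LECTURA'), FakeRun.new('ACTUAL'), FakeRun.new('15-MAY-2023')]
  end

  # Stand-in for the layout-based output that misses part of the text.
  def text(opts = {})
    'LECTURA ACTUAL'
  end
end

# Prepended module: handles :unformatted itself, otherwise defers to the
# original #text through super (no alias_method / remove_method needed).
module UnformattedText
  def text(opts = {})
    return runs.map(&:text).join(' ') if opts[:unformatted]

    super
  end
end

FakePage.prepend(UnformattedText)

page = FakePage.new
puts page.text                     # original behavior, reached via super
puts page.text(unformatted: true)  # every run joined with single spaces
```

With the real library, the equivalent call would presumably be PDF::Reader::Page.prepend(UnformattedText) after require 'pdf/reader'. Note that joining runs with spaces discards line breaks and layout, so the result is only useful for searching or matching text.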

Had the same issue as well, looking forward to seeing a fix merged into the library.
In the meantime, thanks @hacktivista for this monkey patch.

Having the same issue!