Page#text does not return all the text

Question

Opened this issue a year ago · 3 comments

For some reason, PDF::Reader::Page#text does not return all the text in a PDF file I'm scanning, although I can get the missing text by looking at the runs directly. Here is the file: https://hacktivista.org/tmp/2700968.pdf

The text I'm unable to get through #text is "LECTURA ACTUAL 15-MAY-2023".

Answer 1 · 2023-06-16T18:01:54.000Z

For the time being I just monkey-patched the class to add an :unformatted option. I'll leave it here:

require 'pdf/reader'

module PDF
  class Reader
    # PDF::Reader::Page monkey patches.
    class Page
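      # Keep the original implementation reachable as #_text before
      # redefining #text.
      alias_method :_text, :text
      remove_method :text

      # @param [Hash] opts Adds :unformatted option.
      # @return [String] run texts joined by spaces when opts[:unformatted]
      #   is set; the original formatted text otherwise.
      def text(opts = {})
        return runs.map(&:text).join(' ') if opts[:unformatted]

        _text(opts)
      end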
    end
  end
end
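
The same idea can probably also be written with Module#prepend instead of alias_method/remove_method, so the original #text stays reachable through super. This is only a sketch: UnformattedText, StubRun and StubPage are hypothetical names that are not part of pdf-reader, and the stub stands in for PDF::Reader::Page so the snippet runs without the gem or a PDF file.

```ruby
# Hypothetical module name; not part of pdf-reader.
module UnformattedText
  # Adds an :unformatted option; any other call falls through to the
  # original #text via super (which forwards opts unchanged).
  def text(opts = {})
    return runs.map(&:text).join(' ') if opts[:unformatted]

    super
  end
end

# Stand-ins for a page and its text runs, only to make the sketch runnable.
StubRun = Struct.new(:text)

class StubPage
  def runs
    [StubRun.new('LECTURA ACTUAL'), StubRun.new('15-MAY-2023')]
  end

  # Mimics the reported symptom: the formatted output is missing some text.
  def text(opts = {})
    'LECTURA ACTUAL'
  end
end

StubPage.prepend(UnformattedText)

page = StubPage.new
puts page.text(unformatted: true) # => LECTURA ACTUAL 15-MAY-2023
puts page.text                    # => LECTURA ACTUAL
```

Assuming the real class keeps the text(opts = {}) signature and the #runs method used in the patch above, applying it to the library would be PDF::Reader::Page.prepend(UnformattedText) after require 'pdf/reader'.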

Answer 2 · 2023-10-24T13:20:55.000Z

I had the same issue as well and am looking forward to seeing a fix merged into the library.
In the meantime, thanks @hacktivista for this monkey patch.

Answer 3 · 2024-04-22T17:39:46.000Z

Having the same issue!