yob/pdf-reader

Width in run elements is too small/col_count too high


I am not sure whether this is a problem with the PDF itself, but it seems that when initialize in lib/pdf/reader/page_layout.rb maps mean_character_width over @runs, the width of some runs is extremely low (less than 1e-15). Taking the median of those values then yields an abnormally high number of columns.
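To illustrate why a tiny median width matters, here is a minimal sketch. It assumes the column count is derived by dividing the page width by the median character width. The helpers median and estimate_col_count, the 612-point page width, and the sample widths are hypothetical and are not taken from pdf-reader.

```ruby
# Hypothetical helper, not pdf-reader's API: upper median of a list of numbers.
def median(values)
  return nil if values.empty?
  sorted = values.sort
  sorted[sorted.length / 2]
end

# Hypothetical estimate: how many characters of the median width fit across the page.
def estimate_col_count(page_width, char_widths)
  glyph_width = median(char_widths)
  return 0 if glyph_width.nil? || glyph_width.zero?
  (page_width / glyph_width).floor
end

# Assumed values: a 612pt-wide page and made-up character widths.
normal = estimate_col_count(612.0, [5.2, 5.5, 6.0])
broken = estimate_col_count(612.0, [1e-16, 1e-16, 5.5])

puts normal # 111
puts broken # on the order of 10**18
```

With ordinary widths the estimate stays near a hundred columns. Once most runs report a near-zero width, the median is near zero too and the estimate grows far beyond anything that could be allocated as a string.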

I've made a workaround in this fork and it now works for those PDFs:
kodius@6b232e9

The PDF that is causing these issues for me is this one:
dorset.pdf

Specifically pages 31 and 39-50, i.e. those that are mostly blank or contain images.

This is how to reproduce it:

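require "pdf-reader"
require "stringio"

# Read the whole file in binary mode; unlike File.open(...).read this does not leave the handle open.
data = File.binread("dorset.pdf")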

PDF::Reader.open(StringIO.new(data)) do |reader|
  reader.pages.each_with_index do |page, index|
    pp "page #{index + 1}"
    pp page.text
  end
end

It should break at page 31 with no error message given. Debugging more deeply shows that it actually fails to allocate memory, because to_s in page_layout.rb (the same as the .text method) does the following and col_count is simply too high:

      page = row_count.times.map { |i| " " * col_count }
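One possible guard, sketched under assumptions and not necessarily what the fork commit above does, is to ignore implausibly small widths before taking the median and to cap the result. MIN_GLYPH_WIDTH, MAX_COLS and guarded_col_count are hypothetical names and thresholds chosen for illustration only.

```ruby
MIN_GLYPH_WIDTH = 0.01 # assumed threshold in PDF points; narrower runs are ignored
MAX_COLS = 1_000       # assumed upper bound on columns per row

# Hypothetical helper, not pdf-reader's API.
def guarded_col_count(page_width, char_widths)
  # Drop degenerate widths so they cannot drag the median towards zero.
  usable = char_widths.select { |w| w >= MIN_GLYPH_WIDTH }.sort
  return 0 if usable.empty?

  glyph_width = usable[usable.length / 2]
  # Cap the result so " " * col_count can never request an absurd amount of memory.
  [(page_width / glyph_width).floor, MAX_COLS].min
end

puts guarded_col_count(612.0, [1e-16, 1e-16, 5.5]) # 111
puts guarded_col_count(612.0, [1e-16])             # 0
```

Filtering and capping are independent safety nets: the filter keeps the estimate meaningful on mostly blank pages, while the cap bounds the allocation even if every remaining width is small.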