Width in run elements is too small/col_count too high
I am not sure if this is a problem with the PDF itself, but when initialize in lib/pdf/reader/page_layout.rb maps mean_character_width over @runs, the width of some runs is extremely low (less than 1e-15). Taking the median of those results then yields an abnormally high number of columns.
I've made a workaround in this fork and it now works for those PDFs:
kodius@6b232e9
The PDF that is causing these issues for me is this one:
dorset.pdf
Specifically pages 31 and 39-50, i.e. the pages that are mostly blank or contain only images.
This is how to reproduce it:
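require "pdf/reader"
require "stringio"

# Read the whole file in binary mode; File.binread also closes the handle.
data = File.binread("dorset.pdf")
PDF::Reader.open(StringIO.new(data)) do |reader|
  reader.pages.each_with_index do |page, index|
    pp "page #{index + 1}"
    pp page.text
  end
end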
It breaks at page 31 without any error message. Debugging further shows that it actually fails to allocate memory: to_s in page_layout.rb (the same code path as the .text method) does the following, and col_count is simply too high:
page = row_count.times.map { |i| " " * col_count }
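To illustrate why a near-zero run width is enough to exhaust memory, here is a minimal sketch. It assumes col_count is derived roughly as the page width divided by the median glyph width, which is my reading of page_layout.rb rather than a copy of it. All names below (PAGE_WIDTH, MIN_GLYPH_WIDTH, naive_col_count, guarded_col_count) are hypothetical, and the guard is only one possible approach, not necessarily what the fork linked above does.

```ruby
# Hypothetical stand-ins; none of these names come from pdf-reader.
PAGE_WIDTH = 612.0       # page width in points (assumed, US Letter)
MIN_GLYPH_WIDTH = 0.01   # arbitrary cut-off for degenerate widths (assumed)

# Upper median of a list of numbers, or nil for an empty list.
def median(values)
  return nil if values.empty?
  values.sort[values.size / 2]
end

# Assumed shape of the calculation: page width / median glyph width.
def naive_col_count(widths)
  (PAGE_WIDTH / median(widths)).floor
end

# Same calculation, but near-zero widths are dropped before taking the
# median, and a page with no usable widths gets zero columns.
def guarded_col_count(widths)
  usable = widths.select { |w| w >= MIN_GLYPH_WIDTH }
  m = median(usable)
  m ? (PAGE_WIDTH / m).floor : 0
end

normal     = [5.0, 6.0, 7.0]
degenerate = [1e-16, 1e-16, 6.0]  # mostly near-zero widths, as on the failing pages

puts naive_col_count(normal)        # a sane column count
puts naive_col_count(degenerate)    # on the order of 10**18 columns
puts guarded_col_count(degenerate)  # back to a sane column count
```

With a column count on the order of 10**18, each " " * col_count row would need far more memory than any machine has, which matches the allocation failure described above.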