yob/pdf-reader

Sorting/layout issues when Y coordinates don't exactly match


Hi,

We've been using an old version of this gem (1.4.1) for a little while now and are looking to upgrade to the latest version. That upgrade broke some of our specs, and on closer inspection it seems the logic around PageLayout has changed.

It might just be bad luck, but the use of `round` here (for the X and Y coordinates) causes problems when the PDF draws pieces of text at slightly different Y coordinates.

Below is an example:
image
In this case, the text inside the boxes/rectangles sits slightly lower than the form's labels, so some of it ends up on a separate line:

Claim Number:           PHNP1610102                                     Contact:
Insured:                                                                Phone:
                         Fairfield Boys Club
Address 1:                                                              Email:
                         c/o Bejo Nanni, Treasurer
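
To illustrate why rounding alone can split such a line, here is a minimal sketch. The multiplier, offset, and Y values are made-up numbers chosen for illustration, not values taken from the PDF above; only the `((y - offset) / multiplier).round` shape mirrors the layout code.

```ruby
# Hypothetical values: none of these come from the actual PDF.
row_multiplier = 12.0
y_offset       = 0.0
label_y        = 702.5 # e.g. the "Insured:" label
value_y        = 699.5 # the boxed value, drawn 3 points lower

# Same shape as the row calculation in PageLayout#to_s.
row_for = ->(y) { ((y - y_offset) / row_multiplier).round }

label_row = row_for.call(label_y) # 58.54... rounds up
value_row = row_for.call(value_y) # 58.29... rounds down

puts "label row: #{label_row}, value row: #{value_row}"
```

Although the two runs are only 3 points apart, they fall on opposite sides of a rounding boundary and land on different rows.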

Another example:
image

We could monkey patch or fork the repo to make those changes, but please see below the code that we're going to be using. I can create a PR if this repo is still well maintained. Please let me know.

PageLayout

class PDF::Reader
  class PageLayout

    def to_s
      return "" if @runs.empty?
      return "" if row_count == 0
      first_run_at_new_y = nil # remembering a previous run at a new Y coordinate

      page = row_count.times.map { |i| " " * col_count }
      @runs.each do |run|
        x_pos = ((run.x - @x_offset) / col_multiplier).round
        y_ref_run = run # line added
        if first_run_at_new_y && run.similar_y_coord?(first_run_at_new_y) # line added
          y_ref_run = first_run_at_new_y # line added
        else # line added
          first_run_at_new_y = run # line added
        end # line added
        y_pos = row_count - ((y_ref_run.y - @y_offset) / row_multiplier).round # line updated
        if y_pos <= row_count && y_pos >= 0 && x_pos <= col_count && x_pos >= 0
          local_string_insert(page[y_pos-1], run.text, x_pos)
        end
      end
      interesting_rows(page).map(&:rstrip).join("\n")
    end

  end
end

TextRun

class PDF::Reader
  class TextRun

    # def <=>(other)
    #   if similar_y_coord?(other)
    #     x <=> other.x
    #   else
    #     other.y <=> y
    #   end
    # end

    def similar_y_coord?(other, threshold = nil)
      # Arbitrary threshold. It could probably safely be bumped higher (dividing by 2, for instance).
      threshold ||= [self.font_size, other.font_size].min / 3
      (self.y - other.y).abs < threshold
    end

  end
end
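
For anyone who wants to try the idea without the gem, here is a rough standalone sketch. `Run` is a hypothetical stand-in for `PDF::Reader::TextRun` and the coordinates are invented. It also divides by `3.0` rather than `3`, which is my own assumption to avoid integer division when the font size is an Integer.

```ruby
# Run is a hypothetical stand-in for PDF::Reader::TextRun.
Run = Struct.new(:y, :font_size, :text) do
  def similar_y_coord?(other, threshold = nil)
    # Assumption: 3.0 keeps the threshold fractional for Integer font sizes.
    threshold ||= [font_size, other.font_size].min / 3.0
    (y - other.y).abs < threshold
  end
end

# Invented coordinates: the value sits 3 points below its label.
label = Run.new(702.5, 10, "Insured:")
value = Run.new(699.5, 10, "Fairfield Boys Club")
far   = Run.new(688.0, 10, "Address 1:")

# Group runs the way the patched to_s does: each run is compared with
# the first run seen at the current Y coordinate.
rows = []
first_run_at_new_y = nil
[label, value, far].each do |run|
  if first_run_at_new_y && run.similar_y_coord?(first_run_at_new_y)
    rows.last << run.text
  else
    first_run_at_new_y = run
    rows << [run.text]
  end
end

rows.each { |texts| puts texts.join("  ") }
```

With these numbers the label and its value share a row, while the run 14.5 points lower starts a new one.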

Thank you.

EDIT: I updated the code above to properly handle multiple text runs that would otherwise have been drawn on the following line.

yob commented

Thanks for the well-written report.

If you have the code in a fork for your own use, I'd love a PR that I can play with, check our spec suite, etc ❤️