yob/pdf-reader

Superscript words not being returned.

RichardsonWTR opened this issue · 3 comments

I created a document in LibreOffice, typed "1st page test", and exported it to a PDF file.
LibreOffice automatically superscripted the letters 'st'.
Screenshot from 2021-09-08 10-58-50

The pdf-reader gem returns "1 page test".

yob commented

We're not intentionally skipping superscript, but depending on how it's encoded there are a few reasons why it might be missing from the output.

The most likely is that pdf-reader's naive "render text of different sizes onto a page of fixed width plain text characters" algorithm thinks that the "st" needs to be rendered in the same position as the "1", so it skips it.

Long term I'd love to improve that algorithm (it's in PDF::Reader::PageLayout), but I'm pretty short on time. If you're able to provide a copy of the PDF, I can at least take a look and confirm the root cause for you.

Thanks for your quick feedback!
Here it is @yob !

PDF test.pdf

yob commented

Yup, it's the naive algorithm in PageLayout.

If I extract the text from page 1, and inspect the value of @runs at this point:

@runs = merge_runs(OverlappingRunsFilter.exclude_redundant_runs(runs))

It looks like this:

[
  "st" w:4.641 size:7px @62.8,778.6,
  "1 page test" w:55.928 size:12px @56.8,773.9
]

It's decided that the "st" baseline (y=778.6) is sufficiently different from the baseline of the characters near it (y=773.9) that it's a separate text run. Once that happens, it won't render the characters over each other in the final layout.
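To make those numbers concrete, here is a small sketch that plugs the two runs above into a hypothetical `SampleRun` struct (not pdf-reader's own class; only the fields printed above are modelled). It shows that the two baselines are only a few points apart and that the "st" run sits horizontally inside the extent of the longer run, which is why the two compete for the same columns on a fixed-width grid.

```ruby
# Hypothetical stand-in for a text run; pdf-reader's real class has more to it.
SampleRun = Struct.new(:text, :x, :y, :width, :font_size) do
  # Right-hand edge of the run.
  def endx
    x + width
  end
end

# Values copied from the inspected @runs output above.
superscript = SampleRun.new("st", 62.8, 778.6, 4.641, 7)
main_line   = SampleRun.new("1 page test", 56.8, 773.9, 55.928, 12)

# Vertical distance between the two baselines, rounded to one decimal place.
baseline_gap = (superscript.y - main_line.y).round(1)

# Does the superscript fall entirely within the main run's horizontal extent?
inside = superscript.x >= main_line.x && superscript.endx <= main_line.endx

puts "baseline gap: #{baseline_gap}"
puts "superscript inside main run horizontally: #{inside}"
```

The gap is smaller than the 12px size of the surrounding text, which suggests the superscript belongs to the same visual line even though it has been split into its own run.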

I'd happily accept a PR that improves the specific case of superscript text if you're up for it.

The test file you've provided would be perfect for a new spec in spec/integration_spec.rb. The fix may not be super easy, but you'd have to start by making this grouping by Y smarter:

runs.group_by { |char|
  char.y.to_i
}.map { |y, chars|
  group_chars_into_runs(chars.sort)
}.flatten.sort
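One possible direction for a smarter grouping, as a rough sketch only: instead of bucketing on the truncated integer Y, let a character join an existing line when its baseline is within some fraction of that line's font size. Everything here is an assumption for illustration: the `Glyph` struct, the `group_by_baseline` helper, the 0.5 tolerance, and the x position given to " page test" are all hypothetical, not pdf-reader's API or data.

```ruby
# Hypothetical stand-in for the objects PageLayout groups; not pdf-reader's class.
Glyph = Struct.new(:text, :x, :y, :font_size)

# Group glyphs into lines. Larger text is placed first so that body text sets
# each line's reference baseline; a glyph then joins the first line whose
# baseline is within `tolerance * font_size` of its own.
def group_by_baseline(glyphs, tolerance: 0.5)
  lines = []
  glyphs.sort_by { |g| [-g.font_size, -g.y] }.each do |glyph|
    line = lines.find { |l| (l[:y] - glyph.y).abs <= l[:font_size] * tolerance }
    if line
      line[:glyphs] << glyph
    else
      lines << { y: glyph.y, font_size: glyph.font_size, glyphs: [glyph] }
    end
  end
  # Top of the page first (higher y), then left to right within each line.
  lines.sort_by { |l| -l[:y] }.map { |l| l[:glyphs].sort_by(&:x) }
end

glyphs = [
  Glyph.new("st", 62.8, 778.6, 7),
  Glyph.new("1", 56.8, 773.9, 12),
  Glyph.new(" page test", 67.4, 773.9, 12), # x is illustrative
  Glyph.new("next line", 56.8, 750.0, 12),  # a separate, lower line
]

grouped = group_by_baseline(glyphs).map { |line| line.map(&:text).join }
puts grouped.inspect
```

With these sample values the "st" lands on the same line as the "1", while the text at y=750.0 stays separate. A real fix would still need to feed the result into group_chars_into_runs and decide how the smaller glyphs map onto the fixed-width grid.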