Superscript words not being returned.
RichardsonWTR opened this issue · 3 comments
We're not intentionally skipping superscripts, but depending on how they're encoded there are a few reasons why they might be missing from the output.
The most likely is that pdf-reader's naive "render text of different sizes onto a page of fixed-width plain text characters" algorithm thinks that the `st` needs to be rendered in the same position as the `1`, so it skips those characters.
Long term I'd love to improve that algorithm (it's in `PDF::Reader::PageLayout`), but I'm pretty short on time. If you're able to provide a copy of the PDF, I can at least take a look and confirm the root cause for you.
Thanks for your quick feedback!
Here it is @yob !
Yup, it's the naive algorithm in PageLayout.
If I extract the text from page 1 and inspect the value of `@runs` at this point:
pdf-reader/lib/pdf/reader/page_layout.rb
Line 20 in 8557768
It looks like this:
[
"st" w:4.641 size:7px @62.8,778.6,
"1 page test" w:55.928 size:12px @56.8,773.9
]
It's decided that the `st` baseline (y=778.6) is sufficiently different from the baseline of the characters near it (y=773.9) that it's a separate text run. Once that happens, it won't render the characters over each other in the final layout.
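To make the numbers above concrete, here is a minimal sketch of a heuristic that would flag `st` as a superscript of the neighbouring run. It is not pdf-reader's code: the `Run` struct, its field names, and the `superscript_of?` helper are all hypothetical, and the thresholds are assumptions chosen for illustration.

```ruby
# Hypothetical stand-in for a text run; the field names are assumptions,
# not pdf-reader's own TextRun API.
Run = Struct.new(:text, :x, :y, :width, :font_size)

# Heuristic: a run looks like a superscript of `base` when it uses a smaller
# font, sits above the base baseline by less than the base font size
# (PDF y coordinates grow upward), and starts within the horizontal extent
# of the base run.
def superscript_of?(candidate, base)
  rise = candidate.y - base.y
  candidate.font_size < base.font_size &&
    rise > 0 && rise < base.font_size &&
    candidate.x >= base.x && candidate.x <= base.x + base.width
end

# The two runs from the dump above.
st   = Run.new("st", 62.8, 778.6, 4.641, 7)
base = Run.new("1 page test", 56.8, 773.9, 55.928, 12)

puts st.y == base.y             # => false (an exact baseline match fails)
puts superscript_of?(st, base)  # => true
```

The baselines differ by roughly 4.7 points, which is less than the 12px font size of the surrounding text, so a check relative to font size can tell a superscript apart from a genuinely separate line.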
I'd happily accept a PR that improves the specific case of superscript text if you're up for it.
The test file you've provided would be perfect for a new spec in `spec/integration_spec.rb`. The fix may not be super easy, but you'd have to start by making this grouping by Y smarter:
pdf-reader/lib/pdf/reader/page_layout.rb
Lines 100 to 104 in 8557768
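As a starting point for that kind of change, one possible approach is to group runs by baseline with a tolerance proportional to font size instead of requiring the baselines to match. The sketch below is standalone Ruby and does not reproduce the code at the lines referenced above: the `Run` struct, the `group_into_lines` method, and the `tolerance` default of 0.5 are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for a text run; not pdf-reader's own class.
Run = Struct.new(:text, :x, :y, :width, :font_size)

# Group runs into lines. A run joins an existing line when its baseline is
# within `tolerance` times that line's font size of the line's baseline.
def group_into_lines(runs, tolerance: 0.5)
  lines = []
  # Visit larger fonts first so each line's baseline is set by body text
  # rather than by a superscript or subscript.
  runs.sort_by { |r| -r.font_size }.each do |run|
    line = lines.find do |l|
      (l[:baseline] - run.y).abs <= tolerance * l[:font_size]
    end
    if line
      line[:runs] << run
    else
      lines << { baseline: run.y, font_size: run.font_size, runs: [run] }
    end
  end
  # PDF y grows upward, so the highest baseline is the first line on the
  # page. Within a line, order runs left to right.
  lines.sort_by { |l| -l[:baseline] }.map { |l| l[:runs].sort_by(&:x) }
end

runs = [
  Run.new("st", 62.8, 778.6, 4.641, 7),
  Run.new("1 page test", 56.8, 773.9, 55.928, 12)
]

lines = group_into_lines(runs)
puts lines.length                    # => 1
puts lines.first.map(&:text).inspect # => ["1 page test", "st"]
```

Grouping is only the first half of a fix. In this sample the `st` run starts inside the horizontal extent of the `1 page test` run, so the layout step would still need to decide where to place the superscript characters rather than dropping them.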