Text passed to receiver is not UTF-8 encoded

Question

martinadamek opened this issue 3 years ago · 2 comments

Maybe I misunderstood the README, which says:

Regardless of the internal encoding used in the PDF all text will be converted to UTF-8 before it is passed back from PDF::Reader.

While parsing some invoices, I am receiving an ASCII-8BIT (or US-ASCII) encoded string in my receiver:

def show_text(arg)
  puts arg.encoding
end

Should this be possible, or did I misunderstand the API and docs?
Btw, when I don't use a receiver and check page.text instead, it is UTF-8 encoded, as expected.

martinadamek commented 3 years ago

Thanks!

Answer 1 · 2022-04-07T14:40:31.000Z

Apologies for the confusion.

page.text should always return UTF-8 encoded text that's marked as such. The show_text callback is lower level: it returns the raw character codes from the PDF content stream, and those are very rarely in any recognisable encoding. They usually have to be converted to UTF-8 via a mapping process.
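
To give a rough idea of what "a mapping process" means here, below is a minimal sketch in plain Ruby. It does not use PDF::Reader at all, and it is not how the library does it internally. The table contents and the decode_raw helper are made up for illustration, and it assumes one byte per character code, which real PDF fonts don't always follow.

```ruby
# Hypothetical table mapping raw character codes to Unicode code points.
# In a real PDF this information comes from the font; these values are made up.
code_to_unicode = {
  0x01 => 0x0048, # "H"
  0x02 => 0x0069  # "i"
}

# Convert a raw, single-byte-coded string to UTF-8 via the given table.
# Codes missing from the table become U+FFFD (the replacement character).
def decode_raw(raw, table)
  raw.bytes.map { |code| table.fetch(code, 0xFFFD) }.pack("U*")
end

# A binary string, similar in spirit to what a low-level callback may receive.
raw = "\x01\x02".b
text = decode_raw(raw, code_to_unicode)

puts raw.encoding   # binary, not UTF-8
puts text.encoding  # UTF-8
puts text
```

The point is only that the bytes handed to a low-level callback are codes to be looked up, not text in some encoding that could be fixed with force_encoding. If all you need is the readable text, page.text is the simpler route.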