Text passed to receiver is not UTF-8 encoded
martinadamek opened this issue · 2 comments
Maybe I misunderstood the README, which says:
Regardless of the internal encoding used in the PDF all text will be converted to UTF-8 before it is passed back from PDF::Reader.
While parsing some invoices, I am receiving an ASCII-8BIT (or US-ASCII) encoded string in my receiver:
def show_text(arg)
  puts arg.encoding
end
Should this be possible, or did I misunderstand the API and docs?
Btw, when I don't use a receiver and check page.text, it is UTF-8 encoded, as expected.
Apologies for the confusion.
page.text should always return UTF-8 encoded text that's marked as such. The show_text callback is lower level: it returns the raw character codes from the PDF content stream, and those are very rarely in any recognisable encoding. They usually have to be converted into UTF-8 via a mapping process.
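To illustrate what "a mapping process" means here, the following is a minimal, self-contained sketch in plain Ruby. It does not use PDF::Reader and is not how the library does it internally; the table HYPOTHETICAL_MAP and the helper raw_codes_to_utf8 are made-up names for illustration, standing in for whatever encoding or code-to-Unicode table a real font supplies. It also assumes single-byte character codes, which is not true of every font.

```ruby
# Hypothetical table from raw character codes to Unicode codepoints.
# In a real PDF this information would come from the font; these values
# are invented for the example.
HYPOTHETICAL_MAP = {
  0x01 => 0x48, # "H"
  0x02 => 0x69, # "i"
  0x03 => 0xE9  # "é"
}.freeze

# Convert a binary string of single-byte character codes to UTF-8.
# Codes missing from the table become U+FFFD (replacement character).
def raw_codes_to_utf8(raw, map)
  raw.bytes.map { |code| map.fetch(code, 0xFFFD) }.pack("U*")
end

# A binary string, similar in spirit to what a low-level callback may receive.
raw = "\x01\x02\x03".b
puts raw.encoding   # binary (ASCII-8BIT), not UTF-8

text = raw_codes_to_utf8(raw, HYPOTHETICAL_MAP)
puts text.encoding  # UTF-8
puts text           # Hié
```

The raw bytes carry no meaning on their own; only the table turns them into readable text, which is why the string seen in show_text is not tagged as UTF-8 while the output of page.text is.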
Thanks!