Parsing non-UTF-8 pages

Question

edevil opened this issue 8 years ago · 3 comments

Parsing pages not written in UTF-8 currently produces errors:

> %HTTPoison.Response{body: body} = HTTPoison.get!("http://manybooks.net/index.xml")
> Html5ever.parse(body)

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }', src/libcore/result.rs:859
note: Run with `RUST_BACKTRACE=1` for a backtrace.
{:error, "called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }"}

In this case, the XML feed declares its encoding in the XML preamble:

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...

Can I get around this problem or can the library be fixed to handle this situation?

Answer 1 · 2017-05-31T17:27:55.000Z

I'll leave the broader question of "can the library be fixed to handle this situation?" to Hans, but-

"Can I get around this problem"

Yeah, for some definition of "get around".

body
|> Codepagex.to_string!(:iso_8859_1)
|> Html5ever.parse()
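
If the encoding is not known up front, one option is to read it from the XML preamble before converting. The following is only a rough sketch under a few assumptions: EncodingSketch and its functions are hypothetical names (not part of Html5ever or Codepagex), the preamble is assumed to sit at the very start of the body, and only ISO-8859-1 and UTF-8 are handled, using Erlang's built-in :unicode module rather than Codepagex.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper module for illustration; not part of Html5ever or Codepagex.

  # Returns the encoding named in a leading XML preamble, lowercased,
  # or nil when no encoding is declared.
  def declared_encoding(body) when is_binary(body) do
    # No `u` modifier: the regex matches on raw bytes, so it also works
    # on bodies that are not valid UTF-8.
    case Regex.run(~r/\A<\?xml[^>]*?encoding=["']([A-Za-z0-9._-]+)["']/, body) do
      [_, enc] -> String.downcase(enc)
      _ -> nil
    end
  end

  # Converts the body to UTF-8 for the small set of encodings handled here.
  def to_utf8(body) when is_binary(body) do
    case declared_encoding(body) do
      enc when enc in ["iso-8859-1", "latin1"] ->
        # Every byte is a valid Latin-1 character, so this always yields a binary.
        {:ok, :unicode.characters_to_binary(body, :latin1, :utf8)}

      enc when enc in [nil, "utf-8"] ->
        # Nothing to convert, but reject input that is not actually valid UTF-8.
        if String.valid?(body), do: {:ok, body}, else: {:error, :invalid_utf8}

      other ->
        {:error, {:unsupported_encoding, other}}
    end
  end
end
```

The `{:ok, utf8}` result could then be passed to Html5ever.parse/1 as above. For encodings beyond these two, a library such as Codepagex would presumably still be needed, and pages that declare their encoding elsewhere (for example only in an HTTP header) are not covered by this sketch.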

Answer 2 · 2017-06-01T11:22:43.000Z

Thanks, @mischov!

Answer 3 · 2017-06-01T11:45:11.000Z

Going to keep this open; I would still like to find a proper solution for this.

As far as I can tell, html5ever does not support detecting encoding yet. See this issue.