Parsing non-UTF-8 pages
edevil opened this issue
edevil commented
Parsing pages not written in UTF-8 currently produces errors:
> %HTTPoison.Response{body: body} = HTTPoison.get!("http://manybooks.net/index.xml")
> Html5ever.parse(body)
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }', src/libcore/result.rs:859
note: Run with `RUST_BACKTRACE=1` for a backtrace.
{:error, "called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }"}
In this case, the XML feed declares its encoding in the XML preamble:
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...
Can I get around this problem or can the library be fixed to handle this situation?
mischov commented
I'll leave the broader question of "can the library be fixed to handle this situation?" to Hans, but as for:
> Can I get around this problem
Yeah, for some definition of "get around":
body
|> Codepagex.to_string!(:iso_8859_1)
|> Html5ever.parse()
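If the source encoding isn't known ahead of time, a rough sketch of a more general workaround might look like the following. It assumes the body declares its encoding in the XML preamble, as in this feed; the FeedEncoding module, its function names, and the regex are purely illustrative and not part of Html5ever or Codepagex.

# Illustrative sketch only: sniff the declared encoding from the XML
# preamble and convert the body to UTF-8 before handing it to Html5ever.
defmodule FeedEncoding do
  def parse(body) do
    body
    |> to_utf8()
    |> Html5ever.parse()
  end

  # Look for encoding="..." in the <?xml ...?> declaration.
  defp to_utf8(body) do
    case Regex.run(~r/<\?xml[^>]*encoding="([^"]+)"/i, body) do
      [_, enc] ->
        if String.downcase(enc) in ["iso-8859-1", "latin1"],
          do: Codepagex.to_string!(body, :iso_8859_1),
          else: body

      _ ->
        # no declaration found; assume the body is already valid UTF-8
        body
    end
  end
end

This only handles the one encoding shown above; a real solution would probably also want to look at the HTTP Content-Type charset and map other declared encodings to the conversions Codepagex has available.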