mochi/mochiweb

Parsing HTML not in UTF-8

I'm trying to parse an HTML page (an RSS feed in this case) that is not encoded in UTF-8. The XML preamble specifies:

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...

Parsing this (example in Elixir):

%HTTPoison.Response{body: body} = HTTPoison.get!("http://manybooks.net/index.xml")
:mochiweb_html.parse(body)
...
      {"description", [],
       [<<60, 112, 62, 84, 104, 101, 32, 77, 121, 116, 104, 32, 72, 117, 110,
          116, 101, 114, 32, 66, 111, 111, 107, 32, 49, 32, 60, 47, 112, 62,
          ...>>]}, {"pubdate", [], ["Thu, 09 Jul 15 02:37:08 -0600"]},
...

we get text nodes that are not decoded correctly: the description comes back as a raw binary instead of a readable string. Is there some way for mochiweb to return correctly decoded results, or do I always have to pass it UTF-8 encoded input?
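For reference, one possible workaround is to re-encode the body to UTF-8 before parsing. This is only a minimal sketch, assuming the feed really is ISO-8859-1 as its preamble claims. It is not something mochiweb does for you as far as this thread shows. The sample binary and the variable names are made up for illustration; `:unicode.characters_to_binary/3` comes from Erlang's standard library.

```elixir
# Hypothetical sample: "café" encoded as ISO-8859-1 (é is the single byte 0xE9).
latin1_body = <<"<p>caf", 0xE9, "</p>">>

# The bytes are not valid UTF-8, which is why Elixir shows such data as a raw binary.
false = String.valid?(latin1_body)

# Re-encode from Latin-1 to UTF-8 with Erlang's standard unicode module.
utf8_body = :unicode.characters_to_binary(latin1_body, :latin1, :utf8)

# The result is a valid UTF-8 string.
true = String.valid?(utf8_body)
"<p>café</p>" = utf8_body

# Assumption: the re-encoded binary can then be parsed exactly as before:
# :mochiweb_html.parse(utf8_body)
```

The source encoding is hard-coded here. A real feed reader would probably need to read it from the XML preamble or the HTTP Content-Type header first.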

Ok, thanks!