Parsing HTML not in UTF-8

Question

Parsing HTML not in UTF-8

Closed this issue 7 years ago · 2 comments

I'm trying to parse an HTML page (an RSS feed in this case) that is not in UTF-8. The XML preamble specifies:

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...

Parsing this (example in Elixir):

%HTTPoison.Response{body: body} = HTTPoison.get!("http://manybooks.net/index.xml")
:mochiweb_html.parse(body)
...
      {"description", [],
       [<<60, 112, 62, 84, 104, 101, 32, 77, 121, 116, 104, 32, 72, 117, 110,
          116, 101, 114, 32, 66, 111, 111, 107, 32, 49, 32, 60, 47, 112, 62,
          ...>>]}, {"pubdate", [], ["Thu, 09 Jul 15 02:37:08 -0600"]},
...

some of the text comes back not decoded correctly. Is there some way for mochiweb to return the correct results, or do I always have to pass UTF-8 encoded input to it?

Answer 1 · 2017-05-31T16:48:59.000Z

mochiweb_html only supports UTF-8 input, so you'll have to deal with any encoding specified in an XML processing instruction beforehand. You could use mochiweb_html:tokens(Body) to check whether there's a processing instruction that specifies the encoding, then convert the body to UTF-8 and re-parse if necessary. This works for encodings that are a superset of ASCII, since the markup itself can still be tokenized before conversion.
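
As a rough illustration of the "convert beforehand" step, here is a minimal sketch in Elixir. It assumes the encoding is declared in the XML preamble and only handles ISO-8859-1. The module name FeedEncoding and the function to_utf8/1 are hypothetical helpers, not part of mochiweb, and the sketch reads the declaration with a regex rather than with mochiweb_html:tokens/1, to avoid assuming the exact shape of the returned tokens.

```elixir
defmodule FeedEncoding do
  # Hypothetical helper (not part of mochiweb): re-encode a response body as
  # UTF-8 when its XML declaration names ISO-8859-1. Anything else is
  # returned unchanged and is assumed to be UTF-8 already.
  def to_utf8(body) when is_binary(body) do
    # No "u" modifier, so the regex matches on raw bytes and does not
    # require the body to be valid UTF-8.
    declaration = ~r/^\s*<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']/

    case Regex.run(declaration, body) do
      [_whole, encoding] -> convert(String.downcase(encoding), body)
      nil -> body
    end
  end

  # Erlang's :unicode module can transcode Latin-1 (ISO-8859-1) directly.
  defp convert(encoding, body) when encoding in ["iso-8859-1", "latin1"] do
    :unicode.characters_to_binary(body, :latin1, :utf8)
  end

  # Other encodings would need a separate transcoding library.
  defp convert(_other, body), do: body
end
```

With this in place the parse call from the question would become :mochiweb_html.parse(FeedEncoding.to_utf8(body)). The declaration in the converted body still says iso-8859-1, which should not matter here because mochiweb_html does not act on it.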

Answer 2 · 2017-05-31T17:01:24.000Z

Ok, thanks!