Parsing HTML not in UTF-8
Closed this issue · 2 comments
edevil commented
I'm trying to parse an HTML page (an RSS feed in this case) that is not in UTF-8. The XML preamble specifies:
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...
Parsing this (example in Elixir):
%HTTPoison.Response{body: body} = HTTPoison.get!("http://manybooks.net/index.xml")
:mochiweb_html.parse(body)
...
{"description", [],
[<<60, 112, 62, 84, 104, 101, 32, 77, 121, 116, 104, 32, 72, 117, 110,
116, 101, 114, 32, 66, 111, 111, 107, 32, 49, 32, 60, 47, 112, 62,
...>>]}, {"pubdate", [], ["Thu, 09 Jul 15 02:37:08 -0600"]},
...
we get some content that is not decoded correctly. Is there some way for mochiweb to return the correct results, or do I always have to pass UTF-8 encoded input to it?
etrepum commented
mochiweb_html only supports UTF-8 input, so you'll have to deal with any
encodings specified in XML processing instructions beforehand.
You could use mochiweb_html:tokens(Body) to check whether there's a
processing instruction that specifies the encoding, then convert and
re-parse if necessary. This will work for encodings that are a superset of ASCII.
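As a rough illustration of the "deal with the encoding beforehand" step, here is a minimal sketch in Elixir. It does not use mochiweb_html:tokens/1; instead it swaps in a plain regex over the leading XML declaration, which relies on the same ASCII-superset assumption. The module name FeedEncoding and its functions are hypothetical helpers, not part of mochiweb, and only ISO-8859-1 is handled here.

```elixir
defmodule FeedEncoding do
  # Hypothetical helper module for illustration; not part of mochiweb.

  # Returns the encoding named in a leading XML declaration (lowercased),
  # or nil when no declaration with an encoding is found.
  def declared_encoding(body) when is_binary(body) do
    # Byte-oriented match: safe on non-UTF-8 input as long as the
    # encoding is a superset of ASCII.
    regex = ~r/\A\s*<\?xml[^>]*?encoding=["']([A-Za-z0-9._-]+)["']/

    case Regex.run(regex, body) do
      [_, enc] -> String.downcase(enc)
      nil -> nil
    end
  end

  # Converts the body to UTF-8 when it is declared as ISO-8859-1;
  # any other body is returned unchanged (assumed to be UTF-8 already).
  def to_utf8(body) when is_binary(body) do
    case declared_encoding(body) do
      enc when enc in ["iso-8859-1", "latin1"] ->
        # Erlang stdlib: every latin1 byte maps to one Unicode code point.
        :unicode.characters_to_binary(body, :latin1, :utf8)

      _ ->
        body
    end
  end
end
```

With this in place, the original call could become something like `body |> FeedEncoding.to_utf8() |> :mochiweb_html.parse()`. Other declared encodings would need their own conversion step, which this sketch leaves out.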
edevil commented
Ok, thanks!