Unicode problems with xmerl
pboehm opened this issue · 3 comments
First, I want to thank you for this really useful library. However, I run into an error when I pipe text containing non-ASCII characters into the xpath/2
function. I have been unable to resolve this problem, so I hope you have an idea for how to fix it.
Interactive Elixir (1.0.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> import SweetXml
nil
iex(2)> "<title>Hallöchen</title>" |> xpath(~x"//title/text()")
3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,246}}}}
** (exit) {:fatal, {{:error, {:wfc_Legal_Character, {:error, {:bad_character, 246}}}}, {:file, :file_name_unknown}, {:line, 1}, {:col, 14}}}
xmerl_scan.erl:4102: :xmerl_scan.fatal/2
xmerl_scan.erl:2703: :xmerl_scan.scan_char_data/5
xmerl_scan.erl:2615: :xmerl_scan.scan_content/11
xmerl_scan.erl:2128: :xmerl_scan.scan_element/12
xmerl_scan.erl:570: :xmerl_scan.scan_document/2
xmerl_scan.erl:286: :xmerl_scan.string/2
lib/sweet_xml.ex:133: SweetXml.parse/1
lib/sweet_xml.ex:177: SweetXml.xpath/2
Hi pboehm,
Sadly, it looks like this is a limitation of xmerl.
See below the snippet from the introduction of the xmerl user guide.
http://www.erlang.org/doc/apps/xmerl/xmerl_ug.html
It seems surprising really, given Erlang's European roots.
You may want to try the Erlang mailing list or perhaps the Elixir one.
There are two known shortcomings in xmerl:
- It cannot retrieve external entities on the Internet by a URL reference, only resources in the local file system.
- xmerl can parse Unicode encoded data. But, it fails on tag names, attribute names and other mark-up names that are encoded Unicode characters not mapping on ASCII.
I guess as a last resort you could substitute an ASCII character?
I know that sucks, so let us know if you find a better solution.
Good Luck,
-doug.
Thanks for your reply.
I've already read about these limitations. My solution was to not use Elixir for this problem ... ;-(
I will close this issue.
I came up with a really crazy workaround to remove unprintable characters and then put them back.
Though you could just filter out unprintable characters if you don't need them at all. See: https://angelika.me/2017/07/11/print-my-string-elixir/
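# Replace every unprintable codepoint with a "$0x<hex>;" placeholder,
# where <hex> is the lowercase hex encoding of the codepoint's UTF-8 bytes.
def replace_unprintable(str) do
  str
  |> String.codepoints()
  |> Enum.map(fn c ->
    if String.printable?(c) do
      c
    else
      "$0x#{Base.encode16(c, case: :lower)};"
    end
  end)
  |> Enum.join("")
end

# Turn the "$0x<hex>;" placeholders back into the original bytes.
# The pattern accepts one or more hex byte pairs, because an unprintable
# codepoint can be more than one byte long in UTF-8 (U+0080 becomes "c280").
def revert_unprintable(str) do
  re = Regex.compile!("\\$0x((?:[0-9a-f]{2})+);")

  Regex.split(re, str, include_captures: true)
  |> Enum.map(fn s ->
    m = Regex.run(re, s)

    if m do
      List.last(m)
      |> Base.decode16!(case: :lower)
    else
      s
    end
  end)
  |> Enum.join("")
end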
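
For the simpler option mentioned above (just dropping unprintable characters when you don't need them back), a minimal sketch could look like the following. The module and function names (PrintableFilter, strip_unprintable) are made up for illustration and are not part of SweetXml or xmerl. It relies only on String.printable?/1 from the standard library.

```elixir
defmodule PrintableFilter do
  # Hypothetical helper, not part of SweetXml or xmerl.
  # Keeps only the codepoints that String.printable?/1 accepts and
  # drops everything else (control characters, invalid bytes).
  def strip_unprintable(str) do
    str
    |> String.codepoints()
    |> Enum.filter(&String.printable?/1)
    |> Enum.join()
  end
end

# The NUL byte is dropped, the rest is kept.
PrintableFilter.strip_unprintable("ab" <> <<0>> <> "c")
# => "abc"
```

Keep in mind that String.printable?/1 treats characters such as "ö" as printable, so neither this filter nor the placeholder workaround changes text like "Hallöchen". Both only affect control characters and invalid bytes.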