Unicode problems with xmerl
pboehm opened this issue · 3 comments
First, I want to thank you for this really useful library. However, I ran into an error when piping text containing non-ASCII characters into the xpath/2
function. I have been unable to resolve this problem, so I hope you have an idea of how to fix it.
Interactive Elixir (1.0.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> import SweetXml
iex(2)> "<title>Hallöchen</title>" |> xpath(~x"//title/text()")
3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,246}}}}
** (exit) {:fatal, {{:error, {:wfc_Legal_Character, {:error, {:bad_character, 246}}}}, {:file, :file_name_unknown}, {:line, 1}, {:col, 14}}}
xmerl_scan.erl:4102: :xmerl_scan.fatal/2
xmerl_scan.erl:2703: :xmerl_scan.scan_char_data/5
xmerl_scan.erl:2615: :xmerl_scan.scan_content/11
xmerl_scan.erl:2128: :xmerl_scan.scan_element/12
xmerl_scan.erl:570: :xmerl_scan.scan_document/2
xmerl_scan.erl:286: :xmerl_scan.string/2
lib/sweet_xml.ex:133: SweetXml.parse/1
lib/sweet_xml.ex:177: SweetXml.xpath/2
Hi pboehm,
Sadly, it looks like this is a limitation of xmerl.
See below the snippet from the introduction of the xmerl user guide.
It seems surprising really, given Erlang's European roots.
You may want to try the Erlang mailing list or perhaps the Elixir one.
There are two known shortcomings in xmerl:
- It cannot retrieve external entities on the Internet by a URL reference,
  only resources in the local file system.
- xmerl can parse Unicode encoded data. But, it fails on tag names, attribute names
  and other mark-up names that are encoded Unicode characters not mapping on ASCII.
I guess as a last resort you could substitute an ASCII character?
I know that sucks, so let us know if you find a better solution.
Good Luck,
Thanks for your reply.
I've already read about these limitations. My solution was not using Elixir for solving the problem ... ;-(
I will close this issue
I came up with a really crazy workaround to remove unprintable characters and then put them back.
Though you could just filter out unprintable characters if you don't need them at all. See: https://angelika.me/2017/07/11/print-my-string-elixir/
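# Replace every unprintable codepoint with a "$0x<hex>;" placeholder.
def replace_unprintable(str) do
  str
  |> String.codepoints()
  |> Enum.map(fn c ->
    if String.printable?(c) do
      c
    else
      "$0x#{Base.encode16(c, case: :lower)};"
    end
  end)
  |> Enum.join("")
end

# Turn the "$0x<hex>;" placeholders back into the original bytes.
def revert_unprintable(str) do
  # One or more hex pairs, since a codepoint may span several bytes.
  re = Regex.compile!("\\$0x((?:[0-9a-f]{2})+);")

  Regex.split(re, str, include_captures: true)
  |> Enum.map(fn s ->
    case Regex.run(re, s) do
      [_, hex] -> Base.decode16!(hex, case: :lower)
      nil -> s
    end
  end)
  |> Enum.join("")
end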
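
If the unprintable characters are not needed at all, the simpler filtering approach mentioned above could look roughly like the sketch below. The module name `PrintableFilter` and the function `strip_unprintable/1` are made up for illustration; only `String.codepoints/1`, `String.printable?/1`, `Enum.filter/2` and `Enum.join/1` come from the Elixir standard library. Note that this drops data for good, so it only fits when the removed characters carry no meaning.

```elixir
defmodule PrintableFilter do
  # Hypothetical helper: keeps only the codepoints that
  # String.printable?/1 accepts and silently drops the rest.
  def strip_unprintable(str) when is_binary(str) do
    str
    |> String.codepoints()
    |> Enum.filter(&String.printable?/1)
    |> Enum.join()
  end
end

# "\u0001" is a control character, so it is removed.
PrintableFilter.strip_unprintable("Hall\u0001o")
# => "Hallo"
```

Characters such as "ö" are printable according to `String.printable?/1`, so they pass through this filter unchanged.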