kbrw/sweet_xml

Unicode problems with xmerl

pboehm opened this issue · 3 comments

First, I want to thank you for this really useful library. However, I ran into an error when piping text containing non-ASCII characters into the xpath/2 function. I haven't been able to resolve it myself, so I hope you have an idea how to fix it.

Interactive Elixir (1.0.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> import SweetXml
nil
iex(2)> "<title>Hallöchen</title>" |> xpath(~x"//title/text()")
3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,246}}}}
** (exit) {:fatal, {{:error, {:wfc_Legal_Character, {:error, {:bad_character, 246}}}}, {:file, :file_name_unknown}, {:line, 1}, {:col, 14}}}
    xmerl_scan.erl:4102: :xmerl_scan.fatal/2
    xmerl_scan.erl:2703: :xmerl_scan.scan_char_data/5
    xmerl_scan.erl:2615: :xmerl_scan.scan_content/11
    xmerl_scan.erl:2128: :xmerl_scan.scan_element/12
    xmerl_scan.erl:570: :xmerl_scan.scan_document/2
    xmerl_scan.erl:286: :xmerl_scan.string/2
    lib/sweet_xml.ex:133: SweetXml.parse/1
    lib/sweet_xml.ex:177: SweetXml.xpath/2
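
A note on the number in the error: 246 is the Unicode code point of "ö" (U+00F6), whereas its UTF-8 encoding is the two bytes 195 and 182. One possible reading of the trace, offered as an assumption rather than a confirmed diagnosis, is that the scanner was handed a list of code points where it expected UTF-8 bytes. The sketch below only shows the difference between the two list forms; the module name CharlistDemo is made up for illustration, and String.to_charlist/1 is the newer name for what Elixir 1.0 called String.to_char_list/1.

```elixir
defmodule CharlistDemo do
  # One integer per character (Unicode code points).
  def code_points(str), do: String.to_charlist(str)

  # One integer per byte of the UTF-8 encoded binary.
  def utf8_bytes(str), do: :binary.bin_to_list(str)
end

# Print both forms as plain integer lists.
IO.inspect(CharlistDemo.code_points("ö"), charlists: :as_lists)
# prints [246]
IO.inspect(CharlistDemo.utf8_bytes("ö"), charlists: :as_lists)
# prints [195, 182]
```

Whether passing the byte list to :xmerl_scan.string/1 avoids the error is something to verify against your own OTP version; it is not confirmed anywhere in this thread.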

Hi pboehm,

Sadly, it looks like this is a limitation of xmerl.
See the snippet below, taken from the introduction of the xmerl user guide:
http://www.erlang.org/doc/apps/xmerl/xmerl_ug.html

It seems surprising, really, given Erlang's European roots.
You may want to try the Erlang mailing list, or perhaps the Elixir one.

There are two known shortcomings in xmerl:

It cannot retrieve external entities on the Internet by a URL reference, 
only resources in the local file system.

xmerl can parse Unicode encoded data. But, it fails on tag names, attribute names 
and other mark-up names that are encoded Unicode characters not mapping on ASCII.

I guess as a last resort you could substitute an ASCII character for each non-ASCII one?
I know that sucks, so let us know if you find a better solution.
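
One way to sketch that substitution without losing information is to replace each non-ASCII character with an XML numeric character reference such as &#246;, which is plain ASCII. This is only an illustration: the module and function names are hypothetical, and it assumes the non-ASCII characters appear in text content or attribute values, since character references are not allowed in tag names and are not expanded inside CDATA sections.

```elixir
defmodule AsciiXml do
  # Rewrites every code point above 127 as a numeric character
  # reference ("&#NNN;") and leaves ASCII characters untouched.
  def escape_non_ascii(xml) when is_binary(xml) do
    for <<cp::utf8 <- xml>>, into: "" do
      if cp > 127, do: "&##{cp};", else: <<cp>>
    end
  end
end

IO.puts(AsciiXml.escape_non_ascii("<title>Hallöchen</title>"))
# prints <title>Hall&#246;chen</title>
```

The escaped string contains only ASCII bytes, so its code points and its bytes are the same list. Whether the parser then returns the original character in the extracted text should be checked before relying on this.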

Good Luck,

-doug.

Thanks for your reply.

I had already read about these limitations. In the end, my solution was to not use Elixir for this problem ... ;-(

I will close this issue.

I came up with a really crazy workaround to remove unprintable characters and then put them back.

Though you could just filter out unprintable characters if you don't need them at all. See: https://angelika.me/2017/07/11/print-my-string-elixir/

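# Replaces each unprintable code point with an ASCII placeholder of the
# form "$0x<hex of its UTF-8 bytes>;", e.g. <<0>> becomes "$0x00;".
def replace_unprintable(str) do
  str
  |> String.codepoints()
  |> Enum.map(fn c ->
    if String.printable?(c) do
      c
    else
      "$0x#{Base.encode16(c, case: :lower)};"
    end
  end)
  |> Enum.join("")
end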

# Turns "$0x<hex>;" placeholders back into the original bytes.
# The pattern accepts one or more hex byte pairs, because an unprintable
# code point outside ASCII is encoded as several UTF-8 bytes.
def revert_unprintable(str) do
  re = Regex.compile!("\\$0x((?:[0-9a-f]{2})+);")

  Regex.replace(re, str, fn _match, hex ->
    Base.decode16!(hex, case: :lower)
  end)
end
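
For the simpler option mentioned above, dropping unprintable characters instead of round-tripping them, a minimal sketch might look like the following. The module and function names are made up for illustration. Note that String.printable?/1 treats characters such as "ö" as printable, so this kind of filtering targets control characters like NUL rather than the non-ASCII text from the original report.

```elixir
defmodule Unprintable do
  # Keeps only the code points that String.printable?/1 accepts.
  def strip(str) when is_binary(str) do
    str
    |> String.codepoints()
    |> Enum.filter(&String.printable?/1)
    |> Enum.join()
  end
end

IO.inspect(Unprintable.strip("a" <> <<0>> <> "b"))
# prints "ab"
IO.inspect(Unprintable.strip("Hallöchen"))
# prints "Hallöchen"
```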