bitwalker/toml-elixir

Invalid characters in input are not properly handled in at least one case

RichMorin opened this issue · 4 comments

I have a TOML file containing text that was pasted in from a web page. As a consequence, it contains a UTF-16 right quote in a triple-quoted (''') string. Here is a cut-down example:

defmodule Foo do
  def toml_test() do
    str = """
    [ about ]

      verbose = '''
    This is a UTF-16 right quote: "’"
      '''
    """

    Toml.decode(str)
  end
end

I've had no problem parsing this in Ruby (using the toml-rb gem), and it also parses successfully on the Data Format Converter. However, when I try parsing it with Toml.decode, it crashes with a rather unhelpful nastygram:

** (EXIT from #PID<0.174.0>) shell process exited with reason: an exception was raised:
    ** (ArgumentError) argument error
        :erlang.iolist_to_binary([84, 104, 105, 115, 32, 105, 115, 32, 97, 32,
           99, 117, 114, 108, 121, 32, 115, 105, 110, 103, 108, 101, 32,
           113, 117, 111, 116, 101, 58, 32, 34, 8217, 34, 10, 32, 32])
        (toml) lib/lexer/string.ex:69: Toml.Lexer.String.lex_literal/5
        (toml) lib/lexer.ex:191: Toml.Lexer.lex/6
        (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

If the TOML spec allows UTF-16, the parsing code should comply. If not, the code should return an appropriate error value instead of crashing.

@RichMorin TOML documents must be valid UTF-8 documents, and the following is from the spec with regard to strings specifically:

All strings must contain only valid UTF-8 characters.

The toml-rb gem is, I suspect, parsing byte-by-byte, and not as UTF-8 codepoints; likewise with the other link you posted. In any case, regardless of why they don't fail during parsing, they should be returning an error, as the document is not valid TOML according to the spec.

That said, toml-elixir must have a bug in the lexer which is not handling invalid characters when parsing that style of string, so I'll update this to track that as the issue instead.

It appears that the character in question is actually valid UTF-8, just not 7-bit US-ASCII:

$ echo "’" | od -t x1
0000000    e2  80  99  0a

This character is RIGHT SINGLE QUOTATION MARK (U+2019), so my take is that it should be legal. That said, the lexer should also be fixed to handle illegal characters more gracefully.
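The byte-versus-codepoint distinction behind this can be checked directly in Elixir; a minimal sketch (assuming an iex session with a UTF-8 source encoding):

```elixir
# The curly quote is one Unicode codepoint (U+2019), which UTF-8
# encodes as the three bytes seen in the od output above.
quote_str = "’"

String.to_charlist(quote_str)   # => [8217]            (one codepoint)
:binary.bin_to_list(quote_str)  # => [226, 128, 153]   (0xE2 0x80 0x99)
byte_size(quote_str)            # => 3
String.length(quote_str)        # => 1
```

Note that 8217 is exactly the value that appears in the crashing list passed to :erlang.iolist_to_binary in the stack trace above.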

@RichMorin Looks like the problem was that I was using iodata_to_binary instead of chardata_to_binary, so the former was choking on codepoints outside the byte range when building up the final parsed string. The fix is released as 0.5.1 on Hex. Thanks for the bug report!
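The failure mode can be sketched with the standard IO functions (this is an illustration of the byte/codepoint mismatch, not the library's actual lexer code):

```elixir
# "The: ’" as a list of Unicode codepoints.
chars = [84, 104, 101, 58, 32, 8217]

# iodata may only contain bytes (0..255) and binaries, so the
# codepoint 8217 makes this raise, mirroring the crash above:
#   IO.iodata_to_binary(chars)  # ** (ArgumentError) argument error

# chardata may contain full Unicode codepoints, so this succeeds
# and UTF-8-encodes the quote:
IO.chardata_to_string(chars)    # => "The: ’"
```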

Got it; seems to solve the problem. Thanks for the quick fix!