nst/JSONTestSuite

Shouldn't "\UA66D" be valid, if not a unicode sequence?

renatoathaydes opened this issue · 3 comments

The test test_parsing/n_string_unicode_CapitalU.json considers that "\UA66D" should fail to parse.

But I am confused by this because the spec says that any character may be escaped, hence it's not illegal to escape U? Sure, this is not supposed to be a unicode sequence because unicode sequences use lowercase u... but why shouldn't the parser accept this as the string UA66D?

Would this also be illegal?

\Uzzzz

I believe the word "escaped" is being used for two different concepts:

  • escaping as in \ followed by an escaped character.
  • escaping as in using \u<unicode>.

I don't know why the same word is used in both cases though, it's just confusing.

I assume that the first variety can only be used to escape the characters explicitly mentioned in the RFC:

    escape (
              %x22 /          ; "    quotation mark  U+0022
              %x5C /          ; \    reverse solidus U+005C
              %x2F /          ; /    solidus         U+002F
              %x62 /          ; b    backspace       U+0008
              %x66 /          ; f    form feed       U+000C
              %x6E /          ; n    line feed       U+000A
              %x72 /          ; r    carriage return U+000D
              %x74 /          ; t    tab             U+0009
              %x75 4HEXDIG )  ; uXXXX                U+XXXX

All other characters MAY be escaped using the \u notation. That I think makes sense.
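As a quick sanity check of this reading, here is a minimal sketch using Python's standard-library json module. It is only one parser among many and not the test suite's own harness; the helper name parses is made up for illustration.

```python
import json


def parses(text: str) -> bool:
    """Return True if Python's json module accepts `text` as a JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Raw strings keep the backslashes intact, so the JSON parser sees them as written.
candidates = {
    "lowercase u + 4 hex digits": r'"\uA66D"',
    "capital U": r'"\UA66D"',
    "capital U, non-hex": r'"\Uzzzz"',
    "escaped solidus": r'"\/"',
    "escaped z": r'"\z"',
}

for label, text in candidates.items():
    print(f"{label:28} {text:12} accepted={parses(text)}")
```

With this parser, only the escapes listed in the grammar are accepted; a backslash followed by any other character, including a capital U, is rejected.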

I would say that calling the \u notation "escaping" is very misleading: it's not escaping the character, it's using its encoded form... but I guess RFC authors are not known for their ability to use words unambiguously.

Hope someone can confirm my interpretation is correct.

I believe you are referring to:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point.

The form \uhhhh can denote any character from U+0000 to U+FFFF. The form \U is also valid in some languages such as C and Python (but not in JSON), where it is followed by eight hexadecimal digits, \Uhhhhhhhh. In JSON, a character from U+10000 to U+10FFFF (the upper limit of Unicode) is instead written as two \u escapes that encode its UTF-16 surrogate pair.
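To illustrate how JSON covers characters above U+FFFF with two \u escapes, here is a small Python sketch. The helper to_surrogate_pair is a hypothetical name written for this example; the arithmetic is the standard UTF-16 surrogate encoding, and U+1F600 is just an arbitrary sample code point.

```python
import json


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)

# Build the JSON text with two \u escapes, then let the json module decode it.
json_text = '"\\u{:04X}\\u{:04X}"'.format(high, low)
decoded = json.loads(json_text)

# Going the other way, json.dumps escapes non-ASCII characters by default.
encoded = json.dumps(decoded)

print(json_text, decoded == "\U0001F600", encoded)
```

Note that the \U0001F600 in the last line is Python's own string-literal syntax, not JSON; the JSON text itself only ever contains lowercase \u escapes.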

Getting back to your original question: \u (lowercase) is the escape sequence for any character, and a capital \U is not part of the JSON grammar.

Two more notes, the first from RFC-8259:

The representation of strings is similar to conventions used in the C family of programming languages.

The second from Wikipedia:

A sequence such as \z is not a valid escape sequence according to the C standard as it is not found in the table above.