Shouldn't "\UA66D" be valid, if not a unicode sequence?
renatoathaydes opened this issue · 3 comments
The test test_parsing/n_string_unicode_CapitalU.json considers that "\UA66D" should fail to parse.
But I am confused by this because the spec says that any character may be escaped, hence it's not illegal to escape U? Sure, this is not supposed to be a Unicode sequence because Unicode sequences use lowercase u... but why shouldn't the parser accept this as the string UA66D?
Would this also be illegal?
\Uzzzz
I believe the word "escaped" is being used for two different concepts:
- escaping as in \ followed by an escaped character.
- escaping as in using \u<unicode>.
I don't know why the same word is used in both cases though, it's just confusing.
I assume that the first variety can only be used to escape the characters explicitly mentioned in the RFC:
escape (
%x22 / ; " quotation mark U+0022
%x5C / ; \ reverse solidus U+005C
%x2F / ; / solidus U+002F
%x62 / ; b backspace U+0008
%x66 / ; f form feed U+000C
%x6E / ; n line feed U+000A
%x72 / ; r carriage return U+000D
%x74 / ; t tab U+0009
%x75 4HEXDIG ) ; uXXXX U+XXXX
All other characters MAY be escaped using the \u notation. That I think makes sense.
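To make that reading of the grammar concrete, here is a minimal Python sketch that flags every backslash sequence the quoted rule does not allow. The helper name invalid_escapes and the regular expression are illustrative assumptions, not part of the RFC or of the test suite, and the sketch checks escapes only (it ignores the rule about unescaped control characters).

```python
import re

# Escapes allowed by the grammar quoted above: one of " \ / b f n r t,
# or a lowercase u followed by exactly four hexadecimal digits.
_ESCAPE = re.compile(r'\\(?:["\\/bfnrt]|u[0-9A-Fa-f]{4})')


def invalid_escapes(body: str) -> list[str]:
    """Return each backslash sequence in `body` that the grammar rejects.

    `body` is the raw text between the quotes of a JSON string.
    (Hypothetical helper, for illustration only.)
    """
    bad = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            i += 1
            continue
        match = _ESCAPE.match(body, i)
        if match:
            # A well-formed escape: skip past all of it.
            i = match.end()
        else:
            # Record the backslash and the character after it (if any).
            bad.append(body[i:i + 2])
            i += 2
    return bad


print(invalid_escapes(r"\uA66D"))  # []
print(invalid_escapes(r"\UA66D"))  # ['\\U']
print(invalid_escapes(r"\Uzzzz"))  # ['\\U']
```

Under this reading, \U is rejected because capital U is simply not in the list, regardless of what follows it.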
I would say that calling the \u notation "escaping" is very misleading: it's not escaping the character, it's using its encoded form... but I guess RFC authors are not known for their ability to use words unambiguously.
Hope someone can confirm my interpretation is correct.
I believe you are referring to:
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point.
The form \uhhhh can denote any character from U+0000 to U+FFFF. The form \U is also a valid form (but not in JSON) and is followed by eight hexadecimal digits in the form \Uhhhhhhhh. In JSON, a character above U+FFFF (U+10000 to U+10FFFF) is instead written as two \u escapes that together form a UTF-16 surrogate pair.
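As a rough illustration of how one character beyond U+FFFF becomes two \u escapes, here is a small Python sketch using the standard UTF-16 surrogate-pair arithmetic. The helper name to_json_escapes is made up for this example. Python string literals happen to support the eight-digit \U form, which is used below to write the input character.

```python
import json


def to_json_escapes(ch: str) -> str:
    """Spell a single character using JSON \\uXXXX escapes.

    (Hypothetical helper, for illustration only.)
    """
    cp = ord(ch)
    if cp <= 0xFFFF:
        # Inside the Basic Multilingual Plane: one escape is enough.
        return f"\\u{cp:04x}"
    # Above the BMP: split the code point into a UTF-16 surrogate pair.
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)
    low = 0xDC00 + (cp & 0x3FF)
    return f"\\u{high:04x}\\u{low:04x}"


escaped = to_json_escapes("\U0001F600")
print(escaped)  # \ud83d\ude00

# A JSON parser turns the pair back into the single original character.
assert json.loads('"' + escaped + '"') == "\U0001F600"
```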
Getting back to your original question, \u is the escape sequence for any character.
Two more notes, the first from RFC 8259:
The representation of strings is similar to conventions used in the C family of programming languages.
The second from Wikipedia:
A sequence such as \z is not a valid escape sequence according to the C standard as it is not found in the table above.
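For what it's worth, at least one widely used parser behaves the same way. The sketch below probes Python's standard json module with the strings from this thread. The wrapper name parses is just a convenience made up for this example, and one parser agreeing is only a data point, not proof of what the RFC requires.

```python
import json


def parses(text: str) -> bool:
    """Report whether Python's json module accepts `text` as a JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Raw strings keep the backslashes exactly as they would appear in a .json file.
print(parses(r'"\uA66D"'))  # True
print(parses(r'"\UA66D"'))  # False
print(parses(r'"\Uzzzz"'))  # False
print(parses(r'"\z"'))      # False
```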