justinmeza/lci

Invalid Unicode code points are accepted


U+D800 (among others) is a UTF-16 surrogate code point, which the Unicode standard forbids from being encoded in UTF-8 or any other encoding form. However, the LOLCODE interpreter accepts ":(D800)" without a warning or error and happily writes the invalid code point to the screen.
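For context, here is a minimal sketch of the check an interpreter could perform before emitting a code point; the function name is hypothetical and this is not lci's actual code:

```c
#include <stdint.h>

/* Returns 1 if cp is a Unicode scalar value (i.e., a code point that
 * UTF-8 may encode), 0 otherwise. Surrogates U+D800..U+DFFF and
 * anything above U+10FFFF are rejected. */
static int is_unicode_scalar(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0; /* UTF-16 surrogate: never a valid scalar value */
    if (cp > 0x10FFFF)
        return 0; /* outside the Unicode code space */
    return 1;
}
```

With a check like this, `:(D800)` could be rejected up front rather than written out as an ill-formed byte sequence.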

Wikipedia says:

> U+D800 to U+DFFF
>
> The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
>
> However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case and Windows allows such sequences in filenames.
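To make the quoted "trivial and obvious ways" concrete: a surrogate value fits the ordinary 3-byte UTF-8 bit pattern, so a naive encoder that skips the surrogate check will happily emit bytes for it. A hypothetical sketch, not lci's actual encoder:

```c
#include <stdint.h>
#include <stdio.h>

/* Naive 3-byte UTF-8 encoding for code points in U+0800..U+FFFF.
 * Without a surrogate check, U+D800 encodes as ED A0 80, a sequence
 * that conforming UTF-8 decoders must treat as an error. */
static void encode3(uint32_t cp, uint8_t out[3])
{
    out[0] = (uint8_t)(0xE0 | (cp >> 12));
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
}

int main(void)
{
    uint8_t b[3];
    encode3(0xD800, b);
    printf("%02X %02X %02X\n", b[0], b[1], b[2]); /* ED A0 80 */
    return 0;
}
```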

In other words, it's not uncommon for software to accept these code points. Is this something that should or will be fixed, is it intentionally allowed, or is the behavior undefined? (I believe the LOLCODE specification doesn't cover this.)