This is a patch for Lua that lets you turn on UTF-8 identifiers by defining ALLOW_UTF8_IDENTIFIERS at compile time. This allows fun identifiers like π = math.pi or φ = (1 + math.sqrt(5)) / 2. It required modifying lctype.c, lctype.h, and llex.c and adding a new header, unicodeid.h.
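For example, with a Lua built with ALLOW_UTF8_IDENTIFIERS defined, a chunk like the following loads and runs. (The π and φ names come from the examples above; the 半径 name is just an extra illustration.)

    local π = math.pi
    local φ = (1 + math.sqrt(5)) / 2
    local 半径 = 3                 -- "radius"; CJK ideographs have XID_Start
    print(π * 半径 ^ 2)            --> area of a circle with radius 3
    print(φ ^ 2 - φ - 1)           --> approximately 0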
The first step is to treat the bytes 0x80-0xBF and 0xC2-0xF4 as alphabetic and save them in a buffer like the usual Lua identifier characters ("[%a_][%w_]*"). This is similar to an existing approach to allowing "Unicode identifiers", but without any further steps, you can have crazy things like whitespace identifiers (not to mention invalid UTF-8)!
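To make those byte ranges concrete, here is the acceptance test as a standalone Lua function; the patch itself does this through the character-class table in lctype.c/lctype.h rather than through a function like this.

    -- Bytes the patched lexer buffers as part of an identifier, beyond the
    -- ASCII letters, digits, and '_' that vanilla Lua already accepts.
    local function is_extended_ident_byte(b)
      return (b >= 0x80 and b <= 0xBF)   -- UTF-8 continuation bytes
          or (b >= 0xC2 and b <= 0xF4)   -- bytes that can start a UTF-8 sequence
    end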
To rule those out, if the buffer contains non-ASCII bytes, the lexer then validates the potential identifier: it calls a function that steps through the buffer, decodes the byte sequences into code points, and checks that each code point is allowed in an identifier.
The code points are compared to two arrays of ranges of code points. One array contains code points with the XID_Start property, which are allowed at the beginning of an identifier (like ASCII alphabetic characters and underscore in vanilla Lua); the other contains code points with the XID_Continue property, which include the XID_Start code points as well as code points that can only appear after the first code point (like ASCII digits in vanilla Lua). This is a fairly complete way of defining Unicode identifiers and is used by Rust and Python 3, for instance. Python additionally converts identifiers to NFKC (Normalization Form KC, compatibility composition) internally and Rust converts them to NFC (Normalization Form C, canonical composition), which requires more Unicode data. See Unicode® Standard Annex #31: Unicode Identifier and Pattern Syntax § Default Identifiers for more information.
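Here is a rough, self-contained sketch of the whole check in Lua (the patch does it in C). The ranges below are only a handful of illustrative entries standing in for the generated tables, the binary search is just one plausible way to do the lookup, the underscore handling mirrors vanilla Lua but is an assumption about the patch, and utf8.codes stands in for the patch's own decoder (it raises an error on malformed UTF-8).

    -- Illustrative ranges only; the real tables cover all of
    -- DerivedCoreProperties.txt.
    local xid_start_ranges = {
      {0x0041, 0x005A},   -- A..Z
      {0x0061, 0x007A},   -- a..z
      {0x00C0, 0x00D6},   -- Latin-1 letters À..Ö
      {0x03B1, 0x03C9},   -- Greek small letters α..ω
      {0x3041, 0x3096},   -- Hiragana
    }
    local xid_continue_ranges = {
      {0x0030, 0x0039},   -- ASCII digits
      {0x0041, 0x005A},
      {0x0061, 0x007A},
      {0x00C0, 0x00D6},
      {0x0300, 0x036F},   -- combining diacritical marks
      {0x03B1, 0x03C9},
      {0x3041, 0x3096},
    }

    -- Binary search over sorted, non-overlapping {first, last} ranges.
    local function in_ranges(ranges, cp)
      local lo, hi = 1, #ranges
      while lo <= hi do
        local mid = (lo + hi) // 2
        local r = ranges[mid]
        if cp < r[1] then hi = mid - 1
        elseif cp > r[2] then lo = mid + 1
        else return true end
      end
      return false
    end

    -- '_' (0x5F) is allowed everywhere, as in vanilla Lua; in Unicode it is
    -- XID_Continue but not XID_Start.
    local function is_xid_start(cp)    return cp == 0x5F or in_ranges(xid_start_ranges, cp) end
    local function is_xid_continue(cp) return cp == 0x5F or in_ranges(xid_continue_ranges, cp) end

    local function is_valid_identifier(name)
      local first = true
      for _, cp in utf8.codes(name) do   -- errors on invalid UTF-8
        if first and not is_xid_start(cp) then return false end
        if not first and not is_xid_continue(cp) then return false end
        first = false
      end
      return true
    end

    print(is_valid_identifier("π"))        --> true
    print(is_valid_identifier("\u{2000}")) --> false (en quad is not allowed)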
If an identifier is not valid UTF-8 or contains a code point that is not allowed, an error is thrown at compile time. Only identifiers are validated; strings and comments can still contain invalid UTF-8.
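For instance (assuming a patched build; load returns nil and a message on a compile-time error rather than raising):

    -- A stray 0xC3 byte is fine inside a string literal...
    assert(load('return "\xC3"'))
    -- ...but as an identifier it is rejected, because a lone lead byte
    -- is not valid UTF-8.
    local chunk, err = load('\xC3 = 1')
    assert(chunk == nil)
    print(err)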
The validation prohibits whitespace characters like an en quad (U+2000) in identifiers (Lua list post). But no Unicode normalization is performed (related Lua list post), so, confusingly, á (U+00E1: Latin small letter a with acute) is a different identifier from á (U+0061, U+0301: Latin small letter a, combining acute accent), and 한 (U+D55C: Hangul syllable han) is different from 한 (U+1112, U+1161, U+11AB: Hangul choseong hieuh, Hangul jungseong a, Hangul jongseong nieun), even though they look the same (if you have fonts that support the characters). And identifiers that mix several different scripts aren't screened out: functiοn is a valid identifier and is different from the keyword function because it has a Greek small letter omicron (U+03BF) in place of a Latin small letter o (U+006F).
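A small demonstration of the normalization point, with the identifier bytes written as \u escapes so the difference stays visible (again assuming a patched build):

    -- Two chunks that assign to what *looks* like the same global 'á':
    -- one uses the precomposed form, the other the decomposed form.
    load("\u{E1} = 'precomposed'")()           -- U+00E1
    load("\u{61}\u{301} = 'decomposed'")()     -- U+0061 U+0301
    print(_G["\u{E1}"])                        --> precomposed
    print(_G["\u{61}\u{301}"])                 --> decomposed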
The validation functions are found in the new header, unicodeid.h. They use data that is generated by a Lua script from DerivedCoreProperties.txt; the current version is based on Unicode 11.0.
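For context, extracting the ranges from DerivedCoreProperties.txt is straightforward; here is a sketch of the kind of extraction such a generator performs (the patch's actual script and its output format may differ):

    -- Collect {first, last} code point ranges for one derived property,
    -- e.g. "XID_Start" or "XID_Continue", from DerivedCoreProperties.txt.
    local function collect_ranges(path, property)
      local ranges = {}
      for line in io.lines(path) do
        -- Lines look like "0041..005A ; XID_Start # ..." or "00AA ; XID_Start # ...".
        local first, last, prop = line:match("^(%x+)%.?%.?(%x*)%s*;%s*([%w_]+)")
        if prop == property then
          first = tonumber(first, 16)
          last = (last ~= "") and tonumber(last, 16) or first
          ranges[#ranges + 1] = {first, last}
        end
      end
      table.sort(ranges, function(a, b) return a[1] < b[1] end)
      return ranges
    end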
To do: allow zero-width joiner and zero-width non-joiner in limited circumstances (Public Review Issue #96)? Background: http://unicode.org/review/pr-37.pdf