
Unsupport Unicode


Supporting Unicode in the Tact grammar with security and IDE support in mind is not a task we want to spend any time on. There are not that many use cases for Unicode in contracts in the first place. Worse, JS has no proper native UTF-8/UTF-16 support: its strings are sequences of UTF-16 code units rather than code points, and even ohm.js doesn't correctly handle surrogate pairs.
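To illustrate the surrogate-pair problem, here is a minimal TypeScript sketch of how built-in JS strings treat a character outside the Basic Multilingual Plane. It shows only the standard string behaviour, not ohm.js itself, and the variable names are made up for this example.

```typescript
// U+1F600 lies outside the Basic Multilingual Plane, so a JS string
// stores it as a surrogate pair: two UTF-16 code units.
const grin = "\u{1F600}";

// `length` counts UTF-16 code units, not characters.
const codeUnits = grin.length; // 2

// Array.from iterates by code point, so it sees one element.
const codePoints = Array.from(grin).length; // 1

// charCodeAt returns only the first half of the pair (a high surrogate).
const firstUnit = grin.charCodeAt(0).toString(16); // "d83d"

// codePointAt decodes the whole pair when called on its first unit.
const firstPoint = grin.codePointAt(0)!.toString(16); // "1f600"

console.log({ codeUnits, codePoints, firstUnit, firstPoint });
```

Any tool that indexes or slices by code unit can therefore cut such a character in half, which is the kind of bug a parser would have to be audited for.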


  • ban Unicode characters everywhere in the grammar
  • but allow them in strings and comments
  • but even in strings and comments, ban all characters that can change the code layout: all line breaks except \n, and all bidirectional (RTL/LTR) control characters
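The three rules above could be sketched roughly as a standalone pre-pass over the source text. Everything named here (`checkSource`, `Violation`, the state names) is hypothetical and not part of the Tact compiler. The sketch assumes `//` and `/* */` comments and double-quoted strings with backslash escapes. It reads "line breaks except \n" as VT, FF, CR, NEL, LS and PS, and reads "RTL/LTR characters" as the Unicode bidirectional control characters. Both readings are interpretations, not something the issue spells out.

```typescript
// Hypothetical sketch: none of these names come from the Tact compiler.
type Violation = { offset: number; codePoint: number; reason: string };
type State = "code" | "lineComment" | "blockComment" | "string";

// Line-break-like characters other than "\n": VT, FF, CR, NEL, LS, PS.
const FORBIDDEN_LINE_BREAKS = new Set<number>([0x0b, 0x0c, 0x0d, 0x85, 0x2028, 0x2029]);

// Bidirectional controls: ALM, LRM, RLM, embeddings/overrides, isolates.
function isBidiControl(cp: number): boolean {
  return (
    cp === 0x061c || cp === 0x200e || cp === 0x200f ||
    (cp >= 0x202a && cp <= 0x202e) ||
    (cp >= 0x2066 && cp <= 0x2069)
  );
}

function checkSource(src: string): Violation[] {
  const violations: Violation[] = [];
  let state: State = "code";
  let i = 0;
  while (i < src.length) {
    const cp = src.codePointAt(i)!; // i is in range, so this is never undefined
    const width = cp > 0xffff ? 2 : 1; // astral code points take two UTF-16 units
    const ch = src.charAt(i);
    const next = src.charAt(i + 1);

    // Layout-changing characters are rejected everywhere;
    // other non-ASCII characters only outside strings and comments.
    if (FORBIDDEN_LINE_BREAKS.has(cp)) {
      violations.push({ offset: i, codePoint: cp, reason: "line break other than \\n" });
    } else if (isBidiControl(cp)) {
      violations.push({ offset: i, codePoint: cp, reason: "bidirectional control character" });
    } else if (cp > 0x7f && state === "code") {
      violations.push({ offset: i, codePoint: cp, reason: "non-ASCII outside string or comment" });
    }

    // Minimal state machine tracking strings and comments.
    switch (state) {
      case "code":
        if (ch === "/" && next === "/") { state = "lineComment"; i += 2; continue; }
        if (ch === "/" && next === "*") { state = "blockComment"; i += 2; continue; }
        if (ch === '"') state = "string";
        break;
      case "lineComment":
        if (ch === "\n") state = "code";
        break;
      case "blockComment":
        if (ch === "*" && next === "/") { state = "code"; i += 2; continue; }
        break;
      case "string":
        // Skip \" and \\ so an escaped quote does not end the string.
        if (ch === "\\" && (next === '"' || next === "\\")) { i += 2; continue; }
        if (ch === '"') state = "code";
        break;
    }
    i += width;
  }
  return violations;
}

// A form feed hidden in a line comment is reported even though comments allow Unicode.
console.log(checkSource("// comment\fconst x = 42;"));
```

Under this reading, `\r` is rejected too, so files with CRLF line endings would need to be normalized first. Whether that is acceptable is a design decision the issue leaves open.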

For reference: there is also the Unicode Source Code Handling technical standard (UTS #55): https://www.unicode.org/reports/tr55.

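Test cases with the form feed and line tabulation characters (written below as the placeholders <U+000C> and <U+000B>, since the raw control characters do not display):

// comment with the form feed character before the constant "x"<U+000C>const x = 42;

// comment with the line tabulation character before the constant "y"<U+000B>const y = 42;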

const x = 42;