JoeStrout/miniscript

[C++] Chinese characters don't work as identifiers

Closed this issue · 14 comments

Hi,
Here's the error I got:

> 问好='hi'
Compiler Error: got Unknown(?) where number, string, or identifier is required [line 1]

Thanks.

BTW I got here from Farmtronics mod.

String literals in the code are enclosed by double quotes (").
Try 问好 = "hi".
Strings in MiniScript are enclosed only in double quotes.

C# MiniScript already does. (But be sure to use proper quotation marks.)
image

...so if it's not working in Farmtronics, you should open an issue there.

But I see that the C++ version of MiniScript (e.g. command-line MiniScript) struggles with this identifier too. So I'll leave this issue open here (with an adjusted title).

> hi = "你好"
> 你好 = "hi"
> hi
你好
> 你好
hi

It works perfectly fine with Chinese characters [for me, in command-line MiniScript on Windows].

Hmm, well it's not working for me in command-line MiniScript on MacOS. So, I'm glad that it's working in some cases, but I'll leave this open until we've pinned down exactly what is going on.

Thank you all for the quick responses. Weird, it doesn't work for me in Windows PowerShell even with double quotes.
[Screenshot 2023-12-11 154350]

MiniScript seems unable to handle some Unicode characters very well.

> helloWorld = "你好世界"
> helloWorld.len
6  // should be 4
> for i in helloWorld
>>> print i
>>> end for  // try to print all characters

愫


澜

// there are unprintable characters

Another example here.

Python does it well enough:

>>> helloWorld = '你好世界'
>>> len(helloWorld)
4
>>> for i in helloWorld:
...     print(i)
...
你
好
世
界

As for Unicode in strings, that's working fine here on MacOS.

image

I wonder if you're on Windows? We've already seen that the Windows terminal some users use does not handle Unicode properly; for example, print char(9824) should print a spade character, but apparently some Windows users see some random Chinese character instead.

Since the Unicode-handling code in MiniScript is the same on all platforms (and does not rely on OS support), I suspect the problem lies in the terminal, where you're typing this code. If you make a script file containing the same test, and be sure to save that script file as UTF-8, does it work then?

(Of course it's odd that Python handles the same test OK, presumably using the same terminal — I'm not sure what to make of that.)

I'm on Windows, and both tests are using the same Windows terminal (command line).

Works in WSL Ubuntu as well:

khuxkm@wslubuntu:~/miniscript/build$ cat test.ms
helloWorld = "你好世界"
print helloWorld.len

for i in helloWorld
        print i
end for
khuxkm@wslubuntu:~/miniscript/build$ ./miniscript test.ms
4
你
好
世
界

Copying my Ubuntu working script to Windows creates the following:

image

Which is... "correct"... but the glyphs are wrong. (Opening the file in notepad confirms that the unicode codepoints are correct.)

Yeah, that sure looks like an issue with the terminal program, not MiniScript itself. Poke around in the settings. I bet there's somewhere to set what encoding it uses to display stdout. It should be set to UTF-8 (at least when using MiniScript).

It appears to be the terminal's fault. The default encoding was GBK, and after changing it to UTF-8, the problems are solved.
(Though I'm still amazed that Python handles the same test OK.)
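
The length of 6 reported earlier is consistent with GBK bytes being counted as if they were UTF-8. Here is a minimal Python sketch of that idea. It assumes (this is not confirmed from the MiniScript source) that string length is computed by counting bytes that are not UTF-8 continuation bytes; `count_utf8_leads` is a hypothetical helper, not a MiniScript function.

```python
text = "你好世界"

gbk_bytes = text.encode("gbk")      # what a GBK terminal would hand over
utf8_bytes = text.encode("utf-8")   # what a UTF-8 terminal would hand over

def count_utf8_leads(data: bytes) -> int:
    """Count bytes that are NOT UTF-8 continuation bytes (10xxxxxx).

    Assumption: this stands in for a UTF-8-aware length routine.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)

print(len(gbk_bytes), count_utf8_leads(gbk_bytes))
print(len(utf8_bytes), count_utf8_leads(utf8_bytes))
```

Under that assumption the 8 GBK bytes are counted as 6 "characters", while the 12 UTF-8 bytes are counted as 4, which matches what was seen before and after switching the terminal encoding.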

Finally, things are clear.

  • When the terminal encoding is GBK, Chinese characters can be identifiers, but they don't work well in strings.

  • When the terminal encoding is UTF-8, Chinese characters don't work as identifiers, but they work well in strings.

Haha, so ironic.

The problem is that the C++ MiniScript passes around chars, assuming them to be equivalent to C# chars, which are Unicode-aware, when in reality, C++ char is 8 bits. In order to properly handle Unicode codepoints above 255, we'd need wchar_t at least (UTF-32, except on Windows where, due to backcompat, it's UTF-16).
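
To make the 8-bit point concrete, here is a small Python sketch (Python only because it already appears in this thread). It assumes the source text reaches the lexer as UTF-8; the variable names are illustrative only.

```python
name = "问好"

# What an 8-bit char sees: one value per UTF-8 byte.
byte_units = list(name.encode("utf-8"))

# What a code-point-aware lexer would see: one value per character.
code_points = [ord(ch) for ch in name]

print([hex(b) for b in byte_units])
print([hex(cp) for cp in code_points])
```

The two-character identifier arrives as six separate byte values, none of which is the code point of either character, so any per-char test is really a per-byte test.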

The interesting thing is, I'm not entirely sure any version of MiniScript (aside from my Lua lexer) "properly" implements the intended identifier logic (underscore, lowercase letters, uppercase letters, numbers, and any Unicode code point above 0x9F). The C# implementation uses C# char, which is a UTF-16 code unit: surrogate pairs get through because both halves are above 0x9F, but they are not really being decoded as Unicode code points at that level. The C++ implementation uses char, which is only 8 bits. My Lua lexer uses utf8lib to decode at the code point level instead of taking a naive "trust the implementing language to handle indexing a string properly" approach; since Lua strings are just 8-bit-safe byte sequences, I have to. The naive approach does work out for us in C#, since both halves of a surrogate pair get past IsIdentifier.
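
The identifier rule described above can be sketched as follows. This is an illustration of the rule as stated in this comment, not code from any MiniScript implementation; `is_identifier_char` and `utf16_units` are hypothetical names.

```python
def is_identifier_char(cp: int) -> bool:
    """Underscore, ASCII letters, ASCII digits, or any value above 0x9F."""
    return (
        cp == ord("_")
        or ord("a") <= cp <= ord("z")
        or ord("A") <= cp <= ord("Z")
        or ord("0") <= cp <= ord("9")
        or cp > 0x9F
    )

def utf16_units(text: str) -> list:
    """Split text into 16-bit UTF-16 code units, as a C#-style char would see it."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

emoji = "😀"                 # U+1F600, outside the Basic Multilingual Plane
units = utf16_units(emoji)   # a high surrogate followed by a low surrogate

# Both surrogate halves pass the check without ever being decoded
# back into the single code point U+1F600.
print([hex(u) for u in units], all(is_identifier_char(u) for u in units))
```

This shows why the per-unit check happens to accept such characters in a UTF-16 setting even though nothing is decoding them as whole code points.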

All versions accept any characters in strings (except newlines now), only checking for quotes to end the string (or newlines to raise an error), so that was never an issue (by the time you're indexing MiniScript strings, you're in an environment that always interprets Unicode strings properly).
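
A rough Python sketch of why scanning only for the closing quote is safe on UTF-8 input: every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so it can never be mistaken for the ASCII quote (0x22) or newline (0x0A). `scan_string_literal` is a hypothetical, simplified scanner, not the MiniScript lexer, and it ignores details such as doubled quotes inside a string.

```python
QUOTE = 0x22    # ASCII double quote
NEWLINE = 0x0A  # ASCII line feed

def scan_string_literal(src: bytes, start: int) -> bytes:
    """Return the raw bytes between the quote at `start` and the next quote."""
    if src[start] != QUOTE:
        raise ValueError("expected an opening quote")
    pos = start + 1
    while pos < len(src):
        if src[pos] == QUOTE:
            return src[start + 1:pos]
        if src[pos] == NEWLINE:
            raise SyntaxError("newline inside string literal")
        pos += 1
    raise SyntaxError("unterminated string literal")

line = 'hi = "你好"'.encode("utf-8")
body = scan_string_literal(line, line.index(b'"'))
print(body.decode("utf-8"))
```

The scanner never inspects the bytes it copies, so the string contents survive intact as long as whatever reads them later interprets them as UTF-8.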