aardappel/lobster

Indexing into strings with special characters doesn't work properly

Hjagu09 opened this issue · 2 comments

example

print "å"[0]
print string_to_unicode("å")

output

195
[229]

expected output

229
[229]

testing with more characters gives me this:

  • 195 for åäöøæ
  • 226 for ←↓↑→
  • 194 for ¹²³ª

The same thing happens when iterating over a string with a for loop.

I'm afraid this is working as intended: indexing works by byte, not by Unicode character.
Strings use a UTF-8 representation, in which one character takes between 1 and 4 bytes, so O(1) indexing by character would not be possible.

This is exactly the reason we have string_to_unicode: to turn it into a vector, which is indexable by unicode code point.
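
To illustrate what such a conversion involves, here is a rough C++ sketch of decoding UTF-8 bytes into a vector of code points. It is not Lobster's actual implementation: `decode_utf8` is a hypothetical name, and the sketch stops at the first malformed sequence and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper for illustration only (not Lobster's code):
// decodes UTF-8 bytes into code points, stopping at the first
// malformed or truncated sequence.
std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::uint32_t cp;
        std::size_t len;
        // The lead byte tells us how many bytes the character occupies.
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else break;                      // invalid lead byte
        if (i + len > s.size()) break;   // truncated sequence
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return out;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For the bytes of "å" (0xC3 0xA5) this yields a single element, 229, which can then be indexed in O(1).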

If you index a C++ std::string, you'll get the same result. Much like C++, a Lobster string does not promise that its contents are UTF-8 (we use strings for arbitrary binary buffers), only that if you store string data in it, it will be UTF-8.
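
A small C++ sketch of that comparison, assuming the string holds the UTF-8 bytes of "å" written out explicitly as 0xC3 0xA5 (`byte_at` is a hypothetical helper name):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte at index i as a value in 0..255.
// The cast through unsigned char avoids negative values on platforms
// where plain char is signed.
unsigned byte_at(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// The UTF-8 encoding of "å" (U+00E5), spelled out byte by byte so the
// result does not depend on the source file's encoding.
const std::string a_ring = "\xC3\xA5";
```

Here `a_ring.size()` is 2 and `byte_at(a_ring, 0)` is 195, the same value the Lobster example prints.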

Thank you, this issue can be closed