tweag/nickel

stdlib string functions should work on extended grapheme clusters

Closed this issue · 7 comments

Once #1005 is merged, string.length will count the number of extended grapheme clusters in a (UTF-8) string. This means that characters composed of multiple code points, but still representing a "single character", are counted as one character for the purposes of this function.
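To get a feel for the distinction, here is a sketch in Python, whose built-in `len` counts code points, i.e. the behaviour this issue wants to move away from:

```python
# "é" written as the letter "e" plus U+0301 COMBINING ACUTE ACCENT:
# two code points, but one user-perceived character (one grapheme cluster).
accented = "e\u0301"
assert len(accented) == 2  # code-point count, not grapheme count

# A ZWJ emoji sequence is even more extreme: one visible "character",
# eight code points (two men, a heart, a kiss, three joiners, a selector).
family = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468"
assert len(family) == 8
```

A grapheme-cluster-aware `length` would report 1 for both strings.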

However, other standard library string functions (or more accurately, the string primops) still operate on code points, e.g. string.substring, string.split, etc.

In general, this issue proposes that we should consider the extended grapheme cluster to be the smallest unit of Nickel strings, and all string manipulation functionality should be updated accordingly.

I should mention something that may be relevant for this issue. Under the grapheme-clustering semantics, it is not true that `length (a ++ b) = length a + length b`.

This may still be the appropriate semantics for Nickel strings. It's just something to keep in mind.
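A concrete instance of the non-additivity, sketched in Python: the stdlib has no grapheme segmentation, but NFC normalization shows the same collapsing effect, since two code points that are each a grapheme on their own can merge into a single character when concatenated.

```python
import unicodedata

a = "e"           # one grapheme on its own
b = "\u0301"      # combining acute accent: also one grapheme on its own
combined = a + b  # renders as "é": a single grapheme

# Code-point lengths are additive...
assert len(combined) == len(a) + len(b)
# ...but grapheme counts are not: the concatenation is one user-perceived
# character. Under NFC it even collapses to the single precomposed code
# point U+00E9.
assert unicodedata.normalize("NFC", combined) == "\u00e9"
```

So with grapheme semantics, `length a = 1`, `length b = 1`, but `length (a ++ b) = 1`.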

Noting here that we'll remove string.codepoint and string.from_codepoint as part of this work because:

  1. they break the extended grapheme cluster abstraction,
  2. there are no similar functions for strings in Nix, and
  3. we don't have any concrete use cases for them.

Another note: when looking at string.split I realised we have two options:

  1. `string.split "👨‍❤️‍💋‍👨" "❤️" == ["👨‍", "‍💋‍👨"]`
  2. `string.split "👨‍❤️‍💋‍👨" "❤️" == ["👨‍❤️‍💋‍👨"]`

Or more explicitly: we can choose to break extended grapheme clusters apart on their component code points, or we can keep them together. I tested how Swift - which, imo, has the best extended grapheme cluster support of any mainstream language - handles this, and found that by default it takes the second option (though you can also work with a string's Unicode scalar view, which segments differently).

I think it makes sense to copy this approach for now, especially as we're not currently aware of any use cases for intensive string processing within Nickel.
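To make the difference concrete: Python's `str.split` operates on code points and so exhibits option 1, the cluster-breaking behaviour. A sketch for contrast, not Nickel semantics:

```python
family = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468"  # 👨‍❤️‍💋‍👨
heart = "\u2764\uFE0F"                                                  # ❤️

pieces = family.split(heart)
# The separator matches *inside* the grapheme cluster, splicing it in two,
# each half still carrying a dangling zero-width joiner:
assert pieces == ["\U0001F468\u200D", "\u200D\U0001F48B\u200D\U0001F468"]
```

Option 2 would instead treat the whole cluster as an indivisible unit, find no match, and return the input unchanged as a one-element array.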

I just merged #1200 which partly solves this. The remaining work is around the behaviour of regex search functions, which currently do break up grapheme clusters.
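An illustration of what "breaking up grapheme clusters" means for regex search, sketched with Python's code-point-based `re` module:

```python
import re

family = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468"  # 👨‍❤️‍💋‍👨

# A code-point-level engine happily finds a match *inside* the cluster...
assert re.search("\u2764", family) is not None
# ...and a substitution there splices the sequence apart:
mangled = re.sub("\u2764\uFE0F", "\u2665", family)  # swap in a heart suit
assert mangled != family
```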

@matthew-healy I don't know if we'll have time to finish that before 1.0. Anyway, I think the most important part is already there: Unicode-aware regexes are hopefully more of an edge case. In the meantime, we probably want to update the documentation of the regex-related functions with a big warning saying that they aren't (yet) Unicode-aware.

Yes, very fair. I'll try to spend some time updating the docs later today, as I'm away from tomorrow until Wednesday.

Removing from 1.0 milestone because we agreed that #1200 + #1212 are enough for 1.0, while not strictly closing this issue either.