Notes from TC39 after-hours discussion
sffc opened this issue · 9 comments
I chatted with @wycats and @gibson042 with regard to sequence properties after the TC39 meeting this week. CC @mathiasbynens, @macchiati, @markusicu.
One intuition that we had when thinking about sequence properties was that users might like to think of sequence properties as describing grapheme clusters. For example, a sequence property like RGI_Emoji_ZWJ_Sequence would describe a single emoji grapheme. This also leads intuitively to the negation of sequence properties: it would match any grapheme that is not described by the sequence property.
However, my understanding is that this is not the mental model used in the Unicode proposal. That mental model is that the sequence properties may or may not describe grapheme clusters, and by its nature, the negation is meaningless.
One aspect of sequence properties as proposed which I find confusing is that it seems the sequence properties are not necessarily "greedy". For example, if you had Emoji-ZWJ-Emoji, would the sequence property be just as happy matching just the first Emoji as it would matching the whole grapheme? I find that behavior nontrivial to reason about. If this is true, I wonder if you've considered making a greedy and non-greedy mode?
Another idea that was brought up was to add a "grapheme mode" to regular expressions, similar to the "unicode mode" that operates on code points rather than code units. In this new mode, sequence properties would behave basically the same as code point properties, including the ability to negate them. That would be out of scope for this proposal, but it's something to keep in mind to make sure that the design of this proposal would be compatible with a possible future grapheme mode.
(Just edited this; I'd miswritten the first time around.)
The way I think of the finite sequences (like what is being proposed) is simply a longest-first alternation, like (aZb|aZc|a|...). It then has all of the characteristics of that alternation, including that there isn't an obvious negation. The emoji properties are just the equivalent of such sets of characters.
You could get the same effect as your grapheme cluster negation with look-ahead, however.
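To make the alternation model concrete, here is a minimal sketch in plain JavaScript. The string set, `escapeRegExp`, and `longestFirstAlternation` are hypothetical names introduced only for illustration; they stand in for a sequence property and are not part of the proposal. The last pattern shows one possible reading of "negation with look-ahead": a single code point at a position where no listed sequence starts.

```javascript
// Hypothetical stand-in for a sequence property: a small hand-picked set of strings.
const sequences = ['a', 'aZb', 'aZc'];

// Escape regex metacharacters so each string is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Build a longest-first alternation such as (?:aZb|aZc|a).
function longestFirstAlternation(strings) {
  const sorted = [...strings].sort((x, y) => y.length - x.length);
  return `(?:${sorted.map(escapeRegExp).join('|')})`;
}

const alternation = longestFirstAlternation(sequences);
const longestFirst = new RegExp(alternation, 'u');
const shortestFirst = new RegExp('(?:a|aZb|aZc)', 'u');

console.log(longestFirst.exec('aZb')[0]);  // 'aZb'
console.log(shortestFirst.exec('aZb')[0]); // 'a'

// Negation via look-ahead (an assumption about what is meant above):
// match one code point at a position where no listed sequence begins.
const notSequence = new RegExp(`(?!${alternation})[\\s\\S]`, 'u');
console.log(notSequence.exec('aZb')[0]); // 'Z'
```

The ordering matters because a JavaScript alternation tries its branches left to right and keeps the first one that lets the overall pattern succeed.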
One aspect of sequence properties as proposed which I find confusing is that it seems the sequence properties are not necessarily "greedy". For example, if you had Emoji-ZWJ-Emoji, would the sequence property be just as happy matching just the first Emoji as it would matching the whole grapheme? I find that behavior nontrivial to reason about. If this is true, I wonder if you've considered making a greedy and non-greedy mode?
IMHO, the most useful behavior would be to do a greedy match (longest-first alternation). That is, in the following example, I'd expect the entire ZWJ sequence to be matched, and not just the U+1F468 (even though a string containing that code point by itself would still be matched):
const re = /^\p{RGI_Emoji_ZWJ_Sequence}$/u;
re.test('👨🏾‍⚕️'); // '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F'
// → true + matches the entire sequence
re.test('👨'); // '\u{1F468}'
// → true
@macchiati I'd love to hear more about why your intuition is shortest-first. (It was a typo. Phew!)
Suppose that you wanted to match "a string that starts with an emoji and contains a nonzero number of other characters afterward". You might write the expression,
/^\p{RGI_Emoji_ZWJ_Sequence}.+$/u
However, this would not have the expected behavior on the dark-skin-doctor emoji. It would split the emoji up into pieces:
const re = /^(\p{RGI_Emoji_ZWJ_Sequence}).+$/u;
const match = re.exec('👨🏾‍⚕️'); // '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F'
Actual behavior: the regex matches, and match[1] == \u{1F468} (or whatever is the longest valid substring of the emoji). I postulate that a more intuitive behavior for lay people would be for the regex to not match, if you are thinking in the mindset of emoji being single entities that can't be broken up.
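The splitting described here can be reproduced today with an explicit alternation, as a rough sketch. The two-entry set below is a hypothetical stand-in for \p{RGI_Emoji_ZWJ_Sequence} that mirrors the assumption in this thread (that the lone U+1F468 is also matched); it is not the real Unicode data.

```javascript
// The doctor emoji from the example above, written as escapes.
const doctor = '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F';

// Hypothetical stand-in for the sequence property: a longest-first
// alternation over a hand-picked set, following this thread's assumption.
const standIn = [doctor, '\u{1F468}'].join('|');

const re = new RegExp(`^(${standIn}).+$`, 'u');
const match = re.exec(doctor);

// The full sequence is tried first, but then `.+` has nothing left to
// consume, so the engine backtracks to the shorter alternative.
const captured = [...match[1]].map(
  (cp) => 'U+' + cp.codePointAt(0).toString(16).toUpperCase()
);
console.log(captured); // [ 'U+1F468' ]
```

Longest-first ordering alone therefore does not keep an emoji intact once the rest of the pattern forces backtracking.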
I think what is more natural depends on the context. Sometimes you want to require the position after the \p{RGI_Emoji_ZWJ_Sequence} (or before it) to be on a grapheme cluster boundary, sometimes not.
That isn't limited to the use of these sequences: there are other circumstances where you want to ensure that a particular point is on a particular kind of boundary (grapheme, word, linebreak, etc).
If you have \X (as in Perl, to match extended grapheme clusters), then I think you can use zero-width positive lookbehind to check boundaries. Doable, but a bit clunky. (This is off the top of my head; others should check.)
Much simpler would be to add some syntax to check grapheme cluster boundaries, such as the \b{g} suggested in UTS #18 (but not yet implemented by any regex engine I know of).
Then you could have expressions like:
/\b{g}\p{RGI_Emoji_ZWJ_Sequence}\b{g}.+$/u
but also other expressions:
/\b{g}(an|a|the|🇺🇿|🇺)\b{g}.+$/u
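Since \b{g} is not available in JavaScript regular expressions, a boundary check can be approximated outside the regex. The sketch below uses Intl.Segmenter with grapheme granularity. Treating it as a substitute for \b{g} is an assumption of this example, and `isGraphemeBoundary` and `execOnGraphemeBoundaries` are hypothetical helper names.

```javascript
// Intl.Segmenter splits text into extended grapheme clusters.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// True if `index` (in UTF-16 code units) falls on a grapheme cluster boundary.
function isGraphemeBoundary(text, index) {
  if (index < 0 || index > text.length) return false;
  if (index === 0 || index === text.length) return true;
  return segmenter.segment(text).containing(index).index === index;
}

// Hypothetical helper: accept a match only if both of its ends sit on
// grapheme cluster boundaries of the whole input string.
function execOnGraphemeBoundaries(re, text) {
  const match = re.exec(text);
  if (match === null) return null;
  const start = match.index;
  const end = start + match[0].length;
  return isGraphemeBoundary(text, start) && isGraphemeBoundary(text, end)
    ? match
    : null;
}

const doctor = '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F';
console.log(isGraphemeBoundary(doctor, 2));                  // false
console.log(execOnGraphemeBoundaries(/\u{1F468}/u, doctor)); // null
console.log(execOnGraphemeBoundaries(/\u{1F468}/u, '\u{1F468} ok') !== null); // true
```

Unlike a real \b{g} assertion, this post-filter does not make the engine backtrack to another candidate; it only rejects the first match found.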
Notes from UTC:
RPR = Roozbeh Pournader
MSH = Markus Scherer
SFC = Shane F. Carr
MED = Mark E. Davis
MGR = Manish Goregaokar
BYG = Benjamin Yang
RPR: Grapheme clusters depend on language, font, etc. I would avoid them in nuanced software.
MSH: Grapheme clusters aren't stable, which could cause problems. About properties of strings, they could include strings that have multiple grapheme clusters. I prefer what Mark Davis had suggested.
SFC: What's an
MED: Exemplar characters for French.
MSH: In Slovak, we have ch.
MED: Grapheme cluster boundaries have a defined symbol.
RPR: Relying on grapheme cluster boundaries is problematic.
MED: There are two problems. First, the definition of grapheme boundaries can change over time. Second, the boundary might not be intuitive to the user.
RPR: The graphemes could be customized even in CLDR, right?
MSH: It's stable for a certain implementation.
BYG: Do we mean extended grapheme clusters too?
MED/RPR: Yes
SFC: On the regex boundary syntax, \b{g}, would the boundaries refer to the whole string, or to substrings during the execution of the regex?
MED: Boundary requires context.
MSH: I would say, boundaries on the string provided to the regex engine.
MGR: Flag emoji are a good example.
MSH: \b{g} is defined in UTS 18 in Section 2.2 as Unicode extended grapheme cluster boundary.
We have discussed grapheme cluster semantics more recently and decided that they are out of scope for the current joint proposals.
Can we close this issue?
Closing the issue since we're aligned in the decision, and the rationale is clearly documented in the above discussion.