tc39/proposal-regexp-v-flag

Interaction with properties of strings a.k.a. sequence properties

mathiasbynens opened this issue · 8 comments

It’s an explicit goal of this proposal to figure out how set notation interacts with properties of strings.

Here’s the high-level syntax again:

// difference/subtraction
[A--B]

// intersection
[A&&B]

// nested character class
[A--[0-9]]

You can imagine A and B being properties of strings. In that case, we’d get e.g.

// match all multi-code-point emoji sequences:
[\p{RGI_Emoji}--\p{Emoji}]

@msaboff mentioned he’d prefer different syntax depending on how the two proposals interact. Michael, could you elaborate?

I don't want to use character class syntax [..] for property of strings.

Because properties of strings are not character classes and aren't matched the same as character classes. They are predefined alternations.

While it is true that some set operations involving properties of strings resolve to a set of single code points, that isn't the general case.

Alternations support sets of single characters as well as sets of strings. Character class semantics are inappropriate to sets of strings. That is why I want a different syntax for the different semantics. I believe it will reduce developer confusion as well as syntax issues with code already in the wild.

Character class semantics are inappropriate to sets of strings.

There was a time when I thought this, too, looking at the ICU class UnicodeSet, which is an implementation of character classes. It started out as a set of code points (initially limited to 16 bits, like all of ICU and most other Unicode libraries in the 90s) and got extended over time (full Unicode, set operations, strings).

When the addition of multi-character strings was first proposed, I was also scratching my head. I think Mark Davis and Alan Liu had a need for such strings in ICU Transliterator rules which use UnicodeSet patterns as part of their syntax.

At the time, I was working on lower-level code, such as character conversion. One of the APIs there returns a UnicodeSet of the characters that a converter supports. As it turns out, we had to add support for character sets that had single codes corresponding to sequences of two or more Unicode code points. For example, IBM EBCDIC Japanese codepage 1390 encodes things like <0254 0300> (LATIN SMALL LETTER OPEN O + COMBINING GRAVE ACCENT) and <304B 309A> (HIRAGANA LETTER KA + COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK). (Look for >< in ICU ibm-1390_P110-2003.ucm)

We have a similar API for a Collator, returning a UnicodeSet for those characters and strings that are tailored compared with the Unicode default sort order. It returns strings for contractions, such as "ch" in Slovak.

Also consider that Unicode does not always encode a "character" as a single code point. Some "characters" are documented formally as having been encoded as sequences, and are given names just like "regular characters". Indeed, UCD NamedSequences.txt includes LATIN SMALL LETTER OPEN O WITH GRAVE;0254 0300 and HIRAGANA LETTER BIDAKUON NGA;304B 309A — same as in the IBM Japanese codepage — in sections titled "Entries for JIS X 0213 compatibility mapping".

Note that some regex implementations support \N{character name}. \N{HIRAGANA LETTER BIDAKUON NGA} resolves to a two-code-point sequence, which means that full support for this syntax inside character classes requires that we support multi-character strings there.

Related: Unicode CLDR provides data for the "exemplar characters" of many languages. This necessarily includes multi-character strings such as {x̣} (using braces as in ICU UnicodeSet). Look for "clusters" in the CLDR 38 data.

The Unicode regex spec also suggests support for exemplar characters in regex properties and character classes: \p{Exemplar_Main=fil} — "The main exemplar characters for Filipino: [a-nñ \q{ng} o-z]"

Finally, when we encoded emoji, some were represented as sequences from the start (e.g., keycaps & flags), and many others were added later (skin tone variations, gender variations, ...). Users still perceive them as single units.

In other words, because the character encoding model does not limit what users think of as "characters" to single code points, implementations that limit a "set of characters" to just single code points are incomplete.

Most characters are encoded as single code points, and thus most character classes, and most UnicodeSet instances, contain only those. However, when you need a way to handle any and all "characters", having to put some of them into an auxiliary structure is very awkward.

ICU class UnicodeSet got support for multi-character strings in 2002 (ICU-1749). This has been entirely successful.

+1 to all that. Thanks for putting it so eloquently.

Drawing the line at the code point boundary is as arbitrary as drawing the line at the UTF-16 code unit boundary. We don't need to do it, and we certainly don't have to introduce new syntax just to ensure we preserve this boundary.

As of the May 25, 2021 TC39 meeting, this proposal officially subsumes the properties of strings proposal; PR #29 notes this in the readme. The draft spec text covers both set notation and properties of strings (as well as string literals), and the combined proposal advanced to stage 2.