tc39/proposal-regexp-v-flag

Backwards-compatible syntax

mathiasbynens opened this issue · 46 comments

We could require the u flag (which we would do anyway) and then use \UnicodeSet{…} to introduce new syntax in a backwards-compatible manner, since \U throws.

(We made sure of that here: https://web.archive.org/web/20141214085510/https://bugs.ecmascript.org/show_bug.cgi?id=3157)

/\UnicodeSet{[\p{…}]--[a-z]}/u
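As a quick sanity check, the "\U throws" premise is easy to confirm in a current engine (a sketch, not part of the proposal):

try {
  new RegExp('\\U', 'u'); // \U is a reserved escape under the u flag
} catch (e) {
  console.log(e instanceof SyntaxError); // true
}
console.log(new RegExp('\\U').test('U')); // true: without /u, \U is an Annex B identity escape for "U"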

Nice!

During the November TC39 meeting, @waldemarhorwat expressed concerns w.r.t. backwards-incompatible syntax, and @michaelficarra expressed concerns with introducing a new flag. I believe this solution might address both concerns. Waldemar, Michael, did I get that right?

Correct, this is possible in Unicode regexps using currently unused backslash escapes. Still not totally convinced it's common enough to warrant space in our already quite complex Pattern grammar.

My main concern which I expressed at the meeting is that compatibility concerns with trying to retrofit these into existing character classes might drive us to make syntax that has seriously confusing gotchas, trapdoors, or special cases. I'd prefer syntax be simple and regular. There are several possible ways of getting there. A flag or something like the \UnicodeSet proposed here seem like decent ways of getting there, but there may also be others.

Let me repost my position in general:

Part of what's needed is restricting what can go inside the new-style sets to not include any of the Appendix B cruft or other unescaped special characters with special meanings. An example of what can go wrong is:

Suppose, just for the sake of a thought experiment, we defined a new flag (let's call it F) to indicate that []'s inside regexps have the new behavior and can nest but did not remove the ability to add unescaped /'s inside []. Picking a hypothetical syntax where & is some operator and you can nest []'s, things like this would work:

/foo[/&[abc]&;&\p{xyz}]/F

But then someone might commute things a bit inside the character class:

/foo[[abc]&/&;&\p{xyz}]/F

Oops!

Disallowing unescaped slashes inside new-style sets would solve this.
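For context, a small check in a current engine (a sketch): today's grammar already allows a bare / inside [...] in a regex literal, which is exactly what the example above leans on.

// An unescaped / inside a character class does not terminate the regex literal today:
console.log(/a[/]b/.test('a/b')); // true
// Requiring \/ (or \x2F) inside the new-style sets removes that source of confusion.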

Summarizing discussions we had in the meantime; major bike-shedding here.
(All of this is still to be guarded by the u flag.)

character class prefix

We have been assuming a \USomething{...new syntax...} like the early suggestion of \UnicodeSet{...}.

However, we should not actually use "UnicodeSet" because the proposal we are working towards is noticeably different from the syntax that ICU class UnicodeSet uses, so that term would be confusing.

I suggested \USet{...}. @mathiasbynens thinks that's too short and prefers \UniSet{...}.

I also suggested that the term "set" does not quite fit because in regular expressions these things are usually called "character classes". So we could use something like \UClass{...}. @macchiati chimed in with \UCC{...} for UnicodeCharacterClass.

nested classes

Regardless of the top-level syntax, we propose that nested classes use conventional, simple [character class] syntax. Using the distinguishing syntax for nested classes as well would be way too cumbersome.

Example: /abc \USet{\p{Decimal_Number}--[0-9]} xyz/u

curly braces vs. square brackets

I suggested using square brackets at top level as well, to make the new type of character class look more like the existing one, just with a distinct prefix.

For example, \USet[...new syntax...].

This would also avoid having to treat curly braces (at least }) as special.

However, most of us feel that the pattern of \someLetter{...} with curly braces, as in \p{property} and \u{12345} etc., is quite ingrained, and square brackets would look weird.

stateful modifier

@macchiati pointed out that some regex engines support stateful modifiers inside the pattern string that change the behavior of the whole expression, or of the part of an expression between an "on" flag and an "off" flag. For example, in some engines, (?i) makes the regex case-insensitive.

He suggested that we could use such a modifier to change the syntax and semantics of affected character classes, instead of a per-class prefix.

We would not use (?U) because that has meaning in PCRE. It looks like the letters [aA-E fF gG hH I j kK lL-N oO-Q rR-T u vV W yY zZ] are available. We could use (?C) “class” or (?u) “Unicode” or similar.

Example:
/abc \USet{\p{Decimal_Number}--[0-9]} klm \USet{\p{Other}--\p{Format}--\p{Control}} xyz/u
/(?u)abc [\p{Decimal_Number}--[0-9]] klm [\p{Other}--\p{Format}--\p{Control}] xyz/u

This seems intriguing, but ECMAScript does not currently appear to support any such modifiers, and [class] syntax would differ depending on the presence of an earlier modifier, so this might be a more disruptive change for how the specification is written.

The important thing to determine (re the stateful modifier) is whether the construct (eg /(?p).../) currently causes a syntax error. That would clear the way for adding it.
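That is easy to probe today (a sketch, assuming a current engine):

// An unknown (?X) group is already a SyntaxError, with or without /u,
// so a modifier such as (?p) or (?U) would not collide with any valid pattern.
for (const flags of ['', 'u']) {
  try {
    new RegExp('(?p)abc', flags);
    console.log('parsed'); // not reached in current engines
  } catch (e) {
    console.log(e.name); // "SyntaxError"
  }
}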

Note that the primary use case would be with the stateful modifier as the very first thing in the regex, not embedded partway through the string. That is, I think it should be a non-goal to support part of the regex in "USet mode" and part not in USet mode.

That is, I think there is no real advantage to allowing either syntax to only cover part of the regex.

/abc \USet{\p{Decimal_Number}--[0-9]} xyz/u
could always be restated as
/\USet{abc \p{Decimal_Number}--[0-9] xyz}/u
(although if "abc" were something like a--c, then it might need some escaping).

Regardless of the top-level syntax, we propose that nested classes use conventional, simple [character class] syntax. Using the distinguishing syntax for nested classes as well would be way too cumbersome.

Unless you disallow unescaped / inside character classes, that's incompatible with non-Unicode regular expressions in existing ECMAScript. The problem is that the combination of nested [ and bare / makes it impossible to detect the end of the regular expression before you know whether it is in Unicode mode or not (without nested [ character classes you can find the end of the regular expression without knowing whether it is in Unicode mode). If you can't detect the end of the regular expression, you can't tell whether there is a u flag there. If you can't tell whether there is a u flag after the regular expression, you can't tell whether to parse \U as just a literal U. Thus you end up in a Catch-22 situation.

Unless you disallow unescaped / inside character classes, that's incompatible with non-Unicode regular expressions in existing ECMAScript.

Yes, you have mentioned problems with literal / before. When I wrote the pseudo-spec in our WIP "complete proposal" (linked from issue #12) I excluded / from the ClassCharacter rule there.

Is it correct that an escaped slash, as in \/, is ok? Or would one have to write \x2F?

So far we have “exclude / for JS regex literals (TODO: confirm restrictions)” -- are there other things to watch out for?

\/ is ok.

An example of a problem case might be /USet{[[a-z]/+"]}"/u, which is a chameleon:

  • If you parse it in non-Unicode mode, it's a non-Unicode regular expression without the flag u that matches the literal characters USet{ followed by a character class including [ and lower-case ASCII letters and then appends a string constant "]}" divided by the variable u.
  • If you parse it in Unicode mode, it's a Unicode regular expression with the flag u that matches a USet with nested character classes.

Both are consistent. You can't tell whether the regular expression is in Unicode mode until you find its end, but where it ends depends on whether it's in Unicode mode.

The problem goes away if you either disallow nested character classes or disallow unescaped / inside USets.

Nice example!

\/ is ok.
...
The problem goes away if you either disallow nested character classes or disallow unescaped / inside USets.

We do propose to disallow unescaped / inside the new character class syntax, so it sounds like we are good here.
Please let us know if there are other gotchas like this.

I just gave the core of the example. In one of the chameleon modes it interprets the u closing the regular expression as a variable, so you need to define that.

In Chrome or Firefox:

var u = 1; /USet{[[a-z]/+"]}"/u

produces the output "/USet{[[a-z]/NaN".
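For readers puzzling over that output, here is how today's grammar tokenizes the statement (a sketch):

// The regex literal is /USet{[[a-z]/ : the class is just [[a-z], and the next / ends the literal.
// Division binds tighter than +, so the rest is "]}" / u, i.e. "]}" / 1, which is NaN.
// The expression is therefore regexLiteral + NaN:
console.log(String(/USet{[[a-z]/) + String("]}" / 1)); // "/USet{[[a-z]/NaN"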

Thanks for the example.

The more I think about it, the more I think for backwards compatibility /(?U)…/ is a better choice to signal the new syntax than /\USet{…}/:

  1. As desired, /(?U)…/ will fail on older JavaScript implementations, with or without /u, rather than produce unintended results. /\USet{…}/ without /u will not fail; instead it will silently produce unintended results.

  2. It needs no termination character, and hence no need to escape that terminator inside the …; the span it affects runs to the end of the regex.

Here's a little comparison:

OUTPUT:

> "/USet{…}/"
> null
> "/USet{…}/u"
> "FAIL"
> "/(?U)…/"
> "FAIL"
> "/(?U)…/u"
> "FAIL"

CODE:

console.log("/\USet{…}/");
try {
  console.log(/\USet{[\p{L}&&\p{scx=Grek}]*}/.exec('δ'));
} catch (error) { 
  console.log("FAIL");
}

console.log("/\USet{…}/u");
try {
  console.log(/\USet{[\p{L}&&\p{scx=Grek}]*}/u.exec('δ'));
} catch (error) { 
  console.log("FAIL");
}

console.log("/(?U)…/");
try {
  console.log(/(?U)[\p{L}&&\p{scx=Grek}]*/.exec('δ'));
} catch (error) { 
  console.log("FAIL");
}

console.log("/(?U)…/u");
try {
  console.log(/(?U)[\p{L}&&\p{scx=Grek}]*/u.exec('δ'));
} catch (error) { 
  console.log("FAIL");
}

I'm concerned that /(?U)…/ (or any other choice of letter) would be confusing to those who are familiar with the syntax from other regular expression implementations, including Java, Perl, PCRE, and Python. In those, the letter(s) can indicate any of the supported regex flags, and setting a flag in this way has the same effect as other ways of setting it. With this proposal, U is not a normal flag, U cannot be specified as a flag, and other flags cannot be specified with the /(?U).../ style syntax. It just feels inconsistent.

Going this way would also complicate any future effort to extend ES regexp to support /(?x).../ for flags in general.

I don't share those concerns.

  1. Every regex flavor supports its own choice of letters in (?...), so nobody can expect those to carry over to other regex flavors.
  2. If JavaScript were to support inline (?...) regex flags in the future, it could do so without a problem, as long as U (or whatever we choose) doesn't collide with current flags (which U doesn't).
  3. The advantages of (?U) or similar syntax are substantial.
    a) No terminator is necessary, meaning also that such characters don't need to be escaped inside.
    b) We don't actually need to require the /u — the (?U) mode could imply that (if the committee desires)
    c) Separate UnicodeSets don't need duplicate introducer syntax, so

/\USet{\p{abcd}--\p{defg}}(_\USet{\p{abcd}--\p{defg}})+/u
can be written as the much simpler

/(?U)[\p{abcd}--\p{defg}](_[\p{abcd}--\p{defg}])+/

I also prefer a flag-like (?U) to \USet{...}.
My concern is that (?U) looks exactly like, and acts sort of like, a normal regex mode flag, but it is not one.

If the (?U) syntax works for U, then why not for the other flags (i, m, s, etc.), as it does in other regex flavors?
Why not /[\p{abcd}--\p{defg}]/U,
taking advantage of the existing flag-setting mechanisms and avoiding the introduction of a (?flag) syntax that is new to ES?
var myRe = new RegExp('[\\p{abcd}--\\p{defg}]', 'U'); could work too.
U could also imply u, per your point 3b above.

sffc commented

About flags: I think the comment from @erights at TC39 hit the nail on the head. We added the /u flag to create better Unicode support by processing the string by code points instead of code units. Now the Unicode people are coming back and saying, "oh, now we need strings and nested character classes." What is the roadmap for regular expressions in ECMAScript? When will we be done adding things? If we add a new mode now, we can't guarantee that we won't need "Unicode Mode Part 3" in another 5-10 years.

Personal preference: I prefer the \USet{} notation, because:

  1. It provides a clear separation between traditional, flat, code-point-based character classes and the novel, nested, string-based character classes.
  2. It creates a scope, between the {}, where we can do whatever we want, as long as it doesn't cause problems with the regex lexer.
  3. A proposal to "add a new flag that fundamentally changes how character classes work in ECMAScript" is a lot more intimidating than a proposal to "add a new construct to support set operations on nested sets of characters and strings".

@macchiati As noted in #2 (comment), (?U) has meaning in PCRE, so we should avoid that letter.

About flags: I think the comment from @erights at TC39 hit the nail on the head. We added the /u flag to create better Unicode support by processing the string by code points instead of code units. Now the Unicode people are coming back and saying, "oh, now we need strings and nested character classes." What is the roadmap for regular expressions in ECMAScript? When will we be done adding things? If we add a new mode now, we can't guarantee that we won't need "Unicode Mode Part 3" in another 5-10 years.

Note that this is not just because of “new ideas” but also because of ECMAScript's desire to be 100% backwards compatible, even for what seems like unlikely or unnecessary usage (such as double punctuation). That is a good and understandable goal, but where syntax has not been defined with suitable extension options, it requires some sort of versioning.

ECMAScript also seems to prefer incremental, limited proposals. If one step does not provide for extensions sufficient for what happens to be the next step, then we need to partially start over. We are lucky that the u flag forbids escapes with arbitrary letters, so we can use a new letter for new syntax. We are unlucky that operators and nested classes were not supported earlier, nor readable reserved syntax left open for them.

I don't know about a roadmap for regular expressions because I am not a general regex expert. UTS #18 has a number of things that are useful but not yet supported in ECMAScript; it seems like our current proposals, and the reserved syntax they provide, reasonably cover what's in there. In particular, we are overcoming major hurdles with strings, nested classes, and reserved double punctuation (as well as simplifying the handling of dash and slash). If this were to “hold” for, say, ten years, that seems pretty good.

If someone here knows a lot about regex syntaxes across many engines, it would be useful to list whatever other not-yet-supported syntactic features we might leave open.

Personally, I find the use of an in-pattern modifier as attractive as Mark does. If it would need to be equivalent to an external flag, that seems OK, assuming we cover foreseeable syntax needs. The \Uprefix{...} is clunky, but "I can live with it", as we say in other committee meetings.

From discussion with Waldemar:

\Usomething{...} looks like it should be enough, but if someone omits the /u flag, then these are just characters to be matched. It would also be nice to have the outer scope be [...] not {...}.

A flag or modifier is a possibility, and may be less likely to be forgotten when the character classes otherwise just use [square brackets].

If we want to avoid letters that are modifiers somewhere (see above) or flags in ES regex, then we could pick one from [A-IK-TVWYZafhj-lorvz]. For example, “v is the next u”, and should imply it/build on it.

I also like the idea of using w, as it’s “double-u”.
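As a side note, which single letters are already taken as ES flags can be probed directly (a sketch; the exact result depends on the engine version):

const letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
const taken = [...letters].filter(letter => {
  try { new RegExp('x', letter); return true; } catch { return false; }
});
console.log(taken); // e.g. ['g', 'i', 'm', 's', 'u', 'y'] (newer engines also accept 'd' and 'v')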

sffc commented

If we were to add a new flag, I like v because it is used instead of u in Greco-Roman architecture.

CHICAGO PVBLIC LIBRARY

"twice as unicode" versus "sharper unicode"

sffc commented

If we were to add a flag, I think the flag should not only add the Unicode set notation stuff. The new flag should also do some subset of:

  • Tokenize based on grapheme clusters instead of code points (like in Swift and Perl 6)
  • Add properties of strings
  • (what else is on the UTS 18 wish list?)

If, however, we are only adding the Unicode set notation, then, IMO, it should not be its own flag.

If we were to add a flag, I think the flag should not only add the Unicode set notation stuff. The new flag should also do some subset of:

  • Tokenize based on grapheme clusters instead of code points (like in Swift and Perl 6)
    ...

I think “Tokenize based on grapheme clusters” goes too far.

Tokenizing based on grapheme clusters is an interesting idea, and I brought it up as well. The things that give me pause are its stability over time — some of the official emoji being added are fairly long strings of various Unicode characters. Some of them are also not self-synchronizing, in that you can't tell what's a grapheme if you start searching from the middle of a string. In other words, there exist Unicode characters A, B, C, D, E, etc. such that graphemes (which I denote by the constituent characters of a grapheme enclosed by «») can be constructed either as:

«AB»«CD»«EF»«GH»«IJ»«K»

or:

«BC»«DE»«FG»«HI»«JK»

Country flags are one example of this phenomenon.
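A concrete illustration of the country-flag case, using Intl.Segmenter (where available) only to show where the grapheme boundaries fall; this is a sketch, not part of the proposal:

const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = s => [...seg.segment(s)].map(x => x.segment);
const text = '🇸🇪🇪🇸'; // four regional indicator code points: S, E, E, S
console.log(graphemes(text));                        // ['🇸🇪', '🇪🇸']
console.log(graphemes([...text].slice(1).join(''))); // ['🇪🇪', '🇸']: dropping one code point re-pairs everything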

sffc commented

The things that give me pause are its stability over time

Correct; the v or w mode would be unstable over time, since the definition of grapheme clusters changes in each Unicode release. That's a condition you would need to assume by using the new mode. However, note that \p{}, the basis for all segmentation algorithms, is already exposed (and unstable over time), so we aren't really expanding the surface of unstable constructs.

I think “Tokenize based on grapheme clusters” goes too far.

Tokenizing by grapheme clusters goes too far, but not making code point classes into string classes? :)

What I'm trying to say is that if we add a flag, we should try to do everything right and not leave any unplugged holes.

Tokenizing based on grapheme clusters has all sorts of weird complications, where separate pieces of a pattern expression could combine during matching to match a composed grapheme cluster. I started to look at this as a way to implement Java's canonical equivalence mode in ICU, but it was messy enough that I set it aside.

At the time I was looking at it, some years back, Java's canonical equivalence mode seemed pretty broken if you started poking at edge cases.

Something probably could be done in this area, but it needs to be optional - people still want to be able to match on code points, finding combining marks, for example. And it needs some experimentation before we can say the ideas are ready for any sort of standardization.

sffc commented

Something probably could be done in this area, but it needs to be optional - people still want to be able to match on code points, finding combining marks for example.

u mode is for code points; v mode is the optional extension for grapheme clusters. u mode remains relevant, rather than being subsumed by v.

u mode is for code points; v mode is the optional extension for grapheme clusters. u mode remains relevant, rather than being subsumed by v.

v mode is for nested bracket expressions with -- and && operators, which are very useful with traditional code-point-based matching. Grapheme-cluster-based parsing and/or matching would be a whole separate feature, and would need to be independently settable. It also changes how matching works so fundamentally that I don't think it's anywhere near ready.

I want us to be able to move forward with properties-of-strings and set operators, for which we are converging on a proposal.

I assume that "grapheme cluster tokenization" means things like . matching a whole grapheme cluster. UTS #18 had something like that in "level 3" which was removed last year because of too many issues, lack of demand, and lack of implementation.

In particular, it has to be possible to opt into the features we know we need without also opting into grapheme cluster tokenization.

Is anyone really asking for grapheme cluster tokenization? Is anyone working on it? Sounds like it could take years to figure it out.

Stability: It feels like there is a difference in the degree of destabilization between the set of characters/strings for which some property is true changing (which affects what that property matches) and grapheme cluster/token boundaries shifting when UAX #29 or its CLDR tailoring changes (which changes all matching that is based on grapheme clusters).

sffc commented

Right, by "grapheme cluster tokenization", I meant . matches a grapheme cluster instead of a code point. Just like u mode makes . match a code point instead of a code unit.
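For comparison, the existing precedent is easy to see in a current engine (a small sketch):

// /u changes what . consumes for astral characters: a full code point instead of one code unit.
console.log('😀'.match(/./)[0].length);  // 1 (a lone surrogate code unit)
console.log('😀'.match(/./u)[0].length); // 2 (the whole code point U+1F600, i.e. two code units)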

I think operating on grapheme clusters instead of code points is an interesting idea, but it's too unstable and too big a change to include it in this proposal. It feels like everything else we'd want to enable through this new flag (u-flag features, set operations, properties of strings, possibly literal strings in character classes, and no more Annex B oddities) has a much clearer motivation and has been explored more thoroughly.

One other thing we could include in this new flag is Unicode-aware \w, \d, and \b. I originally proposed this as part of the u flag, but it was rejected out of fear it would hurt adoption of the u flag (see tc39/proposal-regexp-unicode-property-escapes#22 (comment)). We could also take it one step at a time: ban \w, \d, and \b under the new flag for now, and decide on their behavior later.
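To make the \d part concrete, a small sketch using the property escapes that already shipped:

// Today \d is ASCII-only even under /u; a Unicode-aware \d would behave like \p{Nd}.
console.log(/\d/u.test('٣'));     // false (U+0663 ARABIC-INDIC DIGIT THREE)
console.log(/\p{Nd}/u.test('٣')); // true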

Note that in addition to the flag letter bikeshed (v, w, or something else) we also need to decide what the full flag name would be, as it would need to be exposed as a new getter on RegExp.prototype.

For example, the s flag corresponds to the dotAll getter: https://tc39.es/ecma262/#sec-get-regexp.prototype.dotAll

Perhaps the name corresponding to our new v/w flag could be uniSet? Or is there something more general?
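For comparison, the existing flags already follow that pattern; the getter name for the new flag below is purely hypothetical:

console.log(/x/s.dotAll);  // true
console.log(/x/u.unicode); // true
// A new flag would get an analogous boolean getter on RegExp.prototype,
// e.g. something like re.uniSet (the actual name is still to be decided).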

Moving the flag/getter discussion to #14.

As I mentioned during the meeting, v (or whatever we call it) mode should affect only the syntax of [] expressions, not opt into grapheme-only matching for the entire regular expression. The latter causes enough problems that there must be a way to use the more powerful [] set union and intersection expressions without also opting into graphemes.

Grapheme matching is about more than just deciding what . matches. For example, currently /🇸🇪*/u matches "🇸🇪", "🇸🇪🇪🇪", and "🇸🇪🇪🇪🇪🇪" but not "" or "🇸🇪🇸🇪".

With grapheme matching (enabled, say, with a different flag such as w) I'd expect /🇸🇪*/w to match "", "🇸🇪", "🇸🇪🇸🇪", and so on. It's a complicated question whether /🇸🇪+/w would match the inside of "🇪🇸🇪🇸".
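The current /u behavior described above is easy to check; the sketch below anchors the pattern so that "matches" means "matches the whole string":

const re = /^🇸🇪*$/u; // U+1F1F8 followed by zero or more U+1F1EA: * applies only to the last code point
console.log(re.test(''));       // false
console.log(re.test('🇸🇪'));    // true
console.log(re.test('🇸🇪🇸🇪')); // false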

Has the option of expanding the definition of \p{...} been considered? For example, \p{White_Space--Line_Break=Glue} seems pretty nice.

\p{...} is also short enough that we could consider using it for nested character classes to avoid having to add a whole new flag.

Closing this issue. I think we have firmly settled on using a new flag.

There are other issues for bike-shedding on the exact flag and getter, and further details.

pygy commented

This can be implemented with today's engines in a way that is completely backwards-compatible.

import {charSet} from 'compose-regexp'

const LcGrekLetter = charSet.intersection(/\p{Lowercase}/u, /\p{Script=Greek}/u)
LcGrekLetter.test("Γ") // false
LcGrekLetter.test("γ") // true
LcGrekLetter.test("x") // false

console.log(LcGrekLetter) // /(?!(?!\p{Script=Greek})\p{Lowercase})\p{Lowercase}/u
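The same composed pattern can also be written by hand with plain lookaheads, with no library involved (a sketch using the standard property escapes; the variable names are just for illustration):

const lower = '\\p{Lowercase}';
const greek = '\\p{Script=Greek}';
// intersection(lower, greek) = lower minus (lower minus greek) = (?!(?!greek)lower)lower
const LcGrekLetterByHand = new RegExp(`(?!(?!${greek})${lower})${lower}`, 'u');
console.log(LcGrekLetterByHand.test('γ')); // true
console.log(LcGrekLetterByHand.test('Γ')); // false
console.log(LcGrekLetterByHand.test('x')); // false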

Engines could easily detect patterns like this and do character range operations under the hood. The core pattern is:

// notAhead is negative lookAhead
function csDiff(a, b) {return sequence(notAhead(b), a)}

Intersection can be optimized with the same logic:

function csInter(a, b) {return sequence(notAhead(csDiff(a, b)), a)}

Union is just (a, b) => /a|b/, which once again can be optimized if both alternatives match exactly one code point.
