Consideration for Perl-like `(?[])` extended character classes instead of a flag
rbuckton opened this issue · 5 comments
I've been researching regular expression syntax in various languages and engines to inform possible future proposals to expand the ECMAScript regular expression syntax. One of the features I've been reviewing is Perl's Extended Bracketed Character Classes, which support operations such as:
- Intersection (`&`)
- Union (`+` or `|`)
- Subtraction (`-`)
- Symmetric Difference (`^`)
- Complement (`!`)
- Grouping (`(`, `)`)
In this case, such a character class uses the tokens `(?[` and `])`. The contents of the expression can contain the above tokens, whitespace (which is ignored), character classes, metacharacters (such as `\p{..}`, `\s`, etc.), and certain escape sequences (such as `\x0a`, etc.). This allows you to write complex character classes like the following (based on the examples in the explainer):
# non-ASCII digits
(?[ \p{Decimal_Number} - [0-9] ])
# spans of word/identifier letters of specific scripts
(?[ \p{Script=Khmer} & [\p{Letter}\p{Mark}\p{Number}] ])
# breaking spaces
(?[ \p{White_Space} - \p{Line_Break=Glue} ])
# non-ASCII emoji
(?[ \p{Emoji} - \p{ASCII} ])
As well as classes like the following (from the perlre documentation):
# Matches digits in the Thai or Laotian scripts
(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])
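For comparison, some of the classes above can be approximated in ECMAScript today by pairing a lookahead with an ordinary character class under the `u` flag. The following is only a rough sketch of that workaround, not something taken from the explainer or from Perl; the constant names are made up for illustration, and `\p{Decimal_Number}` is assumed here to stand in for Perl's `\p{Digit}`.

```typescript
// Subtraction (A - B): a negative lookahead rejects B before matching A.
// Approximates: (?[ \p{Decimal_Number} - [0-9] ])
const nonAsciiDigit = new RegExp("(?![0-9])\\p{Decimal_Number}", "u");

// Intersection (A & B): a positive lookahead requires A, then B is matched.
// Approximates: (?[ \p{Script=Khmer} & [\p{Letter}\p{Mark}\p{Number}] ])
const khmerWordChar = new RegExp(
  "(?=\\p{Script=Khmer})[\\p{Letter}\\p{Mark}\\p{Number}]",
  "u"
);

// Union plus intersection: the union is a plain character class.
// Approximates: (?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])
// (assumption: \p{Decimal_Number} is used in place of Perl's \p{Digit})
const thaiOrLaoDigit = new RegExp(
  "(?=\\p{Decimal_Number})[\\p{Script=Thai}\\p{Script=Lao}]",
  "u"
);

console.log(nonAsciiDigit.test("5"));       // ASCII digit
console.log(nonAsciiDigit.test("\u0665"));  // ARABIC-INDIC DIGIT FIVE
console.log(khmerWordChar.test("\u1780"));  // KHMER LETTER KA
console.log(thaiOrLaoDigit.test("\u0E55")); // THAI DIGIT FIVE
```

The lookahead form works, but each operation has to be restructured by hand and nested expressions quickly become hard to read, which is part of the appeal of a dedicated set notation.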
Currently, `(?[` is not valid RegExp syntax (with or without the `u` flag), so it provides an opportunity to add syntax to cover set notation functionality without needing to introduce a new flag.
Previous discussion: #2
Also, we are in the process of expanding the scope of our proposal slightly to fix problems with some existing syntax and semantics -- and that requires a new flag (which in turn gives us an opportunity to fix such problems).
We had looked at the experimental Perl syntax; it is syntactically a real outlier compared with how other regex engines have extended their syntax.
I do like the Perl syntax if we were to revisit the syntax-only route.