Supported Modifier Flags
rbuckton opened this issue · 10 comments
In the October 2021 plenary, @michaelficarra asked that we outline and provide motivating examples for each flag we are considering as a supported modifier.
The flags currently under consideration are:
i — ignore-case
- Rationale — Toggling ignore-case is especially useful when matching patterns with varying case sensitivity, or when parsing patterns provided via JSON configuration. Especially useful when working with complex Unicode character ranges.
- Example — Match upper case ASCII letter followed by upper or lower case ASCII letter or '
  const re = /^[A-Z](?i)[a-z']+$/;
  re.test("O'Neill"); // true
  re.test("o'neill"); // false
  // alternatively (defaulting to ignore-case):
  const re2 = /^(?-i:[A-Z])[a-z']+$/i;
- Example — Match word starting with D followed by word starting with D or d (from .NET documentation, see 1)
  const re = /\b(D\w+)(?ix)\s(d\w+)\b/g;
  const input = "double dare double Double a Drooling dog The Dreaded Deep";
  re.exec(input); // ["Drooling dog", "Drooling", "dog"]
  re.exec(input); // ["Dreaded Deep", "Dreaded", "Deep"]
m — multiline
- Rationale — Flexibility in matching beginning-of-buffer vs. beginning-of-line or end-of-buffer vs. end-of-line in a complex pattern.
- Example — Match a frontmatter block at the start of a file
  const re = /^---(?m)$((?:^(?!---$).*$)*)^---$/;
  re.test("---a"); // false
  re.test("---\n---"); // true
  re.test("---\na: b\n---"); // true
s — dot-all (i.e., "single line")
- Rationale — Control over . matching semantics within a pattern.
- Example
  const re = /a.c(?s:.)*x.z/;
  re.test("a\ncx\nz"); // false
  re.test("abcdxyz"); // true
  re.test("aBc\nxYz"); // true
x — Extended Mode. This flag is proposed by https://github.com/tc39/proposal-regexp-x-mode
- Rationale — Would allow control over significant whitespace handling in a pattern.
- Example — Disabling x mode when composing a complex pattern:
  // String.raw keeps the backslash in \d; the space is required
  const idPattern = String.raw`[a-z]{2} \d{4}`;
  const re = new RegExp(String.raw`
    # match the id
    (?<id>(?-x:${idPattern}))
    # match a separator
    :\s
    # match the value
    (?<value>\w+)
  `, "x");
  re.exec("aa0123: foo")?.groups; // undefined
  re.exec("aa 0123: foo")?.groups; // { id: "aa 0123", value: "foo" }
Flags likely too complex to support:
u — Unicode. This flag affects how a pattern is parsed, not how it is matched. Supporting it would likely require a cover grammar and additional static semantics.
v — Extended Unicode. This flag is proposed by https://github.com/tc39/proposal-regexp-set-notation as an extension of the u flag and would have the same difficulties.
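To make the parsing point concrete, here is a minimal sketch (not taken from the proposal) that compiles the same source text with and without u. The helper name `parses` and the sample patterns are made up for illustration.

```javascript
// Hypothetical helper: reports whether `source` compiles under `flags`.
function parses(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return false;
  }
}

// The same source text means different things depending on `u`.
const source = String.raw`\u{61}`;
const legacy = new RegExp(source);       // the letter "u" repeated 61 times
const unicode = new RegExp(source, "u"); // code point escape for "a"

console.log(legacy.test("a"));             // false
console.log(legacy.test("u".repeat(61))); // true
console.log(unicode.test("a"));            // true

// Some source text is only valid in one of the two grammars.
console.log(parses("a{", ""));  // true: a lone brace is a literal
console.log(parses("a{", "u")); // false: SyntaxError in u mode
```

Both differences are decided while the pattern is compiled, before any input is matched.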
Flags that will never be supported:
g — Global. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid pattern would have no effect.
y — Sticky. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid pattern would have no effect.
d — Indices. This flag affects the match result. Changing it mid pattern would have no effect.
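A small sketch of why these three flags sit outside the pattern itself. The input string and variable names are made up, and the d flag is assumed to be available in the engine.

```javascript
const input = "ab ab";

// Without g or y, lastIndex is ignored and the search starts at 0.
const plainRe = /ab/;
plainRe.lastIndex = 3;
const plainMatch = plainRe.exec(input);

// With g, the search starts at lastIndex and scans forward.
const globalRe = /ab/g;
globalRe.lastIndex = 1;
const globalMatch = globalRe.exec(input);

// With y, the match must begin exactly at lastIndex.
const stickyRe = /ab/y;
stickyRe.lastIndex = 1;
const stickyMatch = stickyRe.exec(input);

// With d, only the shape of the result changes (an `indices` array is added).
const indicesMatch = /a(b)/d.exec(input);

console.log(plainMatch.index);     // 0
console.log(globalMatch.index);    // 3
console.log(stickyMatch);          // null
console.log(indicesMatch.indices); // [[0, 2], [1, 2]]
```

In each case the pattern text is matched the same way; only the starting index or the returned object differs.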
Footnotes
For the examples, can you share how you'd do it without the relevant proposal?
i
Simple cases like /[A-Z][A-Za-z]/ are trivial:
// match an uppercase ASCII letter followed by a mixed-case ASCII letter
// with 'i' modifier:
/[A-Z](?i)[A-Z]/
// without 'i' modifier:
/[A-Z][A-Za-z]/
However, more complex cases are far from trivial:
// match a mixed case "hello" followed by the exact characters "World"
// with 'i' modifier:
/(?i:hello) World/
// without 'i' modifier:
/[Hh][Ee][Ll][Ll][Oo] World/
m
If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode:
// with 'm' modifier:
/^---(?m)$((?:^(?!---$).*$)*)^---$/
// without the 'm' modifier, in 'u' mode:
/\A---$((?:^(?!---$).*$)*)^---$/mu
// without the 'm' modifier, not in 'u' mode: not possible to invert when in 'm' mode
s
It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:
// match /a.b/ outside of 's' mode, then /.+/ in 's' mode, then /c.d/ outside of 's' mode
// with 's' modifier
/a.b(?s:.)+c.d/
// without 's' modifier
/a.b(?:.|[\r\n\u2028\u2029])+c.d/
// match /a.b/ inside of 's' mode, then /.+/ outside of 's' mode, then /c.d/ inside of 's' mode
// with 's' modifier
/a.b(?-s:.+)c.d/s
// without 's' modifier
/a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s
There's nothing with [^\s\S] for the dotAll case?
I'm not sure I understand what you mean. Can you clarify?
If you mean using [\s\S] to match everything, that's feasible for the first s example, sure. I don't see how it helps with the second example though.
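For reference, a minimal sketch of that [\s\S] substitution for the first s example. The sample strings are made up, and the modifier version is only compiled when the engine accepts the syntax, so the comparison is skipped elsewhere.

```javascript
// Stand-in for /a.b(?s:.)+c.d/ that avoids modifier syntax.
const emulated = /a.b[\s\S]+c.d/;

// Compile the modifier version only if this engine accepts the syntax.
let native = null;
try {
  native = new RegExp("a.b(?s:.)+c.d");
} catch (err) {
  // Modifiers are not supported here; only the emulation is exercised.
}

const samples = ["a-b\n\nc-d", "a\nb\n\nc-d", "a-bc-d", "a-b--c-d"];
const outcomes = samples.map((sample) => {
  const result = emulated.test(sample);
  const agrees = native === null || native.test(sample) === result;
  return { sample, result, agrees };
});

for (const { sample, result, agrees } of outcomes) {
  console.log(JSON.stringify(sample), result, agrees);
}
```

The second sample fails because the dots outside the [\s\S] still exclude line terminators, which is the part the s modifier leaves untouched.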
I just want to share a little trick to emulate m and non-m mode without using ^ and $. This might be relevant for transpilers.
- /^ $/ == /(?<![\s\S]) (?![\s\S])/
- /^ $/m == /(?<!.) (?!.)/ // no `s` flag!
This works for both u and non-u mode.
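A quick way to sanity-check these equivalences is to compare the positions at which each form matches. This is only a sketch: the helper `positions` and the sample input are made up, and lookbehind support is assumed.

```javascript
// Hypothetical helper: indices of every match of `source` in `input`.
const positions = (source, flags, input) =>
  Array.from(input.matchAll(new RegExp(source, "g" + flags)), (m) => m.index);

const input = "ab\ncd\r\nef";

// [anchor, anchor flags, emulation, emulation flags]
const pairs = [
  ["^", "", String.raw`(?<![\s\S])`, "m"], // non-m ^ inside an m regex
  ["$", "", String.raw`(?![\s\S])`, "m"],  // non-m $ inside an m regex
  ["^", "m", "(?<!.)", ""],                // m-mode ^ inside a non-m regex
  ["$", "m", "(?!.)", ""],                 // m-mode $ inside a non-m regex
];

const results = [];
for (const unicode of ["", "u"]) {
  for (const [anchor, anchorFlags, emulation, emulationFlags] of pairs) {
    const expected = positions(anchor, anchorFlags + unicode, input);
    const actual = positions(emulation, emulationFlags + unicode, input);
    const same = JSON.stringify(expected) === JSON.stringify(actual);
    results.push({ anchor, anchorFlags, unicode, expected, same });
    console.log(`/${anchor}/${anchorFlags}${unicode}`, expected, same);
  }
}
```

For this input the m-mode ^ matches at 0, 3, 6 and 7, since both \r and \n count as line terminators.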
The modifiers supported by this proposal will be limited to i, m, and s. These may be potentially changed by future proposals (such as the x-mode proposal), but doing so is out of scope.
@rbuckton, I know this is already closed (and implemented in V8, yay!), but for interest's sake, note that it's very possible to emulate presence or lack of s and m. I do it in regex-make to locally apply the presence or absence of local flags for RegExp instances interpolated into a template.
> m
> If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode
Emulating is possible without u mode or buffer boundaries.
- Emulate an m mode ^: (?<=^|[\n\r\u2028\u2029])
- Emulate a non-m mode ^: (?<![^])
- Emulate an m mode $: (?=$|[\n\r\u2028\u2029])
- Emulate a non-m mode $: (?![^])
> s
> It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:
> [...]
> // without 's' modifier
> /a.b(?:.|[\r\n\u2028\u2029])+c.d/
> [...]
> // without 's' modifier
> /a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s
It's easier than that:
- Emulate an s mode .: [^]
- Emulate a non-s mode .: [[^]--[\n\r\u2028\u2029]] (with v) or (?:(?![\n\r\u2028\u2029]).) for a less efficient version without v (same as you showed in the quote).
Note that, like your (?:.|[\r\n\u2028\u2029]) example, [^] either matches full code points or doesn't based on the presence of flag u/v.
> - Emulate a non-s mode .: [[^]--[\n\r\u2028\u2029]] (with v) or (?:(?![\n\r\u2028\u2029]).) for a less efficient version without v (same as you showed in the quote).
Or [^\n\r\u2028\u2029].
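As a rough check of the suggestions above, the sketch below exercises [^] as an s-mode dot and [^\n\r\u2028\u2029] as a non-s dot inside a regex that carries the s flag. Variable names and sample strings are made up.

```javascript
const lineTerminators = ["\n", "\r", "\u2028", "\u2029"];

// s-mode dot without the s flag: [^] also matches line terminators.
const anyChar = /^[^]$/;
// non-s dot in a regex that has the s flag, in the two forms discussed.
const nonDotAll = /^[^\n\r\u2028\u2029]$/s;
const viaLookahead = /^(?:(?![\n\r\u2028\u2029]).)$/s;

const anyCharMatchesAll = lineTerminators.every((ch) => anyChar.test(ch));
const nonDotAllRejectsAll = lineTerminators.every(
  (ch) => !nonDotAll.test(ch) && !viaLookahead.test(ch)
);

// Stand-in for /a.b(?-s:.+)c.d/s that avoids modifier syntax.
const emulated = /a.b[^\n\r\u2028\u2029]+c.d/s;
const crossesOutside = emulated.test("a\nb--c\nd"); // outer dots may match \n
const crossesInside = emulated.test("a\nb-\n-c\nd"); // the middle run may not

console.log(anyCharMatchesAll);   // true
console.log(nonDotAllRejectsAll); // true
console.log(crossesOutside);      // true
console.log(crossesInside);       // false
```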
FWIW: For the example “Match a frontmatter block at the start of a file” to work, line terminators have to be consumed. This works for me:
import assert from 'node:assert/strict';

const re = /(?-m:^)---\r?\n((?:^(?!---$).*\r?\n)*)^---$/m;
assert.equal(re.test('---a'), false);
assert.equal(re.test('---\n---'), true);
assert.equal(
  re.exec('---\n---')[1],
  ''
);
assert.equal(
  re.exec('---\na: b\n---')[1],
  'a: b\n'
);
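For engines that do not parse modifier syntax yet, the same pattern can probably be written with the (?<![^]) lookbehind mentioned earlier in the thread standing in for (?-m:^). This is a sketch under that assumption, not a tested drop-in replacement; the variable names are made up.

```javascript
// (?<![^]) only matches at the very start of the input, even with the m flag.
const frontmatter = /(?<![^])---\r?\n((?:^(?!---$).*\r?\n)*)^---$/m;

// Same pattern with a plain ^, which the m flag turns into a line anchor.
const lineAnchored = /^---\r?\n((?:^(?!---$).*\r?\n)*)^---$/m;

console.log(frontmatter.test("---a"));                // false
console.log(frontmatter.test("---\n---"));            // true
console.log(frontmatter.exec("---\na: b\n---")[1]);   // "a: b\n"

// A block that does not start the file is rejected only by the buffer anchor.
console.log(frontmatter.test("x\n---\n---"));  // false
console.log(lineAnchored.test("x\n---\n---")); // true
```

The last two lines show what the non-m anchor buys: with a plain ^ under the m flag, a block in the middle of the file would be accepted as well.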