Supported Modifier Flags

Question

rbuckton opened this issue 4 years ago · 10 comments

In the October 2021 plenary, @michaelficarra asked that we outline and provide motivating examples for each flag we are considering as a supported modifier.

The flags currently under consideration are:

i — ignore-case
- Rationale — Toggling ignore-case is especially useful when matching patterns with varying case sensitivity, or when parsing patterns provided via JSON configuration. It also helps when working with complex Unicode character ranges.
- Example — Match an uppercase ASCII letter followed by one or more uppercase or lowercase ASCII letters or apostrophes (')
```
const re = /^[A-Z](?i)[a-z']+$/;
re.test("O'Neill"); // true
re.test("o'neill"); // false

// alternatively (defaulting to ignore-case):
const re2 = /^(?-i:[A-Z])[a-z']+$/i;
```
- Example — Match word starting with D followed by word starting with D or d (from .NET documentation, see ¹)
```
const re = /\b(D\w+)(?ix)\s(d\w+)\b/g;
const input = "double dare double Double a Drooling dog The Dreaded Deep";
re.exec(input); // ["Drooling dog", "Drooling", "dog"]
re.exec(input); // ["Dreaded Deep", "Dreaded", "Deep"]
```
m — multiline
- Rationale — Flexibility in matching beginning-of-buffer vs. beginning-of-line or end-of-buffer vs. end-of-line in a complex pattern.
- Example — Match a frontmatter block at the start of a file
```
const re = /^---(?m)$((?:^(?!---$).*$)*)^---$/;
re.test("---a"); // false
re.test("---\n---"); // true
re.test("---\na: b\n---"); // true
```

s — dot-all (i.e., "single line")

Rationale — Control over . matching semantics within a pattern.

Example

const re = /a.c(?s:.)*x.z/;
re.test("a\ncx\nz"); // false
re.test("abcdxyz"); // true
re.test("aBc\nxYz"); // true

x — Extended Mode. This flag is proposed by https://github.com/tc39/proposal-regexp-x-mode

Rationale — Would allow control over significant whitespace handling in a pattern.

Example — Disabling x mode when composing a complex pattern:

const idPattern = String.raw`[a-z]{2} \d{4}`; // space required; String.raw keeps the backslash in \d
const re = new RegExp(String.raw`
  # match the id
  (?<id>(?-x:${idPattern}))
  
  # match a separator
  :\s
  
  # match the value
  (?<value>\w+)
`, "x");

re.exec("aa0123: foo")?.groups; // undefined
re.exec("aa 0123: foo")?.groups; // { id: "aa 0123", value: "foo" }

Flags likely too complex to support:

u — Unicode. This flag affects how a pattern is parsed, not how it is matched. Supporting it would likely require a cover grammar and additional static semantics.
v — Extended Unicode. This flag is proposed by https://github.com/tc39/proposal-regexp-set-notation as an extension of the u flag and would have the same difficulties.

Flags that will never be supported:

g — Global. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid pattern would have no effect.
y — Sticky. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid pattern would have no effect.
d — Indices. This flag affects the match result. Changing it mid pattern would have no effect.

¹ https://docs.microsoft.com/en-us/dotnet/standard/base-types/miscellaneous-constructs-in-regular-expressions#inline-options

Answer 1 · 2021-11-22T20:43:57.000Z

For the examples, can you share how you'd do it without the relevant proposal?

Answer 2 · 2021-11-23T01:28:31.000Z

`i`

Simple cases like /[A-Z][A-Za-z]/ are trivial:

// match an uppercase ASCII letter followed by a mixed-case ASCII letter

// with 'i' modifier:
/[A-Z](?i)[A-Z]/

// without 'i' modifier:
/[A-Z][A-Za-z]/

However, more complex cases are far from trivial:

// match a mixed case "hello" followed by the exact characters "World"

// with 'i' modifier:
/(?i:hello) World/

// without 'i' modifier:
/[Hh][Ee][Ll][Ll][Oo] World/
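
For ASCII-only literal text, the expansion above can be generated mechanically. The following is a minimal sketch only: `asciiCaseInsensitive` is a hypothetical helper (not part of the proposal or of any library), it handles literal characters rather than arbitrary pattern syntax, and it does not attempt Unicode case folding.

```javascript
// Hypothetical helper: expands each ASCII letter of a literal string into a
// two-character class, e.g. "h" -> "[Hh]". Other characters are escaped and
// kept as-is. Only handles literal text, not arbitrary pattern syntax.
function asciiCaseInsensitive(literal) {
  return Array.from(literal, (ch) => {
    if (/[A-Za-z]/.test(ch)) {
      return `[${ch.toUpperCase()}${ch.toLowerCase()}]`;
    }
    // Escape regex metacharacters so the character matches itself.
    return ch.replace(/[\\^$.*+?()[\]{}|\/]/g, "\\$&");
  }).join("");
}

const source = `^${asciiCaseInsensitive("hello")} World$`;
const re = new RegExp(source);

console.log(source);                 // ^[Hh][Ee][Ll][Ll][Oo] World$
console.log(re.test("HeLLo World")); // true
console.log(re.test("hello world")); // false ("World" stays case-sensitive)
```

Even with a helper, the generated pattern is harder to read than `(?i:hello) World`, which is the point of the comparison.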

`m`

If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode:

// with 'm' modifier:
/^---(?m)$((?:^(?!---$).*$)*)^---$/

// without the 'm' modifier, in 'u' mode:
/\A---$((?:^(?!---$).*$)*)^---$/mu

// without the 'm' modifier, not in 'u' mode: not possible to invert when in 'm' mode

`s`

It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:

// match /a.b/ outside of 's' mode, then /.+/ in 's' mode, then /c.d/ outside of 's' mode
// with 's' modifier
/a.b(?s:.)+c.d/

// without 's' modifier
/a.b(?:.|[\r\n\u2028\u2029])+c.d/

// match /a.b/ inside of 's' mode, then /.+/ outside of 's' mode, then /c.d/ inside of 's' mode
// with 's' modifier
/a.b(?-s:.+)c.d/s

// without 's' modifier
/a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s
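
The two modifier-free forms run on engines that predate the proposal, so their behavior can be checked directly. A small sketch; the variable names and test strings are made up for illustration:

```javascript
// Without the `s` flag, `.` skips line terminators, so they are added back
// explicitly to get dot-all behavior for the middle part only.
const dotAllInside = /a.b(?:.|[\r\n\u2028\u2029])+c.d/;

// With the `s` flag, a negative lookahead excludes line terminators to get
// non-dot-all behavior for the middle part only.
const dotAllOutside = /a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s;

console.log(dotAllInside.test("a-b\n\nc-d"));  // true: middle part spans lines
console.log(dotAllInside.test("a\nb--c-d"));   // false: `a.b` is not dot-all
console.log(dotAllOutside.test("a\nb--c\nd")); // true: outer parts are dot-all
console.log(dotAllOutside.test("a\nb\nc\nd")); // false: middle part is not
```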

Answer 3 · 2021-11-23T03:30:51.000Z

There's nothing with [^\s\S] for the dotAll case?

Answer 4 · 2021-11-23T05:27:35.000Z

I'm not sure I understand what you mean. Can you clarify?

Answer 5 · 2021-11-23T05:29:05.000Z

If you mean using [\s\S] to match everything, that's feasible for the first s example, sure. I don't see how it helps with the second example though.

Answer 6 · 2022-03-16T13:07:25.000Z

I just want to share a little trick to emulate m and non-m mode without using ^ and $. This might be relevant for transpilers.

- /^ $/ == /(?<![\s\S]) (?![\s\S])/
- /^ $/m == /(?<!.) (?!.)/ // no `s` flag!

This works for both u and non-u mode.
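
A quick way to check the equivalences is to compare the positions at which each assertion matches. This is a sketch that assumes an engine with lookbehind and `String.prototype.matchAll`; `positions` is a hypothetical helper written for this example.

```javascript
// Collects the index of every (zero-width) match of `re` in `input`.
// `matchAll` requires the `g` flag on `re`.
function positions(re, input) {
  return Array.from(input.matchAll(re), (m) => m.index);
}

const input = "ab\ncd";

// Non-multiline anchors: only the buffer boundaries.
const bufferStart = positions(/(?<![\s\S])/g, input); // [0], same as /^/g
const bufferEnd = positions(/(?![\s\S])/g, input);    // [5], same as /$/g

// Multiline anchors: also next to the line terminator (note: no `s` flag).
const lineStarts = positions(/(?<!.)/g, input); // [0, 3], same as /^/gm
const lineEnds = positions(/(?!.)/g, input);    // [2, 5], same as /$/gm

console.log(bufferStart, bufferEnd, lineStarts, lineEnds);
```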

Answer 7 · 2022-06-07T20:13:52.000Z

The modifiers supported by this proposal will be limited to i, m, and s. Future proposals (such as the x-mode proposal) may extend this set, but doing so is out of scope here.
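
Code that has to run on engines with and without modifier support can probe for each flag at runtime. The following is a sketch only; `supportsModifier` is a hypothetical helper, and it relies on `new RegExp` throwing a `SyntaxError` for syntax the engine does not recognize.

```javascript
// Returns true if the engine accepts `(?X:...)` as a modifier group for the
// given flag letter X. Unsupported syntax makes `new RegExp` throw, so this
// is safe to run on engines that predate the proposal.
function supportsModifier(flag) {
  try {
    new RegExp(`(?${flag}:a)`);
    return true;
  } catch {
    return false;
  }
}

const support = Object.fromEntries(
  ["i", "m", "s", "u", "g"].map((flag) => [flag, supportsModifier(flag)])
);

// On an engine that implements the proposal, i, m and s are expected to be
// true. u and g are not valid modifiers, so they are false either way.
console.log(support);
```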

Answer 8 · 2024-05-30T15:47:14.000Z

@rbuckton, I know this is already closed (and implemented in V8, yay!), but for interest's sake, note that it's very possible to emulate presence or lack of s and m. I do it in regex-make to locally apply the presence or absence of local flags for RegExp instances interpolated into a template.

m

> If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode.

Emulating is possible without u mode or buffer boundaries.

Emulate an m mode ^: (?<=^|[\n\r\u2028\u2029])
Emulate a non-m mode ^: (?<![^])
Emulate an m mode $: (?=$|[\n\r\u2028\u2029])
Emulate a non-m mode $: (?![^])

s

> It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:
> [...]
> // without 's' modifier
> /a.b(?:.|[\r\n\u2028\u2029])+c.d/
>
> [...]
> // without 's' modifier
> /a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s

It's easier than that:

Emulate an s mode .: [^]
Emulate a non-s mode .: [[^]--[\n\r\u2028\u2029]] (with v) or (?:(?![\n\r\u2028\u2029]).) for a less efficient version without v (same as you showed in the quote).

Note that, like your (?:.|[\r\n\u2028\u2029]) example, [^] either matches full code points or doesn't based on the presence of flag u/v.

Answer 9 · 2024-05-30T19:41:54.000Z

> Emulate a non-s mode .: [[^]--[\n\r\u2028\u2029]] (with v) or (?:(?![\n\r\u2028\u2029]).) for a less efficient version without v (same as you showed in the quote).

Or [^\n\r\u2028\u2029].
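
Both short forms can be checked against the four line terminators. A sketch with made-up variable names; the last two lines illustrate the code unit versus code point behavior mentioned for `[^]`:

```javascript
const terminators = ["\n", "\r", "\u2028", "\u2029"];

// `[^]` behaves like `.` under the `s` flag: it matches any character.
const anyChar = /^[^]$/;
// A negated class behaves like `.` without the `s` flag.
const nonTerminator = /^[^\n\r\u2028\u2029]$/;

console.log(terminators.every((ch) => anyChar.test(ch)));        // true
console.log(terminators.every((ch) => !nonTerminator.test(ch))); // true
console.log(nonTerminator.test("x"));                            // true

// Without `u`/`v`, `[^]` matches one UTF-16 code unit; with `u` it matches
// a full code point. U+1F600 takes two code units.
console.log(/^[^]$/.test("\u{1F600}"));  // false
console.log(/^[^]$/u.test("\u{1F600}")); // true
```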

Answer 10 · 2025-01-10T18:39:42.000Z

FWIW: For the example “Match a frontmatter block at the start of a file” to work, line terminators have to be consumed. This works for me:

import assert from 'node:assert/strict';

const re = /(?-m:^)---\r?\n((?:^(?!---$).*\r?\n)*)^---$/m;
assert.equal(re.test('---a'), false);
assert.equal(re.test('---\n---'), true);
assert.equal(
  re.exec('---\n---')[1],
  ''
);
assert.equal(
  re.exec('---\na: b\n---')[1],
  'a: b\n'
);
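
On engines without modifier support, the `(?-m:^)` in the pattern above can be swapped for the non-m `^` emulation from Answer 8. This is a sketch that assumes lookbehind support; the last test string is made up to show that the block must sit at the very start of the input.

```javascript
// `(?<![^])` only matches at the very start of the input, even under `m`,
// so it stands in for `(?-m:^)`.
const re = /(?<![^])---\r?\n((?:^(?!---$).*\r?\n)*)^---$/m;

console.log(re.test("---a"));                               // false
console.log(re.test("---\n---"));                           // true
console.log(JSON.stringify(re.exec("---\na: b\n---")[1]));  // "a: b\n"
console.log(re.test("x\n---\n---"));                        // false: not at the start
```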
