tc39/proposal-regexp-modifiers

Supported Modifier Flags

rbuckton opened this issue · 10 comments

In the October 2021 plenary, @michaelficarra asked that we outline and provide motivating examples for each flag we are considering as a supported modifier.

The flags currently under consideration are:

  • i — ignore-case
    • Rationale — Toggling ignore-case is especially useful when matching patterns with varying case sensitivity, when parsing patterns provided via JSON configuration, and when working with complex Unicode character ranges.
    • Example — Match an uppercase ASCII letter followed by one or more uppercase or lowercase ASCII letters or apostrophes (')
      const re = /^[A-Z](?i)[a-z']+$/;
      re.test("O'Neill"); // true
      re.test("o'neill"); // false
      
      // alternatively (defaulting to ignore-case):
      const re2 = /^(?-i:[A-Z])[a-z']+$/i;
    • Example — Match a word starting with D followed by a word starting with D or d (from the .NET documentation, see footnote 1)
      const re = /\b(D\w+)(?ix)\s(d\w+)\b/g;
      const input = "double dare double Double a Drooling dog The Dreaded Deep";
      re.exec(input); // ["Drooling dog", "Drooling", "dog"]
      re.exec(input); // ["Dreaded Deep", "Dreaded", "Deep"]
  • m — multiline
    • Rationale — Flexibility in matching beginning-of-buffer vs. beginning-of-line or end-of-buffer vs. end-of-line in a complex pattern.
    • Example — Match a frontmatter block at the start of a file
      const re = /^---(?m)$((?:^(?!---$).*$)*)^---$/;
      re.test("---a"); // false
      re.test("---\n---"); // true
      re.test("---\na: b\n---"); // true
  • s — dot-all (i.e., "single line")
    • Rationale — Control over . matching semantics within a pattern.
    • Example
      const re = /a.c(?s:.)*x.z/;
      re.test("a\ncx\nz"); // false
      re.test("abcdxyz"); // true
      re.test("aBc\nxYz"); // true
  • x — Extended Mode. This flag is proposed by https://github.com/tc39/proposal-regexp-x-mode
    • Rationale — Would allow control over significant whitespace handling in a pattern.
    • Example — Disabling x mode when composing a complex pattern:
      const idPattern = String.raw`[a-z]{2} \d{4}`; // space required; String.raw keeps the backslash in \d
      const re = new RegExp(String.raw`
        # match the id
        (?<id>(?-x:${idPattern}))
        
        # match a separator
        :\s
        
        # match the value
        (?<value>\w+)
      `, "x");
      
      re.exec("aa0123: foo")?.groups; // undefined
      re.exec("aa 0123: foo")?.groups; // { id: "aa 0123", value: "foo" }

Flags likely too complex to support:

  • u — Unicode. This flag affects how a pattern is parsed, not how it is matched. Supporting it would likely require a cover grammar and additional static semantics.
  • v — Extended Unicode. This flag is proposed by https://github.com/tc39/proposal-regexp-set-notation as an extension of the u flag and would have the same difficulties.

Flags that will never be supported:

  • g — Global. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid-pattern would have no effect.
  • y — Sticky. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid-pattern would have no effect.
  • d — Indices. This flag affects the match result. Changing it mid-pattern would have no effect.

Footnotes

  1. https://docs.microsoft.com/en-us/dotnet/standard/base-types/miscellaneous-constructs-in-regular-expressions#inline-options

For the examples, can you share how you'd do it without the relevant proposal?

i

Simple cases like /[A-Z][A-Za-z]/ are trivial:

// match an uppercase ASCII letter followed by a mixed-case ASCII letter

// with 'i' modifier:
/[A-Z](?i)[A-Z]/

// without 'i' modifier:
/[A-Z][A-Za-z]/

However, more complex cases are far from trivial:

// match a mixed case "hello" followed by the exact characters "World"

// with 'i' modifier:
/(?i:hello) World/

// without 'i' modifier:
/[Hh][Ee][Ll][Ll][Oo] World/
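
The character-class expansion above is mechanical for ASCII letters, so it could be generated rather than written by hand. The following is a minimal sketch under that ASCII-only assumption; `expandIgnoreCase` is a hypothetical helper name used for illustration, not part of the proposal or of any library, and it does not attempt Unicode case folding.

```javascript
// Hypothetical helper: expands an ASCII literal into a pattern source that
// matches it case-insensitively, e.g. "hello" -> "[Hh][Ee][Ll][Ll][Oo]".
// Assumption: ASCII input only; no Unicode case folding is performed.
function expandIgnoreCase(literal) {
  let source = "";
  for (const ch of literal) {
    if (/[A-Za-z]/.test(ch)) {
      source += `[${ch.toUpperCase()}${ch.toLowerCase()}]`;
    } else {
      // Escape characters that have syntactic meaning in a pattern.
      source += ch.replace(/[\\^$.*+?()[\]{}|\/]/g, "\\$&");
    }
  }
  return source;
}

const helloWorld = new RegExp(`${expandIgnoreCase("hello")} World`);
console.log(helloWorld.test("HeLLo World")); // true
console.log(helloWorld.test("hello world")); // false ("World" stays case-sensitive)
```

This only illustrates how much machinery the one-line `(?i:hello)` replaces.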

m

If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode:

// with 'm' modifier:
/^---(?m)$((?:^(?!---$).*$)*)^---$/

// without the 'm' modifier, in 'u' mode:
/\A---$((?:^(?!---$).*$)*)^---$/mu

// without the 'm' modifier, not in 'u' mode: not possible to invert when in 'm' mode

s

It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:

// match /a.b/ outside of 's' mode, then /.+/ in 's' mode, then /c.d/ outside of 's' mode
// with 's' modifier
/a.b(?s:.)+c.d/

// without 's' modifier
/a.b(?:.|[\r\n\u2028\u2029])+c.d/

// match /a.b/ inside of 's' mode, then /.+/ outside of 's' mode, then /c.d/ inside of 's' mode
// with 's' modifier
/a.b(?-s:.+)c.d/s

// without 's' modifier
/a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s
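
As a sanity check of the two modifier-free rewrites above, the sketch below compares each emulated dot against the native dot on the four line terminators and on two ordinary characters. It runs on engines without modifier support; the sample set and variable names are chosen only for illustration.

```javascript
// Characters that `.` rejects without the `s` flag, plus two ordinary ones.
const samples = ["\n", "\r", "\u2028", "\u2029", "a", " "];

// Emulation of an s-mode dot in a pattern compiled without `s`.
const emulatedDotAll = /^(?:.|[\r\n\u2028\u2029])$/;
const nativeDotAll = /^.$/s;

// Emulation of a non-s-mode dot in a pattern compiled with `s`.
const emulatedPlainDot = /^(?:(?![\r\n\u2028\u2029]).)$/s;
const nativePlainDot = /^.$/;

const emulationsAgree = samples.every(
  (ch) =>
    emulatedDotAll.test(ch) === nativeDotAll.test(ch) &&
    emulatedPlainDot.test(ch) === nativePlainDot.test(ch)
);
console.log(emulationsAgree); // true
```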

There's nothing with [^\s\S] for the dotAll case?

I'm not sure I understand what you mean. Can you clarify?

If you mean using [\s\S] to match everything, that's feasible for the first s example, sure. I don't see how it helps with the second example though.

I just want to share a little trick to emulate m and non-m mode without using ^ and $. This might be relevant for transpilers.

- /^ $/ == /(?<![\s\S]) (?![\s\S])/
- /^ $/m == /(?<!.) (?!.)/ // no `s` flag!

This works for both u and non-u mode.
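
One way to check the trick is to compare the positions at which the real anchors and their lookaround equivalents match. This is a small sketch; `positions` is a hypothetical helper and the sample text is arbitrary.

```javascript
// Hypothetical helper: collect every index at which a zero-width pattern matches.
// The pattern must carry the `g` flag, as String.prototype.matchAll requires.
function positions(re, input) {
  return [...input.matchAll(re)].map((m) => m.index);
}

const text = "one\ntwo\r\nthree";
const same = (a, b) => String(a) === String(b);

// m-mode anchors vs. the dot-based lookarounds (compiled without `s`).
const sameLineStarts = same(positions(/^/gm, text), positions(/(?<!.)/g, text));
const sameLineEnds = same(positions(/$/gm, text), positions(/(?!.)/g, text));

// Non-m-mode anchors vs. the [\s\S]-based lookarounds.
const sameBufferStart = same(positions(/^/g, text), positions(/(?<![\s\S])/g, text));
const sameBufferEnd = same(positions(/$/g, text), positions(/(?![\s\S])/g, text));

console.log(sameLineStarts, sameLineEnds, sameBufferStart, sameBufferEnd);
// true true true true
```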

The modifiers supported by this proposal will be limited to i, m, and s. Future proposals (such as the x-mode proposal) may extend this set, but doing so is out of scope for this proposal.
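
Because a regex literal containing a modifier group is an early syntax error on engines that do not implement the proposal, code that wants to use modifiers conditionally has to go through the RegExp constructor. The following is a hedged sketch of such a check; `supportsRegExpModifiers` is a hypothetical name, not an API defined by the proposal.

```javascript
// Hypothetical feature check: engines without modifier support throw a
// SyntaxError when compiling (?i:...), so probe through the constructor.
function supportsRegExpModifiers() {
  try {
    new RegExp("(?i:a)");
    return true;
  } catch {
    return false;
  }
}

if (supportsRegExpModifiers()) {
  const re = new RegExp("(?i:hello) World");
  console.log(re.test("HELLO World")); // true
  console.log(re.test("HELLO world")); // false
} else {
  console.log("RegExp modifiers are not supported by this engine");
}
```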

@rbuckton, I know this is already closed (and implemented in V8, yay!), but for interest's sake, note that it's very possible to emulate presence or lack of s and m. I do it in regex-make to locally apply the presence or absence of local flags for RegExp instances interpolated into a template.

m

If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode

Emulating is possible without u mode or buffer boundaries.

  • Emulate an m mode ^: (?<=^|[\n\r\u2028\u2029])
  • Emulate a non-m mode ^: (?<![^])
  • Emulate an m mode $: (?=$|[\n\r\u2028\u2029])
  • Emulate a non-m mode $: (?![^])
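
To illustrate that these emulations behave the same whether or not the m flag is set, the sketch below combines a buffer-start anchor with a line-end anchor in one pattern and compiles it both ways. The constant names and the `---` example pattern are assumptions made for illustration.

```javascript
// Flag-independent anchor emulations taken from the list above.
const LINE_TERMINATOR = String.raw`[\n\r\u2028\u2029]`;
const BUFFER_START = "(?<![^])";             // behaves like a non-m-mode ^
const LINE_END = `(?=$|${LINE_TERMINATOR})`; // behaves like an m-mode $

// Illustrative pattern: "---" at the start of the buffer, alone on its line.
const fenceSource = `${BUFFER_START}---${LINE_END}`;

function opensWithFence(input, flags) {
  return new RegExp(fenceSource, flags).test(input);
}

for (const flags of ["", "m"]) {
  console.log(
    JSON.stringify(flags),
    opensWithFence("---\ntitle: x", flags), // true
    opensWithFence("intro\n---\n", flags),  // false: not at buffer start
    opensWithFence("---a", flags)           // false: not at a line end
  );
}
```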

s

It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:

[...]
// without 's' modifier
/a.b(?:.|[\r\n\u2028\u2029])+c.d/

[...]
// without 's' modifier
/a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s

It's easier than that:

  • Emulate an s mode .: [^]
  • Emulate a non-s mode .: [[^]--[\n\r\u2028\u2029]] (with v) or (?:(?![\n\r\u2028\u2029]).) for a less efficient version without v (same as you showed in the quote).

Note that, like your (?:.|[\r\n\u2028\u2029]) example, [^] either matches full code points or doesn't based on the presence of flag u/v.

  • Emulate a non-s mode .: [[^]--[\n\r\u2028\u2029]] (with v) or (?:(?![\n\r\u2028\u2029]).) for a less efficient version without v (same as you showed in the quote).

Or [^\n\r\u2028\u2029].
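
A short sketch of how the class-based emulations behave, including the code point versus code unit difference noted above. The astral sample character is an arbitrary choice for illustration.

```javascript
const astral = "\u{1F600}"; // one code point, two UTF-16 code units

// [^] behaves like an s-mode dot: it matches line terminators too.
console.log(/^[^]$/.test("\n")); // true

// Without u/v, [^] consumes a single code unit, so it cannot span the pair.
console.log(/^[^]$/.test(astral));  // false
console.log(/^[^]$/u.test(astral)); // true

// The negated class behaves like a non-s-mode dot.
const plainDot = /^[^\n\r\u2028\u2029]$/;
console.log(plainDot.test("a"));      // true
console.log(plainDot.test("\n"));     // false
console.log(plainDot.test("\u2029")); // false
```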

FWIW: For the example “Match a frontmatter block at the start of a file” to work, line terminators have to be consumed. This works for me:

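import assert from 'node:assert/strict';

// (?-m:^) matches only at the start of the input, even though the pattern uses the m flag.
const re = /(?-m:^)---\r?\n((?:^(?!---$).*\r?\n)*)^---$/m;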
assert.equal(re.test('---a'), false);
assert.equal(re.test('---\n---'), true);
assert.equal(
  re.exec('---\n---')[1],
  ''
);
assert.equal(
  re.exec('---\na: b\n---')[1],
  'a: b\n'
);