/proposal-regexp-atomic-operators

Primary LanguageJavaScriptBSD 3-Clause "New" or "Revised" LicenseBSD-3-Clause

Regular Expression Atomic Operators for ECMAScript

This proposal seeks to introduce syntax to ECMAScript regular expressions to control backtracking in certain scenarios by treating certain portions of a pattern as "atomic", where backtracking information specific to that portion of the pattern is discarded when it matches successfully.

For example, currently the pattern a(bc|b)c will match both "abcc" and "abc". In the case of "abcc", we match the term a, then the alternative bc, and finally the term c. In the case of "abc", we match the term a, then the alternative bc (consuming the rest of the input), but fail to match the final term c. Since the rest of the match failed, we backtrack to the beginning of the disjunction bc|b and attempt the alternative. We then match the alternative b, and finally the term c.

In some cases such backtracking is unwanted or is potentially catastrophic. To address such cases, we propose the addition of Atomic Groups ((?> Disjunction )) and Possessive Quantifiers (n*+, n++, etc.). In an Atomic Group we discard backtracking information for the group when it matches successfully. If the rest of the pattern following the atomic group fails to match, we would no longer backtrack to the beginning of the group in an attempt to process any alternatives.

In the case of a(?>bc|b)c, we would still match "abcc" but would no longer match "abc". In the case of "abc", we match the term a, then the alternative bc, but fail to match the term c. Since backtracking information for the disjunction bc|b was discarded upon the successful match of bc, we cannot attempt the alternative b. As a result, the match for "abc" fails.

It is worth noting that lookaround operators (i.e., (?=), (?!), (?<=), and (?<!)) are already atomic operations.

Status

Stage: 1
Champion: Ron Buckton (@rbuckton)

For detailed status of this proposal see TODO, below.

Authors

Motivations

NOTE: See https://github.com/rbuckton/proposal-regexp-features for an overview of how this proposal fits into other possible future features for Regular Expressions.

This proposal seeks to introduce new syntax to allow users fine control the performance characteristics of regular expressions through possessive quantifiers and backtracking control.

Prior Art

See https://rbuckton.github.io/regexp-features/features/possessive-quantifiers.html and https://rbuckton.github.io/regexp-features/features/non-backtracking-expressions.html for additional information.

Syntax

Atomic Groups

An Atomic Group is a non-backtracking expression which is matched independent of neighboring patterns, and will not backtrack in the event of a failed match. This is often used to improve performance.

  • (?>pattern) — Matches the provided pattern, but no backtracking is performed if the match fails.

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.

Possessive Quantifiers

Possessive quantifiers are like normal (a.k.a. "greedy") quantifiers, but do not backtrack if the rest of the pattern to the right fails to match. Possessive quantifiers are often used as a performance tweak to avoid catastrophic backtracking.

  • *+ — Match zero or more instances of the preceding atom without backtracking.
  • ++ — Match one or more instances of the preceding atom without backtracking.
  • ?+ — Match zero or one instances of the preceding atom without backtracking.
  • {n}+ — Where n is an integer. Matches the preceding atom exactly n times without backtracking.
  • {n,}+ — Where n is an integer. Matches the preceding atom at-least n times without backtracking.
  • {n,m}+ — Where n and m are integers, and m >= n. Matches the preceding atom at-least n times and at-most m times without backtracking.

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.

Possessive quantifiers are syntactic sugar over (?> Disjunction ):

  • atom*+ is equivalent to (?>atom*)
  • atom++ is equivalent to (?>atom+)
  • atom?+ is equivalent to (?>atom?)
  • atom{n}+ is equivalent to (?>atom{n})
  • atom{n,}+ is equivalent to (?>atom{n,})
  • atom{n,m}+ is equivalent to (?>atom{n,m})

Examples

// NOTE: x-mode flag used to illustrate difference
// without atomic groups:
const re1 = /\((      [^()]+   | \([^()]*\))+ \)/x;
re1.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); // can take several seconds to fail

// with atomic groups
const re2 = /\((  (?> [^()]+ ) | \([^()]*\))+ \)/x;
//                ^^^--------^ atomic group
re2.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); // significantly faster as less backtracking is involved

// with posessive quantifiers
const re2 = /\((      [^()]++  | \([^()]*\))+ \)/x;
//                         ^^- possessive quantifier
re2.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); // significantly faster as less backtracking is involved

See https://github.com/rbuckton/proposal-regexp-x-mode for more information on the x flag.

History

  • October 28, 2021 — Proposed for Stage 1 (slides)
    • Outcome: Remained at Stage 0

TODO

The following is a high-level list of tasks to progress through each stage of the TC39 proposal process:

Stage 1 Entrance Criteria

  • Identified a "champion" who will advance the addition.
  • Prose outlining the problem or need and the general shape of a solution.
  • Illustrative examples of usage.
  • High-level API.

Stage 2 Entrance Criteria

Stage 3 Entrance Criteria

Stage 4 Entrance Criteria

  • Test262 acceptance tests have been written for mainline usage scenarios and merged.
  • Two compatible implementations which pass the acceptance tests: [1], [2].
  • A pull request has been sent to tc39/ecma262 with the integrated spec text.
  • The ECMAScript editor has signed off on the pull request.