DmitrySoshnikov/regexp-tree

Nested capture group numbering is inverted

davisjam opened this issue · 6 comments

The regex /(((a)b)c)/ contains 3 groups. Nested groups are numbered left-to-right by left parentheses. So backreference \3 matches "a", backreference \2 matches "ab", and backreference \1 matches "abc".

For example:

(14:43:51) jamie@woody ~/Desktop/EcosystemRegexps/analyze-regexps/structural-analysis $ node
> /(((a)b)c)\1/.exec('abcabc')
[ 'abcabc',
  'abc',
  'ab',
  'a',
  index: 0,
  input: 'abcabc',
  groups: undefined ]
> /(((a)b)c)\2/.exec('abcab')
[ 'abcab',
  'abc',
  'ab',
  'a',
  index: 0,
  input: 'abcab',
  groups: undefined ]
> /(((a)b)c)\3/.exec('abca')
[ 'abca',
  'abc',
  'ab',
  'a',
  index: 0,
  input: 'abca',
  groups: undefined ]
> 

I am running regexp-tree v0.0.85.

It looks to me like nested groups are numbered in reverse, with the outermost given the largest number and the innermost given the smallest.

re: /(((a*)))\1/
pattern /(((a*)))\1/ yields AST:
{
  "type": "RegExp",
  "body": {
    "type": "Alternative",
    "expressions": [
      {
        "type": "Group",
        "capturing": true,
        "number": 3,
        "expression": {
          "type": "Group",
          "capturing": true,
          "number": 2,
          "expression": {
            "type": "Group",
            "capturing": true,
            "number": 1,
            "expression": {
              "type": "Repetition",
              "expression": {
                "type": "Char",
                "value": "a",
                "kind": "simple",
                "symbol": "a",
                "codePoint": 97
              },
              "quantifier": {
                "type": "Quantifier",
                "kind": "*",
                "greedy": true
              }
            }
          }
        }
      },
      {
        "type": "Backreference",
        "kind": "number",
        "number": 1,
        "reference": 1
      }
    ]
  },
  "flags": ""
}

I believe this is due to the recursive nature of the parsing.

regexp.bnf includes:

Capturing Group: ... | L_PAREN Disjunction R_PAREN
{
  capturingGroupsCount++
  // Create Node
}

Because we resolve the language recursively (I assume?), we update capturingGroupsCount and create the Node for inner groups before outer groups.

Given that inner nodes resolve before outer ones, I don't see a good way to address this issue in the parser.
I think the cleanest solution might be to make a second pass over the AST after it is built and do capture group re-numbering at that point.

Thus I would suggest a change in either:

  • src/regexp-tree.js: regexpTree.parse
  • src/parser/index.js: regexpTreeParser.parser

One advantage of doing it in src/regex-tree.js is that you have access to traverse at that point.
But since parser is exported by src/regex-tree.js perhaps it should be hidden inside the parser.

Here's how I would do left-to-right numbering in src/regexp-tree.js. This doesn't fix backreferences though, see below.

diff --git a/src/regexp-tree.js b/src/regexp-tree.js
index 15c9844..1d5fcc0 100644
--- a/src/regexp-tree.js
+++ b/src/regexp-tree.js
@@ -51,7 +51,22 @@ const regexpTree = {
    * @return Object AST
    */
   parse(regexp, options) {
-    return parser.parse(`${regexp}`, options);
+    let ast = parser.parse(`${regexp}`, options);
+
+    let nextCaptureGroup = 1;
+    traverse.traverse(ast, {
+        'Group': {
+          pre({node}) {
+            if (node.capturing) {
+              node.number = nextCaptureGroup;
+              nextCaptureGroup++;
+            }
+          }
+        }
+      }
+    );
+
+    return ast;

This behavior also introduces bugs in the interpretation of backreferences.

Consider:

((a)\2)

In regexp-tree this backreference would be interpreted as referring to the outer group (see regexp.bnf excerpt below), which would not yet have been defined. So it will be rendered as a Char instead.

/**
 * Parses patterns like \1, \2, etc. either as a backreference
 * to a group, or a deciaml char code.
 */
function GroupRefOrDecChar(text, textLoc) {
  const reference = Number(text.slice(1));

  if (reference > 0 && reference <= capturingGroupsCount) {
    return Node({
      type: 'Backreference',
      kind: 'number',
      number: reference,
      reference,
    }, textLoc);
  }

  return Char(text, 'decimal', textLoc);
}

Current behavior

Incorrectly interpreting \2 as a not-yet-defined backreference:

(15:41:16) jamie@woody ~/Desktop/floss/regexp-tree $ ./bin/regexp-tree --expression '/((a)\2)/'
{
  "type": "RegExp",
  "body": {
    "type": "Group",
    "capturing": true,
    "number": 2,
    "expression": {
      "type": "Alternative",
      "expressions": [
        {
          "type": "Group",
          "capturing": true,
          "number": 1,
          "expression": {
            "type": "Char",
            "value": "a",
            "kind": "simple",
            "symbol": "a",
            "codePoint": 97
          }
        },
        {
          "type": "Char",
          "value": "\\2",
          "kind": "decimal",
          "symbol": "\u0002",
          "codePoint": 2
        }
      ]
    }
  },
  "flags": ""
}

And incorrectly identifying the inner group as backreference 1:

$ ./bin/regexp-tree --expression '/((a)\1)/'
{
  "type": "RegExp",
  "body": {
    "type": "Group",
    "capturing": true,
    "number": 2,
    "expression": {
      "type": "Alternative",
      "expressions": [
        {
          "type": "Group",
          "capturing": true,
          "number": 1,
          "expression": {
            "type": "Char",
            "value": "a",
            "kind": "simple",
            "symbol": "a",
            "codePoint": 97
          }
        },
        {
          "type": "Backreference",
          "kind": "number",
          "number": 1,
          "reference": 1
        }
      ]
    }
  },
  "flags": ""
}

I don't see a good fix for the "misinterpret \2 as a Char group" unless we just label backreferences as "possible backreference" and fix them during my proposed post-processing pass.

@davisjam, good catches! I'll try to take a look into this soon. Ideally we'll want to avoid double-passes over the AST at parsing level. Hopefully, it'll be possible; if not, yes, we'll need to have the second pass.

Yeah, we'll need to do a post-parse processing, but instead of doing the full pass of the whole AST, we'll just need to patch the capturing Group.

For this, during the building the AST, we can collector the Group nodes to some set, and then just patch each node.

Something like this:

const groupNodesToPatch = new Set();
...
Capturing Group: ... | L_PAREN Disjunction R_PAREN
{
  capturingGroupsCount++
  // 1. Create Node
  // 2. Save node to update its index in post-processing
  
  if (node.capturing) {
    groupNodesToPatch.add(node);
  }
}

And later on:

for (const group of groupNodesToPatch) {
  node.number = capturingGroupsCount - groupNodesToPatch.number;
}

I'll try to reach it later, and will appreciate a PR in case you get earlier to it.

I took an approach along these lines in #165. I can't quite see how to make it work, because complete group number inversion is not correct either. Let's move the discussion on a fix to #165.