Nested capture group numbering is inverted
davisjam opened this issue · 6 comments
The regex /(((a)b)c)/
contains 3 groups. Nested groups are numbered left-to-right by left parentheses. So backreference \3
matches "a", backreference \2
matches "ab", and backreference \1
matches "abc".
For example:
(14:43:51) jamie@woody ~/Desktop/EcosystemRegexps/analyze-regexps/structural-analysis $ node
> /(((a)b)c)\1/.exec('abcabc')
[ 'abcabc',
'abc',
'ab',
'a',
index: 0,
input: 'abcabc',
groups: undefined ]
> /(((a)b)c)\2/.exec('abcab')
[ 'abcab',
'abc',
'ab',
'a',
index: 0,
input: 'abcab',
groups: undefined ]
> /(((a)b)c)\3/.exec('abca')
[ 'abca',
'abc',
'ab',
'a',
index: 0,
input: 'abca',
groups: undefined ]
>
I am running regexp-tree v0.0.85.
It looks to me like nested groups are numbered in reverse, with the outermost given the largest number and the innermost given the smallest.
re: /(((a*)))\1/
pattern /(((a*)))\1/ yields AST:
{
"type": "RegExp",
"body": {
"type": "Alternative",
"expressions": [
{
"type": "Group",
"capturing": true,
"number": 3,
"expression": {
"type": "Group",
"capturing": true,
"number": 2,
"expression": {
"type": "Group",
"capturing": true,
"number": 1,
"expression": {
"type": "Repetition",
"expression": {
"type": "Char",
"value": "a",
"kind": "simple",
"symbol": "a",
"codePoint": 97
},
"quantifier": {
"type": "Quantifier",
"kind": "*",
"greedy": true
}
}
}
}
},
{
"type": "Backreference",
"kind": "number",
"number": 1,
"reference": 1
}
]
},
"flags": ""
}
I believe this is due to the recursive nature of the parsing.
regexp.bnf includes:
Capturing Group: ... | L_PAREN Disjunction R_PAREN
{
capturingGroupsCount++
// Create Node
}
Because we resolve the language recursively (I assume?), we update capturingGroupsCount
and create the Node
for inner groups before outer groups.
Given that inner nodes resolve before outer ones, I don't see a good way to address this issue in the parser.
I think the cleanest solution might be to make a second pass over the AST after it is built and do capture group re-numbering at that point.
Thus I would suggest a change in either:
- src/regexp-tree.js:
regexpTree.parse
- src/parser/index.js:
regexpTreeParser.parser
One advantage of doing it in src/regex-tree.js
is that you have access to traverse
at that point.
But since parser
is exported by src/regex-tree.js
perhaps it should be hidden inside the parser.
Here's how I would do left-to-right numbering in src/regexp-tree.js
. This doesn't fix backreferences though, see below.
diff --git a/src/regexp-tree.js b/src/regexp-tree.js
index 15c9844..1d5fcc0 100644
--- a/src/regexp-tree.js
+++ b/src/regexp-tree.js
@@ -51,7 +51,22 @@ const regexpTree = {
* @return Object AST
*/
parse(regexp, options) {
- return parser.parse(`${regexp}`, options);
+ let ast = parser.parse(`${regexp}`, options);
+
+ let nextCaptureGroup = 1;
+ traverse.traverse(ast, {
+ 'Group': {
+ pre({node}) {
+ if (node.capturing) {
+ node.number = nextCaptureGroup;
+ nextCaptureGroup++;
+ }
+ }
+ }
+ }
+ );
+
+ return ast;
This behavior also introduces bugs in the interpretation of backreferences.
Consider:
((a)\2)
In regexp-tree
this backreference would be interpreted as referring to the outer group (see regexp.bnf excerpt below), which would not yet have been defined. So it will be rendered as a Char
instead.
/**
* Parses patterns like \1, \2, etc. either as a backreference
* to a group, or a deciaml char code.
*/
function GroupRefOrDecChar(text, textLoc) {
const reference = Number(text.slice(1));
if (reference > 0 && reference <= capturingGroupsCount) {
return Node({
type: 'Backreference',
kind: 'number',
number: reference,
reference,
}, textLoc);
}
return Char(text, 'decimal', textLoc);
}
Current behavior
Incorrectly interpreting \2
as a not-yet-defined backreference:
(15:41:16) jamie@woody ~/Desktop/floss/regexp-tree $ ./bin/regexp-tree --expression '/((a)\2)/'
{
"type": "RegExp",
"body": {
"type": "Group",
"capturing": true,
"number": 2,
"expression": {
"type": "Alternative",
"expressions": [
{
"type": "Group",
"capturing": true,
"number": 1,
"expression": {
"type": "Char",
"value": "a",
"kind": "simple",
"symbol": "a",
"codePoint": 97
}
},
{
"type": "Char",
"value": "\\2",
"kind": "decimal",
"symbol": "\u0002",
"codePoint": 2
}
]
}
},
"flags": ""
}
And incorrectly identifying the inner group as backreference 1:
$ ./bin/regexp-tree --expression '/((a)\1)/'
{
"type": "RegExp",
"body": {
"type": "Group",
"capturing": true,
"number": 2,
"expression": {
"type": "Alternative",
"expressions": [
{
"type": "Group",
"capturing": true,
"number": 1,
"expression": {
"type": "Char",
"value": "a",
"kind": "simple",
"symbol": "a",
"codePoint": 97
}
},
{
"type": "Backreference",
"kind": "number",
"number": 1,
"reference": 1
}
]
}
},
"flags": ""
}
I don't see a good fix for the "misinterpret \2 as a Char group" unless we just label backreferences as "possible backreference" and fix them during my proposed post-processing pass.
@davisjam, good catches! I'll try to take a look into this soon. Ideally we'll want to avoid double-passes over the AST at parsing level. Hopefully, it'll be possible; if not, yes, we'll need to have the second pass.
Yeah, we'll need to do a post-parse processing, but instead of doing the full pass of the whole AST, we'll just need to patch the capturing Group
.
For this, during the building the AST, we can collector the Group
nodes to some set, and then just patch each node.
Something like this:
const groupNodesToPatch = new Set();
...
Capturing Group: ... | L_PAREN Disjunction R_PAREN
{
capturingGroupsCount++
// 1. Create Node
// 2. Save node to update its index in post-processing
if (node.capturing) {
groupNodesToPatch.add(node);
}
}
And later on:
for (const group of groupNodesToPatch) {
node.number = capturingGroupsCount - groupNodesToPatch.number;
}
I'll try to reach it later, and will appreciate a PR in case you get earlier to it.
Fixed in 6709afd.