Support for negative lookahead groups and recursive capturing.
TylerHelmuth opened this issue · 12 comments
I have a situation where my grammar needs to differential between capturing an individual value and capturing values as the same type within a mathematical equation. My struct looks like this:
type value struct {
Invocation *invocation `parser:"( @@"`
Path *Path `parser:"| @@"`
Float *float64 `parser:"| @Float"`
Int *int64 `parser:"| @Int"`
MathExpression *mathExpression `parser:"| @@"`
IsNil *isNil `parser:"| @'nil'"`
Bytes *byteSlice `parser:"| @Bytes"`
String *string `parser:"| @String"`
Bool *boolean `parser:"| @Boolean"`
Enum *EnumSymbol `parser:"| @Uppercase"`
List *list `parser:"| @@ )"`
}
Deep down inside the terms and factors of MathExpression
it can capture a Invocation
, Path
, Float
, and Int
. At the moment if I try to parse something like set(x, 1 + 1)
, the grammar sees the 1
, matches it with @Int
, and then doesn't know how to handle the unexpected symbol +
.
I can solve this problem for Float
and Int
by using negative lookahead:
type value struct {
Invocation *invocation `parser:"( @@"`
Path *Path `parser:"| @@"`
Float *float64 `parser:"| @Float (?! @OpAddSub | @OpMultDiv)"`
Int *int64 `parser:"| @Int (?! @OpAddSub | @OpMultDiv)"`
MathExpression *mathExpression `parser:"| @@"`
IsNil *isNil `parser:"| @'nil'"`
Bytes *byteSlice `parser:"| @Bytes"`
String *string `parser:"| @String"`
Bool *boolean `parser:"| @Boolean"`
Enum *EnumSymbol `parser:"| @Uppercase"`
List *list `parser:"| @@ )"`
}
Now set(x, 1 + 1)
get parsed correctly. But the solution breaks down when I try the same strategy with Path
and Invocation
.
If I try to parse set(x, y + increment())
without negative lookahead then the Path, y
, is matched to Path
and the parser doesn't know how to handle the +
. But if I add negative lookahead like this:
type value struct {
Invocation *invocation `parser:"( @@ (?! @OpAddSub | @OpMultDiv)"`
Path *Path `parser:"| @@ (?! @OpAddSub | @OpMultDiv)"`
Float *float64 `parser:"| @Float (?! @OpAddSub | @OpMultDiv)"`
Int *int64 `parser:"| @Int (?! @OpAddSub | @OpMultDiv)"`
MathExpression *mathExpression `parser:"| @@"`
IsNil *isNil `parser:"| @'nil'"`
Bytes *byteSlice `parser:"| @Bytes"`
String *string `parser:"| @String"`
Bool *boolean `parser:"| @Boolean"`
Enum *EnumSymbol `parser:"| @Uppercase"`
List *list `parser:"| @@ )"`
}
Then I get the following error when building the lexer: Invocation: Arguments: Invocation: structs can only be parsed with @@ or by implementing the Capture or encoding.TextUnmarshaler interfaces
Is there a way to support recursively matching a struct but only when the negative lookahead group is satisfied?
Full grammar can be found here: https://github.com/TylerHelmuth/opentelemetry-collector-contrib/blob/9f074d641c6c6b17b1d4774194ccbfa271f8a0da/pkg/ottl/grammar.go
Nevermind, I just needed to not include @
in the lookahead group.
Nevermind.
type value struct {
Invocation *invocation `parser:"( @@ (?! OpAddSub | OpMultDiv)"`
IsNil *isNil `parser:"| @'nil'"`
Path *Path `parser:"| @@ (?! OpAddSub | OpMultDiv)"`
Float *float64 `parser:"| @Float (?! OpAddSub | OpMultDiv)"`
Int *int64 `parser:"| @Int (?! OpAddSub | OpMultDiv)"`
MathExpression *mathExpression `parser:"| @@"`
Bytes *byteSlice `parser:"| @Bytes"`
String *string `parser:"| @String"`
Bool *boolean `parser:"| @Boolean"`
Enum *EnumSymbol `parser:"| @Uppercase"`
List *list `parser:"| @@ )"`
}
doesn't panic, but it also doesn't work as expect. Something like one() + 3
still matches Invocation
instead of MathExpression
.
Interestingly the solution is working for Path
. Maybe the issue is that Invocation
references zero or more values
?
Can you provide a minimal, working, reproducible case please?
@alecthomas here is an example:
type A struct {
B *B `parser:"(@@ (?! '+')"`
C *C `parser:"| @@)"`
}
type B struct {
Val *string `parser:"@Uppercase"`
LParen *string `parser:"@LParen"`
RParen *string `parser:"@RParen"`
}
type C struct {
LHS *B `parser:"@@"`
Op *string `parser:"@'+'"`
RHS *B `parser:"@@"`
}
func buildLexer() *lexer.StatefulDefinition {
return lexer.MustSimple([]lexer.SimpleRule{
{Name: `String`, Pattern: `"(\\"|[^"])*"`},
{Name: `OpAdd`, Pattern: `\+`},
{Name: `LParen`, Pattern: `\(`},
{Name: `RParen`, Pattern: `\)`},
{Name: `Uppercase`, Pattern: `[A-Z_][A-Z0-9_]*`},
{Name: "whitespace", Pattern: `\s+`},
})
}
parser, _ := participle.Build[A](
participle.Lexer(buildLexer()),
participle.Unquote("String"),
participle.Elide("whitespace"),
)
parsed, err := parser.ParseString("", "ONE() + TWO()")
In this example the negative lookahead will not work and the parser will fail on +
after parsing ONE()
into A.B
. Interestingly, if I remove LParen
and RParen
from B
and parse the string ONE + TWO
then the parser correctly parses the result into C
.
Thanks!
This is due to the default lookahead of the parser only being one token. You can solve this with participle.UseLookahead(N)
. Depending on the complexity of the expression, N might need to be very large.
@alecthomas I was definitely missing that setting, and that fixed my stripped down example, but it didn't solve my real-life example. I've tweaked my example to reproduce the issue even with UseLookahead
. I've pinpointed the issue (I think) to performing multiple negative lookahead with structs and/or something conflicting between the way TestInvocation
and TestPath
use uppercase and lowercase tokens:
type TestInvocation struct {
Function string `parser:"@(Uppercase | Lowercase)+"`
Arguments []TestValue `parser:"'(' ( @@ ( ',' @@ )* )? ')'"`
}
type TestPath struct {
Name string `parser:"@Lowercase ( '.' @Lowercase )*"`
}
type TestValue struct {
Invocation *TestInvocation `parser:"( @@ (?! OpAddSub)"`
Path *TestPath `parser:"| @@ (?! OpAddSub)"`
MathExpression *Expression `parser:"| @@ )"`
}
type MathValue struct {
Invocation *TestInvocation `parser:"( @@"`
Path *TestPath `parser:"| @@"`
Number *float64 `parser:"| @Float"`
Subexpression *Expression `parser:"| '(' @@ ')' )"`
}
type Expression struct {
Left *MathValue `parser:"@@"`
Operator *string `parser:"@OpAddSub"`
Right *MathValue `parser:"@@"`
}
func Test_issue(t *testing.T) {
tests := []struct {
name string
input string
}{
{
name: "test",
input: "one() + two()",
},
}
parser, _ := participle.Build[TestValue](
participle.Lexer(lexer.MustSimple([]lexer.SimpleRule{
{Name: `Float`, Pattern: `[-+]?\d*\.\d+([eE][-+]?\d+)?`},
{Name: `String`, Pattern: `"(\\"|[^"])*"`},
{Name: `OpAddSub`, Pattern: `\+|\-`},
{Name: `LParen`, Pattern: `\(`},
{Name: `RParen`, Pattern: `\)`},
{Name: `Punct`, Pattern: `[,.\[\]]`},
{Name: `Uppercase`, Pattern: `[A-Z_][A-Z0-9_]*`},
{Name: `Lowercase`, Pattern: `[a-z_][a-z0-9_]*`},
{Name: "whitespace", Pattern: `\s+`},
})),
participle.Unquote("String"),
participle.Elide("whitespace"),
participle.UseLookahead(-1),
)
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
var trace strings.Builder
parsed, err := parser.ParseString("", tt.input, participle.Trace(&trace))
//fmt.Println(trace.String())
assert.NoError(t, err)
assert.NotNil(t, parsed)
})
}
}
In this example, which is closer to the complexity of my real grammar, I am unable to parse one() + two()
into a MathExpression
correctly. Instead, Invocation
is skipped correctly (I think) and one
is parsed into TestPath
and then I get this error:
Error: Received unexpected error:
1:7: unexpected '+'
What is strange is that I don't know what consumers the ()
. Here is the printout of a trace:
Trace
"ONE" TestValue
"ONE" group{n}
"ONE" disjunction{}
"ONE" sequence{}
"ONE" capture{}
"ONE" TestInvocation
"ONE" sequence{}
"ONE" capture{}
"ONE" reference{Uppercase}
"(" literal{"(", "EOF"}
")" group{n?}
")" group{n}
")" sequence{}
")" capture{}
")" TestValue
")" group{n}
")" disjunction{}
")" sequence{}
")" capture{}
")" TestInvocation
")" sequence{}
")" capture{}
")" reference{Uppercase}
")" sequence{}
")" capture{}
")" TestPath
")" sequence{}
")" capture{}
")" reference{Lowercase}
")" capture{}
")" Expression
")" sequence{}
")" capture{}
")" MathValue
")" group{n}
")" disjunction{}
")" capture{}
")" TestInvocation
")" sequence{}
")" capture{}
")" reference{Uppercase}
")" capture{}
")" TestPath
")" sequence{}
")" capture{}
")" reference{Lowercase}
")" capture{}
")" reference{Float}
")" sequence{}
")" literal{"(", "EOF"}
")" literal{")", "EOF"}
"+" lookaheadGroup{}
"+" reference{OpAddSub}
"ONE" sequence{}
"ONE" capture{}
"ONE" TestPath
"ONE" sequence{}
"ONE" capture{}
"ONE" reference{Lowercase}
"ONE" capture{}
"ONE" Expression
"ONE" sequence{}
"ONE" capture{}
"ONE" MathValue
"ONE" group{n}
"ONE" disjunction{}
"ONE" capture{}
"ONE" TestInvocation
"ONE" sequence{}
"ONE" capture{}
"ONE" reference{Uppercase}
"(" literal{"(", "EOF"}
")" group{n?}
")" group{n}
")" sequence{}
")" capture{}
")" TestValue
")" group{n}
")" disjunction{}
")" sequence{}
")" capture{}
")" TestInvocation
")" sequence{}
")" capture{}
")" reference{Uppercase}
")" sequence{}
")" capture{}
")" TestPath
")" sequence{}
")" capture{}
")" reference{Lowercase}
")" capture{}
")" Expression
")" sequence{}
")" capture{}
")" MathValue
")" group{n}
")" disjunction{}
")" capture{}
")" TestInvocation
")" sequence{}
")" capture{}
")" reference{Uppercase}
")" capture{}
")" TestPath
")" sequence{}
")" capture{}
")" reference{Lowercase}
")" capture{}
")" reference{Float}
")" sequence{}
")" literal{"(", "EOF"}
")" literal{")", "EOF"}
"+" capture{}
"+" reference{OpAddSub}
"TWO" capture{}
"TWO" MathValue
"TWO" group{n}
"TWO" disjunction{}
"TWO" capture{}
"TWO" TestInvocation
"TWO" sequence{}
"TWO" capture{}
"TWO" reference{Uppercase}
"(" literal{"(", "EOF"}
")" group{n?}
")" group{n}
")" sequence{}
")" capture{}
")" TestValue
")" group{n}
")" disjunction{}
")" sequence{}
")" capture{}
")" TestInvocation
")" sequence{}
")" capture{}
")" reference{Uppercase}
")" sequence{}
")" capture{}
")" TestPath
")" sequence{}
")" capture{}
")" reference{Lowercase}
")" capture{}
")" Expression
")" sequence{}
")" capture{}
")" MathValue
")" group{n}
")" disjunction{}
")" capture{}
")" TestInvocation
")" sequence{}
")" capture{}
")" reference{Uppercase}
")" capture{}
")" TestPath
")" sequence{}
")" capture{}
")" reference{Lowercase}
")" capture{}
")" reference{Float}
")" sequence{}
")" literal{"(", "EOF"}
")" literal{")", "EOF"}
A negative lookahead isn't really the right tool for parsing an expression, if that's what you're trying to do? Take a look at the various "expr" examples in the _examples
folder.
@alecthomas ya I've got the expression built and working as expected in isolation. I've stripped down the example to the root cause; I have added all the expression parsing rules (terms, factors, etc), and it doesn't affect the outcome.
The problem arises when trying to use an expression and individual values. Sometime I want to be able to capture an individual float or Function or variable. Other times I want the parser to recognize that the float or function or variable is the start of an expression.
This would allow the parser to parse a string like functionName(2)
and functionName(one()+one())
and capture the 2
into a float and one()+one()
into an expression.
This is tricky bc the individual values I want to capture can also show up on the very most left side of an expression.
I could solve this problem by not allowing individual values and instead always capturing those types in an Expression. I could then check after parsing whether or not an operator was found and make appropriate decisions.
I could also handle this issue by forcing Expressions to be wrapped in ()
.
If I can avoid either of those options, though, it would be nice. Annoyingly my negative lookahead tactic works as I want it to for float and 1 struct, but not multiple structs.
Traditionally a value is a leaf node in an expression, and everything just "falls out" of that correctly.
Here's a full example of a fairly complex expression parser. Ignore the custom parsing functions, they're just to clean up the expression tree a bit.
I was able to solve the problem with this change
type TestValue struct {
Literal *Literal `parser:"( @@ (?! OpAddSub)"`
MathExpression *Expression `parser:"| @@ )"`
}
type Literal struct {
Invocation *TestInvocation `parser:"( @@"`
Path *TestPath `parser:"| @@"`
Number *float64 `parser:"| @Float )"`
}
Thanks!