alecthomas/participle

Support for negative lookahead groups and recursive capturing.

TylerHelmuth opened this issue · 12 comments

I have a situation where my grammar needs to differential between capturing an individual value and capturing values as the same type within a mathematical equation. My struct looks like this:

type value struct {
	Invocation     *invocation     `parser:"( @@"`
	Path           *Path           `parser:"| @@"`
	Float          *float64        `parser:"| @Float"`
	Int            *int64          `parser:"| @Int"`
	MathExpression *mathExpression `parser:"| @@"`
	IsNil          *isNil          `parser:"| @'nil'"`
	Bytes          *byteSlice      `parser:"| @Bytes"`
	String         *string         `parser:"| @String"`
	Bool           *boolean        `parser:"| @Boolean"`
	Enum           *EnumSymbol     `parser:"| @Uppercase"`
	List           *list           `parser:"| @@ )"`
}

Deep down inside the terms and factors of MathExpression it can capture a Invocation, Path, Float, and Int. At the moment if I try to parse something like set(x, 1 + 1), the grammar sees the 1, matches it with @Int, and then doesn't know how to handle the unexpected symbol +.

I can solve this problem for Float and Int by using negative lookahead:

type value struct {
	Invocation     *invocation     `parser:"( @@"`
	Path           *Path           `parser:"| @@"`
	Float          *float64        `parser:"| @Float (?! @OpAddSub | @OpMultDiv)"`
	Int            *int64          `parser:"| @Int (?! @OpAddSub | @OpMultDiv)"`
	MathExpression *mathExpression `parser:"| @@"`
	IsNil          *isNil          `parser:"| @'nil'"`
	Bytes          *byteSlice      `parser:"| @Bytes"`
	String         *string         `parser:"| @String"`
	Bool           *boolean        `parser:"| @Boolean"`
	Enum           *EnumSymbol     `parser:"| @Uppercase"`
	List           *list           `parser:"| @@ )"`
}

Now set(x, 1 + 1) get parsed correctly. But the solution breaks down when I try the same strategy with Path and Invocation.

If I try to parse set(x, y + increment()) without negative lookahead then the Path, y, is matched to Path and the parser doesn't know how to handle the +. But if I add negative lookahead like this:

type value struct {
	Invocation     *invocation     `parser:"( @@ (?! @OpAddSub | @OpMultDiv)"`
	Path           *Path           `parser:"| @@ (?! @OpAddSub | @OpMultDiv)"`
	Float          *float64        `parser:"| @Float (?! @OpAddSub | @OpMultDiv)"`
	Int            *int64          `parser:"| @Int (?! @OpAddSub | @OpMultDiv)"`
	MathExpression *mathExpression `parser:"| @@"`
	IsNil          *isNil          `parser:"| @'nil'"`
	Bytes          *byteSlice      `parser:"| @Bytes"`
	String         *string         `parser:"| @String"`
	Bool           *boolean        `parser:"| @Boolean"`
	Enum           *EnumSymbol     `parser:"| @Uppercase"`
	List           *list           `parser:"| @@ )"`
}

Then I get the following error when building the lexer: Invocation: Arguments: Invocation: structs can only be parsed with @@ or by implementing the Capture or encoding.TextUnmarshaler interfaces

Is there a way to support recursively matching a struct but only when the negative lookahead group is satisfied?

Full grammar can be found here: https://github.com/TylerHelmuth/opentelemetry-collector-contrib/blob/9f074d641c6c6b17b1d4774194ccbfa271f8a0da/pkg/ottl/grammar.go

Nevermind, I just needed to not include @ in the lookahead group.

Nevermind.

type value struct {
	Invocation     *invocation     `parser:"( @@ (?! OpAddSub | OpMultDiv)"`
	IsNil          *isNil          `parser:"| @'nil'"`
	Path           *Path           `parser:"| @@ (?! OpAddSub | OpMultDiv)"`
	Float          *float64        `parser:"| @Float (?! OpAddSub | OpMultDiv)"`
	Int            *int64          `parser:"| @Int (?! OpAddSub | OpMultDiv)"`
	MathExpression *mathExpression `parser:"| @@"`
	Bytes          *byteSlice      `parser:"| @Bytes"`
	String         *string         `parser:"| @String"`
	Bool           *boolean        `parser:"| @Boolean"`
	Enum           *EnumSymbol     `parser:"| @Uppercase"`
	List           *list           `parser:"| @@ )"`
}

doesn't panic, but it also doesn't work as expect. Something like one() + 3 still matches Invocation instead of MathExpression.

Interestingly the solution is working for Path. Maybe the issue is that Invocation references zero or more values?

Can you provide a minimal, working, reproducible case please?

@alecthomas here is an example:

type A struct {
	B *B `parser:"(@@ (?! '+')"`
	C *C `parser:"| @@)"`
}

type B struct {
	Val    *string `parser:"@Uppercase"`
	LParen *string `parser:"@LParen"`
	RParen *string `parser:"@RParen"`
}

type C struct {
	LHS *B      `parser:"@@"`
	Op  *string `parser:"@'+'"`
	RHS *B      `parser:"@@"`
}

func buildLexer() *lexer.StatefulDefinition {
	return lexer.MustSimple([]lexer.SimpleRule{
		{Name: `String`, Pattern: `"(\\"|[^"])*"`},
		{Name: `OpAdd`, Pattern: `\+`},
		{Name: `LParen`, Pattern: `\(`},
		{Name: `RParen`, Pattern: `\)`},
		{Name: `Uppercase`, Pattern: `[A-Z_][A-Z0-9_]*`},
		{Name: "whitespace", Pattern: `\s+`},
	})
}

parser, _ := participle.Build[A](
    participle.Lexer(buildLexer()),
    participle.Unquote("String"),
    participle.Elide("whitespace"),
)

parsed, err := parser.ParseString("", "ONE() + TWO()")

In this example the negative lookahead will not work and the parser will fail on + after parsing ONE() into A.B. Interestingly, if I remove LParen and RParen from B and parse the string ONE + TWO then the parser correctly parses the result into C.

Thanks!

This is due to the default lookahead of the parser only being one token. You can solve this with participle.UseLookahead(N). Depending on the complexity of the expression, N might need to be very large.

@alecthomas I was definitely missing that setting, and that fixed my stripped down example, but it didn't solve my real-life example. I've tweaked my example to reproduce the issue even with UseLookahead. I've pinpointed the issue (I think) to performing multiple negative lookahead with structs and/or something conflicting between the way TestInvocation and TestPath use uppercase and lowercase tokens:

type TestInvocation struct {
	Function  string      `parser:"@(Uppercase | Lowercase)+"`
	Arguments []TestValue `parser:"'(' ( @@ ( ',' @@ )* )? ')'"`
}

type TestPath struct {
	Name string `parser:"@Lowercase ( '.' @Lowercase )*"`
}

type TestValue struct {
	Invocation     *TestInvocation `parser:"( @@ (?! OpAddSub)"`
	Path           *TestPath       `parser:"| @@ (?! OpAddSub)"`
	MathExpression *Expression     `parser:"| @@ )"`
}

type MathValue struct {
	Invocation    *TestInvocation `parser:"( @@"`
	Path          *TestPath       `parser:"| @@"`
	Number        *float64        `parser:"| @Float"`
	Subexpression *Expression     `parser:"| '(' @@ ')' )"`
}

type Expression struct {
	Left     *MathValue `parser:"@@"`
	Operator *string    `parser:"@OpAddSub"`
	Right    *MathValue `parser:"@@"`
}

func Test_issue(t *testing.T) {
	tests := []struct {
		name     string
		input    string
	}{
		{
			name:     "test",
			input:    "one() + two()",
		},
	}

	parser, _ := participle.Build[TestValue](
		participle.Lexer(lexer.MustSimple([]lexer.SimpleRule{
			{Name: `Float`, Pattern: `[-+]?\d*\.\d+([eE][-+]?\d+)?`},
			{Name: `String`, Pattern: `"(\\"|[^"])*"`},
			{Name: `OpAddSub`, Pattern: `\+|\-`},
			{Name: `LParen`, Pattern: `\(`},
			{Name: `RParen`, Pattern: `\)`},
			{Name: `Punct`, Pattern: `[,.\[\]]`},
			{Name: `Uppercase`, Pattern: `[A-Z_][A-Z0-9_]*`},
			{Name: `Lowercase`, Pattern: `[a-z_][a-z0-9_]*`},
			{Name: "whitespace", Pattern: `\s+`},
		})),
		participle.Unquote("String"),
		participle.Elide("whitespace"),
		participle.UseLookahead(-1),
	)

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			var trace strings.Builder
			parsed, err := parser.ParseString("", tt.input, participle.Trace(&trace))
			//fmt.Println(trace.String())
			assert.NoError(t, err)
			assert.NotNil(t, parsed)
		})
	}
}

In this example, which is closer to the complexity of my real grammar, I am unable to parse one() + two() into a MathExpression correctly. Instead, Invocation is skipped correctly (I think) and one is parsed into TestPath and then I get this error:

Error:      	Received unexpected error:
        	            	1:7: unexpected '+'

What is strange is that I don't know what consumers the (). Here is the printout of a trace:

Trace

"ONE" TestValue
  "ONE" group{n}
    "ONE" disjunction{}
      "ONE" sequence{}
        "ONE" capture{}
          "ONE" TestInvocation
            "ONE" sequence{}
              "ONE" capture{}
                "ONE" reference{Uppercase}
              "(" literal{"(", "EOF"}
              ")" group{n?}
                ")" group{n}
                  ")" sequence{}
                    ")" capture{}
                      ")" TestValue
                        ")" group{n}
                          ")" disjunction{}
                            ")" sequence{}
                              ")" capture{}
                                ")" TestInvocation
                                  ")" sequence{}
                                    ")" capture{}
                                      ")" reference{Uppercase}
                            ")" sequence{}
                              ")" capture{}
                                ")" TestPath
                                  ")" sequence{}
                                    ")" capture{}
                                      ")" reference{Lowercase}
                            ")" capture{}
                              ")" Expression
                                ")" sequence{}
                                  ")" capture{}
                                    ")" MathValue
                                      ")" group{n}
                                        ")" disjunction{}
                                          ")" capture{}
                                            ")" TestInvocation
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Uppercase}
                                          ")" capture{}
                                            ")" TestPath
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Lowercase}
                                          ")" capture{}
                                            ")" reference{Float}
                                          ")" sequence{}
                                            ")" literal{"(", "EOF"}
              ")" literal{")", "EOF"}
        "+" lookaheadGroup{}
          "+" reference{OpAddSub}
      "ONE" sequence{}
        "ONE" capture{}
          "ONE" TestPath
            "ONE" sequence{}
              "ONE" capture{}
                "ONE" reference{Lowercase}
      "ONE" capture{}
        "ONE" Expression
          "ONE" sequence{}
            "ONE" capture{}
              "ONE" MathValue
                "ONE" group{n}
                  "ONE" disjunction{}
                    "ONE" capture{}
                      "ONE" TestInvocation
                        "ONE" sequence{}
                          "ONE" capture{}
                            "ONE" reference{Uppercase}
                          "(" literal{"(", "EOF"}
                          ")" group{n?}
                            ")" group{n}
                              ")" sequence{}
                                ")" capture{}
                                  ")" TestValue
                                    ")" group{n}
                                      ")" disjunction{}
                                        ")" sequence{}
                                          ")" capture{}
                                            ")" TestInvocation
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Uppercase}
                                        ")" sequence{}
                                          ")" capture{}
                                            ")" TestPath
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Lowercase}
                                        ")" capture{}
                                          ")" Expression
                                            ")" sequence{}
                                              ")" capture{}
                                                ")" MathValue
                                                  ")" group{n}
                                                    ")" disjunction{}
                                                      ")" capture{}
                                                        ")" TestInvocation
                                                          ")" sequence{}
                                                            ")" capture{}
                                                              ")" reference{Uppercase}
                                                      ")" capture{}
                                                        ")" TestPath
                                                          ")" sequence{}
                                                            ")" capture{}
                                                              ")" reference{Lowercase}
                                                      ")" capture{}
                                                        ")" reference{Float}
                                                      ")" sequence{}
                                                        ")" literal{"(", "EOF"}
                          ")" literal{")", "EOF"}
            "+" capture{}
              "+" reference{OpAddSub}
            "TWO" capture{}
              "TWO" MathValue
                "TWO" group{n}
                  "TWO" disjunction{}
                    "TWO" capture{}
                      "TWO" TestInvocation
                        "TWO" sequence{}
                          "TWO" capture{}
                            "TWO" reference{Uppercase}
                          "(" literal{"(", "EOF"}
                          ")" group{n?}
                            ")" group{n}
                              ")" sequence{}
                                ")" capture{}
                                  ")" TestValue
                                    ")" group{n}
                                      ")" disjunction{}
                                        ")" sequence{}
                                          ")" capture{}
                                            ")" TestInvocation
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Uppercase}
                                        ")" sequence{}
                                          ")" capture{}
                                            ")" TestPath
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Lowercase}
                                        ")" capture{}
                                          ")" Expression
                                            ")" sequence{}
                                              ")" capture{}
                                                ")" MathValue
                                                  ")" group{n}
                                                    ")" disjunction{}
                                                      ")" capture{}
                                                        ")" TestInvocation
                                                          ")" sequence{}
                                                            ")" capture{}
                                                              ")" reference{Uppercase}
                                                      ")" capture{}
                                                        ")" TestPath
                                                          ")" sequence{}
                                                            ")" capture{}
                                                              ")" reference{Lowercase}
                                                      ")" capture{}
                                                        ")" reference{Float}
                                                      ")" sequence{}
                                                        ")" literal{"(", "EOF"}
                          ")" literal{")", "EOF"}

A negative lookahead isn't really the right tool for parsing an expression, if that's what you're trying to do? Take a look at the various "expr" examples in the _examples folder.

@alecthomas ya I've got the expression built and working as expected in isolation. I've stripped down the example to the root cause; I have added all the expression parsing rules (terms, factors, etc), and it doesn't affect the outcome.

The problem arises when trying to use an expression and individual values. Sometime I want to be able to capture an individual float or Function or variable. Other times I want the parser to recognize that the float or function or variable is the start of an expression.

This would allow the parser to parse a string like functionName(2) and functionName(one()+one()) and capture the 2 into a float and one()+one() into an expression.

This is tricky bc the individual values I want to capture can also show up on the very most left side of an expression.

I could solve this problem by not allowing individual values and instead always capturing those types in an Expression. I could then check after parsing whether or not an operator was found and make appropriate decisions.

I could also handle this issue by forcing Expressions to be wrapped in ().

If I can avoid either of those options, though, it would be nice. Annoyingly my negative lookahead tactic works as I want it to for float and 1 struct, but not multiple structs.

Traditionally a value is a leaf node in an expression, and everything just "falls out" of that correctly.

Here's a full example of a fairly complex expression parser. Ignore the custom parsing functions, they're just to clean up the expression tree a bit.

I was able to solve the problem with this change

type TestValue struct {
	Literal        *Literal    `parser:"( @@ (?! OpAddSub)"`
	MathExpression *Expression `parser:"| @@ )"`
}

type Literal struct {
	Invocation *TestInvocation `parser:"( @@"`
	Path       *TestPath       `parser:"| @@"`
	Number     *float64        `parser:"| @Float )"`
}

Thanks!