Support for negative lookahead groups and recursive capturing.

Question

Support for negative lookahead groups and recursive capturing.

TylerHelmuth opened this issue 2 years ago · 12 comments

I have a situation where my grammar needs to differential between capturing an individual value and capturing values as the same type within a mathematical equation. My struct looks like this:

type value struct {
	Invocation     *invocation     `parser:"( @@"`
	Path           *Path           `parser:"| @@"`
	Float          *float64        `parser:"| @Float"`
	Int            *int64          `parser:"| @Int"`
	MathExpression *mathExpression `parser:"| @@"`
	IsNil          *isNil          `parser:"| @'nil'"`
	Bytes          *byteSlice      `parser:"| @Bytes"`
	String         *string         `parser:"| @String"`
	Bool           *boolean        `parser:"| @Boolean"`
	Enum           *EnumSymbol     `parser:"| @Uppercase"`
	List           *list           `parser:"| @@ )"`
}

Deep down inside the terms and factors of MathExpression it can capture a Invocation, Path, Float, and Int. At the moment if I try to parse something like set(x, 1 + 1), the grammar sees the 1, matches it with @Int, and then doesn't know how to handle the unexpected symbol +.

I can solve this problem for Float and Int by using negative lookahead:

type value struct {
	Invocation     *invocation     `parser:"( @@"`
	Path           *Path           `parser:"| @@"`
	Float          *float64        `parser:"| @Float (?! @OpAddSub | @OpMultDiv)"`
	Int            *int64          `parser:"| @Int (?! @OpAddSub | @OpMultDiv)"`
	MathExpression *mathExpression `parser:"| @@"`
	IsNil          *isNil          `parser:"| @'nil'"`
	Bytes          *byteSlice      `parser:"| @Bytes"`
	String         *string         `parser:"| @String"`
	Bool           *boolean        `parser:"| @Boolean"`
	Enum           *EnumSymbol     `parser:"| @Uppercase"`
	List           *list           `parser:"| @@ )"`
}

Now set(x, 1 + 1) get parsed correctly. But the solution breaks down when I try the same strategy with Path and Invocation.

If I try to parse set(x, y + increment()) without negative lookahead then the Path, y, is matched to Path and the parser doesn't know how to handle the +. But if I add negative lookahead like this:

type value struct {
	Invocation     *invocation     `parser:"( @@ (?! @OpAddSub | @OpMultDiv)"`
	Path           *Path           `parser:"| @@ (?! @OpAddSub | @OpMultDiv)"`
	Float          *float64        `parser:"| @Float (?! @OpAddSub | @OpMultDiv)"`
	Int            *int64          `parser:"| @Int (?! @OpAddSub | @OpMultDiv)"`
	MathExpression *mathExpression `parser:"| @@"`
	IsNil          *isNil          `parser:"| @'nil'"`
	Bytes          *byteSlice      `parser:"| @Bytes"`
	String         *string         `parser:"| @String"`
	Bool           *boolean        `parser:"| @Boolean"`
	Enum           *EnumSymbol     `parser:"| @Uppercase"`
	List           *list           `parser:"| @@ )"`
}

Then I get the following error when building the lexer: Invocation: Arguments: Invocation: structs can only be parsed with @@ or by implementing the Capture or encoding.TextUnmarshaler interfaces

Is there a way to support recursively matching a struct but only when the negative lookahead group is satisfied?

Full grammar can be found here: https://github.com/TylerHelmuth/opentelemetry-collector-contrib/blob/9f074d641c6c6b17b1d4774194ccbfa271f8a0da/pkg/ottl/grammar.go

Answer 1 · 2022-11-01T19:05:14.000Z

Nevermind, I just needed to not include @ in the lookahead group.

Answer 2 · 2022-11-01T19:17:20.000Z

Nevermind.

type value struct {
	Invocation     *invocation     `parser:"( @@ (?! OpAddSub | OpMultDiv)"`
	IsNil          *isNil          `parser:"| @'nil'"`
	Path           *Path           `parser:"| @@ (?! OpAddSub | OpMultDiv)"`
	Float          *float64        `parser:"| @Float (?! OpAddSub | OpMultDiv)"`
	Int            *int64          `parser:"| @Int (?! OpAddSub | OpMultDiv)"`
	MathExpression *mathExpression `parser:"| @@"`
	Bytes          *byteSlice      `parser:"| @Bytes"`
	String         *string         `parser:"| @String"`
	Bool           *boolean        `parser:"| @Boolean"`
	Enum           *EnumSymbol     `parser:"| @Uppercase"`
	List           *list           `parser:"| @@ )"`
}

doesn't panic, but it also doesn't work as expect. Something like one() + 3 still matches Invocation instead of MathExpression.

Interestingly the solution is working for Path. Maybe the issue is that Invocation references zero or more values?

Answer 3 · 2022-11-01T19:44:31.000Z

Can you provide a minimal, working, reproducible case please?

Answer 4 · 2022-11-02T14:59:24.000Z

@alecthomas here is an example:

type A struct {
	B *B `parser:"(@@ (?! '+')"`
	C *C `parser:"| @@)"`
}

type B struct {
	Val    *string `parser:"@Uppercase"`
	LParen *string `parser:"@LParen"`
	RParen *string `parser:"@RParen"`
}

type C struct {
	LHS *B      `parser:"@@"`
	Op  *string `parser:"@'+'"`
	RHS *B      `parser:"@@"`
}

func buildLexer() *lexer.StatefulDefinition {
	return lexer.MustSimple([]lexer.SimpleRule{
		{Name: `String`, Pattern: `"(\\"|[^"])*"`},
		{Name: `OpAdd`, Pattern: `\+`},
		{Name: `LParen`, Pattern: `\(`},
		{Name: `RParen`, Pattern: `\)`},
		{Name: `Uppercase`, Pattern: `[A-Z_][A-Z0-9_]*`},
		{Name: "whitespace", Pattern: `\s+`},
	})
}

parser, _ := participle.Build[A](
    participle.Lexer(buildLexer()),
    participle.Unquote("String"),
    participle.Elide("whitespace"),
)

parsed, err := parser.ParseString("", "ONE() + TWO()")

In this example the negative lookahead will not work and the parser will fail on + after parsing ONE() into A.B. Interestingly, if I remove LParen and RParen from B and parse the string ONE + TWO then the parser correctly parses the result into C.

Answer 5 · 2022-11-02T20:40:53.000Z

Thanks!

This is due to the default lookahead of the parser only being one token. You can solve this with participle.UseLookahead(N). Depending on the complexity of the expression, N might need to be very large.

Answer 6 · 2022-11-02T23:41:13.000Z

@alecthomas I was definitely missing that setting, and that fixed my stripped down example, but it didn't solve my real-life example. I've tweaked my example to reproduce the issue even with UseLookahead. I've pinpointed the issue (I think) to performing multiple negative lookahead with structs and/or something conflicting between the way TestInvocation and TestPath use uppercase and lowercase tokens:

type TestInvocation struct {
	Function  string      `parser:"@(Uppercase | Lowercase)+"`
	Arguments []TestValue `parser:"'(' ( @@ ( ',' @@ )* )? ')'"`
}

type TestPath struct {
	Name string `parser:"@Lowercase ( '.' @Lowercase )*"`
}

type TestValue struct {
	Invocation     *TestInvocation `parser:"( @@ (?! OpAddSub)"`
	Path           *TestPath       `parser:"| @@ (?! OpAddSub)"`
	MathExpression *Expression     `parser:"| @@ )"`
}

type MathValue struct {
	Invocation    *TestInvocation `parser:"( @@"`
	Path          *TestPath       `parser:"| @@"`
	Number        *float64        `parser:"| @Float"`
	Subexpression *Expression     `parser:"| '(' @@ ')' )"`
}

type Expression struct {
	Left     *MathValue `parser:"@@"`
	Operator *string    `parser:"@OpAddSub"`
	Right    *MathValue `parser:"@@"`
}

func Test_issue(t *testing.T) {
	tests := []struct {
		name     string
		input    string
	}{
		{
			name:     "test",
			input:    "one() + two()",
		},
	}

	parser, _ := participle.Build[TestValue](
		participle.Lexer(lexer.MustSimple([]lexer.SimpleRule{
			{Name: `Float`, Pattern: `[-+]?\d*\.\d+([eE][-+]?\d+)?`},
			{Name: `String`, Pattern: `"(\\"|[^"])*"`},
			{Name: `OpAddSub`, Pattern: `\+|\-`},
			{Name: `LParen`, Pattern: `\(`},
			{Name: `RParen`, Pattern: `\)`},
			{Name: `Punct`, Pattern: `[,.\[\]]`},
			{Name: `Uppercase`, Pattern: `[A-Z_][A-Z0-9_]*`},
			{Name: `Lowercase`, Pattern: `[a-z_][a-z0-9_]*`},
			{Name: "whitespace", Pattern: `\s+`},
		})),
		participle.Unquote("String"),
		participle.Elide("whitespace"),
		participle.UseLookahead(-1),
	)

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			var trace strings.Builder
			parsed, err := parser.ParseString("", tt.input, participle.Trace(&trace))
			//fmt.Println(trace.String())
			assert.NoError(t, err)
			assert.NotNil(t, parsed)
		})
	}
}

In this example, which is closer to the complexity of my real grammar, I am unable to parse one() + two() into a MathExpression correctly. Instead, Invocation is skipped correctly (I think) and one is parsed into TestPath and then I get this error:

Error:      	Received unexpected error:
        	            	1:7: unexpected '+'

What is strange is that I don't know what consumers the (). Here is the printout of a trace:

Trace

"ONE" TestValue
  "ONE" group{n}
    "ONE" disjunction{}
      "ONE" sequence{}
        "ONE" capture{}
          "ONE" TestInvocation
            "ONE" sequence{}
              "ONE" capture{}
                "ONE" reference{Uppercase}
              "(" literal{"(", "EOF"}
              ")" group{n?}
                ")" group{n}
                  ")" sequence{}
                    ")" capture{}
                      ")" TestValue
                        ")" group{n}
                          ")" disjunction{}
                            ")" sequence{}
                              ")" capture{}
                                ")" TestInvocation
                                  ")" sequence{}
                                    ")" capture{}
                                      ")" reference{Uppercase}
                            ")" sequence{}
                              ")" capture{}
                                ")" TestPath
                                  ")" sequence{}
                                    ")" capture{}
                                      ")" reference{Lowercase}
                            ")" capture{}
                              ")" Expression
                                ")" sequence{}
                                  ")" capture{}
                                    ")" MathValue
                                      ")" group{n}
                                        ")" disjunction{}
                                          ")" capture{}
                                            ")" TestInvocation
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Uppercase}
                                          ")" capture{}
                                            ")" TestPath
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Lowercase}
                                          ")" capture{}
                                            ")" reference{Float}
                                          ")" sequence{}
                                            ")" literal{"(", "EOF"}
              ")" literal{")", "EOF"}
        "+" lookaheadGroup{}
          "+" reference{OpAddSub}
      "ONE" sequence{}
        "ONE" capture{}
          "ONE" TestPath
            "ONE" sequence{}
              "ONE" capture{}
                "ONE" reference{Lowercase}
      "ONE" capture{}
        "ONE" Expression
          "ONE" sequence{}
            "ONE" capture{}
              "ONE" MathValue
                "ONE" group{n}
                  "ONE" disjunction{}
                    "ONE" capture{}
                      "ONE" TestInvocation
                        "ONE" sequence{}
                          "ONE" capture{}
                            "ONE" reference{Uppercase}
                          "(" literal{"(", "EOF"}
                          ")" group{n?}
                            ")" group{n}
                              ")" sequence{}
                                ")" capture{}
                                  ")" TestValue
                                    ")" group{n}
                                      ")" disjunction{}
                                        ")" sequence{}
                                          ")" capture{}
                                            ")" TestInvocation
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Uppercase}
                                        ")" sequence{}
                                          ")" capture{}
                                            ")" TestPath
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Lowercase}
                                        ")" capture{}
                                          ")" Expression
                                            ")" sequence{}
                                              ")" capture{}
                                                ")" MathValue
                                                  ")" group{n}
                                                    ")" disjunction{}
                                                      ")" capture{}
                                                        ")" TestInvocation
                                                          ")" sequence{}
                                                            ")" capture{}
                                                              ")" reference{Uppercase}
                                                      ")" capture{}
                                                        ")" TestPath
                                                          ")" sequence{}
                                                            ")" capture{}
                                                              ")" reference{Lowercase}
                                                      ")" capture{}
                                                        ")" reference{Float}
                                                      ")" sequence{}
                                                        ")" literal{"(", "EOF"}
                          ")" literal{")", "EOF"}
            "+" capture{}
              "+" reference{OpAddSub}
            "TWO" capture{}
              "TWO" MathValue
                "TWO" group{n}
                  "TWO" disjunction{}
                    "TWO" capture{}
                      "TWO" TestInvocation
                        "TWO" sequence{}
                          "TWO" capture{}
                            "TWO" reference{Uppercase}
                          "(" literal{"(", "EOF"}
                          ")" group{n?}
                            ")" group{n}
                              ")" sequence{}
                                ")" capture{}
                                  ")" TestValue
                                    ")" group{n}
                                      ")" disjunction{}
                                        ")" sequence{}
                                          ")" capture{}
                                            ")" TestInvocation
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Uppercase}
                                        ")" sequence{}
                                          ")" capture{}
                                            ")" TestPath
                                              ")" sequence{}
                                                ")" capture{}
                                                  ")" reference{Lowercase}
                                        ")" capture{}
                                          ")" Expression
                                            ")" sequence{}
                                              ")" capture{}
                                                ")" MathValue
                                                  ")" group{n}
                                                    ")" disjunction{}
                                                      ")" capture{}
                                                        ")" TestInvocation
                                                          ")" sequence{}
                                                            ")" capture{}
                                                              ")" reference{Uppercase}
                                                      ")" capture{}
                                                        ")" TestPath
                                                          ")" sequence{}
                                                            ")" capture{}
                                                              ")" reference{Lowercase}
                                                      ")" capture{}
                                                        ")" reference{Float}
                                                      ")" sequence{}
                                                        ")" literal{"(", "EOF"}
                          ")" literal{")", "EOF"}

Answer 7 · 2022-11-03T00:23:12.000Z

A negative lookahead isn't really the right tool for parsing an expression, if that's what you're trying to do? Take a look at the various "expr" examples in the _examples folder.

Answer 8 · 2022-11-03T01:12:52.000Z

@alecthomas ya I've got the expression built and working as expected in isolation. I've stripped down the example to the root cause; I have added all the expression parsing rules (terms, factors, etc), and it doesn't affect the outcome.

The problem arises when trying to use an expression and individual values. Sometime I want to be able to capture an individual float or Function or variable. Other times I want the parser to recognize that the float or function or variable is the start of an expression.

This would allow the parser to parse a string like functionName(2) and functionName(one()+one()) and capture the 2 into a float and one()+one() into an expression.

This is tricky bc the individual values I want to capture can also show up on the very most left side of an expression.

Answer 9 · 2022-11-03T01:43:01.000Z

I could solve this problem by not allowing individual values and instead always capturing those types in an Expression. I could then check after parsing whether or not an operator was found and make appropriate decisions.

I could also handle this issue by forcing Expressions to be wrapped in ().

If I can avoid either of those options, though, it would be nice. Annoyingly my negative lookahead tactic works as I want it to for float and 1 struct, but not multiple structs.

Answer 10 · 2022-11-03T02:21:43.000Z

Traditionally a value is a leaf node in an expression, and everything just "falls out" of that correctly.

Answer 11 · 2022-11-03T04:21:06.000Z

Here's a full example of a fairly complex expression parser. Ignore the custom parsing functions, they're just to clean up the expression tree a bit.

Answer 12 · 2022-11-03T14:18:28.000Z

I was able to solve the problem with this change

type TestValue struct {
	Literal        *Literal    `parser:"( @@ (?! OpAddSub)"`
	MathExpression *Expression `parser:"| @@ )"`
}

type Literal struct {
	Invocation *TestInvocation `parser:"( @@"`
	Path       *TestPath       `parser:"| @@"`
	Number     *float64        `parser:"| @Float )"`
}

Thanks!