alecthomas/participle

Calculate follow-set automatically from multiple production alternatives

mewmew opened this issue · 31 comments

First of all, thanks a lot for sharing participle with the world! I felt very happy to discover what feels like a very novel approach to parser generation, parser libraries, etc.

I took a stab at rewriting a grammar for LLVM IR to use participle, but reached the following stumbling block and thought I'd reach out and ask whether it is intended by design (to keep the parser simple), whether I've done something wrong, or otherwise whether we could resolve it so that the follow set of a token is calculated from all of the production alternatives present. In this case, the follow set of "target" should be {"datalayout", "triple"}.

I wish to write the grammar as example 2, but have only gotten example 1 to work so far. Any ideas?

Cheers :)
/u

Input

LLVM IR input source:

source_filename = "foo.c"
target datalayout = "bar"
target triple = "baz"

Example 1

Grammar:

type Module struct {
	Decls []*Decl `{ @@ }`
}

type Decl struct {
	SourceFilename string      `  "source_filename" "=" @String`
	TargetSpec     *TargetSpec `| "target" @@`
}

type TargetSpec struct {
	DataLayout   string `  "datalayout" "=" @String`
	TargetTriple string `| "triple" "=" @String`
}

Example run:

u@x220 ~/D/g/s/g/m/low> low a.ll 
&main.Module{
    Decls: {
        &main.Decl{
            SourceFilename: "foo.c",
            TargetSpec:     (*main.TargetSpec)(nil),
        },
        &main.Decl{
            SourceFilename: "",
            TargetSpec:     &main.TargetSpec{DataLayout:"bar", TargetTriple:""},
        },
        &main.Decl{
            SourceFilename: "",
            TargetSpec:     &main.TargetSpec{DataLayout:"", TargetTriple:"baz"},
        },
    },
}

Example 2

Grammar:

type Module struct {
	Decls []*Decl `{ @@ }`
}

type Decl struct {
	SourceFilename string `  "source_filename" "=" @String`
	DataLayout     string `| "target" "datalayout" "=" @String`
	TargetTriple   string `| "target" "triple" "=" @String`
}

Example run:

u@x220 ~/D/g/s/g/m/low> low a.ll
2017/08/27 21:15:38 a.ll:3:7: expected ( "datalayout" ) not "triple"

A potential solution to this problem would be to refine expression.Parse such that it attempts to parse each alternative (i.e. not quit early on parse error), and returns a value if an alternative successfully parses. Otherwise it throws a panic as before.

This could be accomplished with a lexer that can seek to previous positions: expression.Parse would store the current position (measured in tokens) in the input stream before attempting to parse the different alternatives. If an error occurs, the lexer restores the position using seek and attempts to parse another alternative, until the list of alternatives has been exhausted or an alternative successfully parses.
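For illustration, here is a rough sketch of what such a seekable lexer and retrying parse loop could look like (the names and signatures are illustrative only, not Participle's actual API):

type Token struct {
	Type  rune
	Value string
}

// LexSeeker is a lexer that can rewind to a previously recorded position.
type LexSeeker interface {
	Next() Token  // consume and return the next token
	Pos() int     // current position in the input, measured in tokens
	Seek(pos int) // rewind to a previously recorded position
}

// tryAlternatives attempts each production alternative in turn, rewinding the
// lexer after a failed attempt, and returns the first successful result.
func tryAlternatives(lex LexSeeker, alts []func(LexSeeker) (interface{}, error)) (interface{}, error) {
	var lastErr error
	for _, alt := range alts {
		start := lex.Pos()
		v, err := alt(lex)
		if err == nil {
			return v, nil
		}
		lastErr = err
		lex.Seek(start) // restore the position and try the next alternative
	}
	return nil, lastErr
}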

For a very crude implementation of this, see https://github.com/alecthomas/participle/compare/master...mewpull:hack?expand=1#diff-f615844d3497ff38db57e459d6ef657bR314

The intention is to initiate a discussion regarding various approaches to solve this issue in an elegant fashion rather than merging the above hack. Feel free to discuss other alternatives :) One way to go about it would also be to compute the follow-set at each token in each production rule alternative. This approach would require rather heavy computation in advance, and may suffer from shift/reduce conflicts for some grammars.
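To make the follow-set idea a bit more concrete, a tiny illustrative helper (plain Go, not Participle code) could collect, for each token, the set of tokens that may directly follow it across all alternatives; for the Decl grammar of example 2 this would give {"datalayout", "triple"} as the follow set of "target".

// followSets maps each token to the set of tokens that may directly follow it
// in any of the given alternatives (each alternative is a sequence of tokens).
func followSets(alternatives [][]string) map[string]map[string]bool {
	follow := map[string]map[string]bool{}
	for _, alt := range alternatives {
		for i := 0; i+1 < len(alt); i++ {
			if follow[alt[i]] == nil {
				follow[alt[i]] = map[string]bool{}
			}
			follow[alt[i]][alt[i+1]] = true
		}
	}
	return follow
}

// followSets([][]string{
//	{"source_filename", "=", "String"},
//	{"target", "datalayout", "=", "String"},
//	{"target", "triple", "=", "String"},
// })["target"] // => {"datalayout": true, "triple": true}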

From my understanding, it seems like participle uses something similar to PEG to resolve these conflicts, i.e. it always returns the first match? I still have to get better acquainted with the nitty-gritty internals of participle, but so far I've been enjoying reading the code base and feel you've managed to keep the quality high.

Looking forward to discussing this issue further with you.

Cheers,
/u

And for the purpose of illustration, the following lexer implements LexSeeker https://play.golang.org/p/cfSFLdVJ5I which the following example program makes use of https://play.golang.org/p/bLhp-rprOV to parse the original LLVM IR input, using the grammar of example 2.

Example run:

u@x220 ~/D/g/s/g/m/low> low a.ll
participle: ignoring parse error (i=1): <source>:3:7: expected ( "datalayout" ) not "triple"
participle:    tag: | "target" "triple" "=" @String
&main.Module{
    Decls: {
        &main.Decl{SourceFilename:"foo.c", DataLayout:"", TargetTriple:""},
        &main.Decl{SourceFilename:"", DataLayout:"bar", TargetTriple:""},
        &main.Decl{SourceFilename:"", DataLayout:"", TargetTriple:"baz"},
    },
}

Notice the debug message which signals that a parse error has occurred (and been ignored), before a successful parse of an alternative was achieved.

Hi, I'm glad you're finding Participle useful!

The issue you're hitting is (as you've noticed) because Participle's current parser backend is LL(1) and is thus restricted to a single look-ahead token. This is intentional, to reduce implementation complexity, reduce backtracking overhead, and simplify error reporting.

That said, it does restrict what grammars can be parsed and does put more burden on the API consumer, and I have pondered moving to a more flexible parser backend.

For now, you can solve your current problem like so:

type Module struct {
	Decls []*Decl `{ @@ }`
}

type Decl struct {
	SourceFilename string `  "source_filename" "=" @String`
	DataLayout     string `| "target" (  "datalayout" "=" @String`
	TargetTriple   string `            | "triple" "=" @String )`
}

It should be possible to refactor most grammars in this way, though I appreciate it is less than ideal.

IIUC your PR is basically implementing an LL(k) parser, which might be a good middle ground for ease of use, simplicity and power. I'll have a ponder.

I've been thinking about this a bit today. My biggest concern with any kind of backtracking is increasing parse time. To avoid this, I was pondering the following approach.

  1. Add some form of support for lookahead to the lexer; probably some kind of queuing lexer (a rough sketch follows this list).
  2. When building the parser, determine the number of look-ahead tokens needed to disambiguate at each branching node.
  3. When parsing, look ahead the corresponding number of tokens, matching the token types and choosing the appropriate path.
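A minimal sketch of such a queuing lexer (names and signatures are illustrative, not Participle's actual API): it buffers tokens from the underlying lexer so the parser can peek k tokens ahead without consuming them.

type Token struct {
	Type  rune
	Value string
}

// peekingLexer buffers tokens from an underlying token source so that the
// parser can inspect upcoming tokens without consuming them.
type peekingLexer struct {
	source func() Token // underlying lexer
	queue  []Token      // tokens read from the source but not yet consumed
}

// Peek returns the nth unconsumed token, reading from the source as needed.
func (p *peekingLexer) Peek(n int) Token {
	for len(p.queue) <= n {
		p.queue = append(p.queue, p.source())
	}
	return p.queue[n]
}

// Next consumes and returns the next token.
func (p *peekingLexer) Next() Token {
	t := p.Peek(0)
	p.queue = p.queue[1:]
	return t
}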

For example, the following grammar has an ambiguity within Entry; both Attribute and Group accept an Ident as the first token, and are disambiguated by the second token ("=" or "{"). By examining the sequence of terminal nodes from each branch we can determine how many tokens of lookahead we will need at that branch, and what type of tokens. So for Entry we'll need two tokens of lookahead.

type Config struct {
  Entries []*Entry `@@ { @@ }`
}

type Entry struct {
  Attribute *Attribute `@@`
  Group *Group `| @@` 
}

type Attribute struct {
  Key string `@Ident "="`
  Value string `@String`
}

type Group struct {
  Name string `@Ident "{"`
  Entries []*Entry `@@ { @@ } "}"`
}

However, this does not solve the recursive problem of the LLVM grammar (simplified example below). I'm not 100% sure how to solve this, but perhaps one option is to detect the recursion, exclude the recursive branch initially, then reparent it with the recursive branch type if its disambiguating tokens match.

e.g. given the following input: int64 (int32, int32)

  1. Ignore Func field, so we end up with:
    Type{Int: "int64"}
  2. "(" matches the second token of the recursive type, so we reparent the already parsed struct:
    Type{Func: Func{ReturnType: Type{Int: "int64"}}}
  3. Continue parsing the recursive type:
    Type{Func: Func{ReturnType: Type{Int: "int64"}, Args: []*Type{Type{Int: "int32"}, Type{Int: "int32"}}}

type Type struct {
  Func *FuncType `@@`
  Int string `@IntType`
}

type FuncType struct {
  ReturnType *Type `@@ "("`
  Args []*Type `@@ { "," @@ } ")"`
}
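
As a plain-Go illustration of the "parse, then reparent" idea (not Participle code, and hand-written over the Type/FuncType structs above), the hypothetical parseType below parses the non-recursive branch first and wraps it into a FuncType when it sees the recursive branch's disambiguating "(":

// For the tokens ["int64", "(", "int32", ",", "int32", ")"] this produces
// Type{Func: &FuncType{ReturnType: &Type{Int: "int64"}, Args: [int32, int32]}}.
func parseType(toks []string, pos int) (*Type, int) {
	t := &Type{Int: toks[pos]} // 1. parse the non-recursive branch first
	pos++
	// 2. "(" is the disambiguating token of the recursive branch: reparent the
	//    already-parsed Type as the FuncType's return type.
	for pos < len(toks) && toks[pos] == "(" {
		f := &FuncType{ReturnType: t}
		pos++ // consume "("
		for toks[pos] != ")" {
			arg, next := parseType(toks, pos) // 3. continue parsing the recursive type
			f.Args = append(f.Args, arg)
			pos = next
			if toks[pos] == "," {
				pos++ // consume ","
			}
		}
		pos++ // consume ")"
		t = &Type{Func: f}
	}
	return t, pos
}

Because the wrapping happens in a loop, this sketch would also handle a function whose return type is itself a function type.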

It may be that this is too bespoke a solution though, and won't work for the more general case. Thoughts?

For the look-ahead, I like the idea of computing the lookahead needed to disambiguate each branch, and then connecting the lexer and parser so that the parser can fetch the needed number of tokens at each branch.

The solution to recursion does require some thought. Would your solution also work when the return type of a function is itself a function?

I actually found some really nice lecture notes that describe both of these problems (common to all LL(1) parsers) and they have a formal solution to the recursion issue.


Eliminating Left Recursion

Assume we have a non-terminal that is left recursive:

A → Aα
A → β | γ | ... | δ

To eliminate the left recursion, we create two new non-terminals, N and T.
We then rewrite the above productions into:

A → N T
N → β | γ | ... | δ
T → α T | λ

For example,

Expr → Expr + id
Expr → id

becomes

Expr → N T
N → id
T → + id T | λ

This simplifies to:

Expr → id T
T → + id T | λ
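
Expressed in Participle struct tags, that rewritten, right-factored grammar might look roughly like this (field and rule names are illustrative):

// Expr → id T ; T → "+" id T | λ, with the recursion replaced by repetition.
type Expr struct {
	Head string  `@Ident`
	Tail []*Term `{ @@ }`
}

type Term struct {
	Operand string `"+" @Ident`
}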

In effect the solution I outlined above seems very similar, just more complicated due to the nature of Participle.

That's really great! Do you think it will be possible to eliminate left recursion without impacting the user-defined grammar? That is, can Participle be clever enough behind the scenes to do these rewrites while keeping the user grammar intuitive and straightforward? A grammar including the elimination of left recursion tends to get quite intricate (read: ugly), especially when the non-terminal involved in the left recursion is nested in several levels of precedence, e.g. expressions in C.

That is definitely the goal. Ideally in my mind you'd be able to express the grammar without having to do any refactoring, as I think that is one of the (potential) strengths of Participle. I suspect the implementation details will be a bit intricate.

That is definitely the goal. Ideally in my mind you'd be able to express the grammar without having to do any refactoring, as I think that is one of the (potential) strengths of Participle. I suspect the implementation details will be a bit intricate.

Very glad to hear. It is an ambitious goal, for sure. Seeing how fun it has been to play with Participle as a user, I truly hope it will be possible to take this route and to make it support more involved use cases, perhaps even with simpler usage than is offered today.

I was thinking of bringing this up with you before: it would be interesting to investigate whether the grammar in the struct tags could be concatenated before parsing begins, so that whether a user places a token (e.g. "(") on one line or another makes no difference. The only thing that matters is to tie specific parts of the grammar to the associated struct members, i.e. through @. I think this could be possible to tackle in combination with the lookahead per parser branch, although I may be overlooking some aspect. Still, it would be quite lovely if that was the case.

I was thinking of bringing this up with you before: it would be interesting to investigate whether the grammar in the struct tags could be concatenated before parsing begins, so that whether a user places a token (e.g. "(") on one line or another makes no difference. The only thing that matters is to tie specific parts of the grammar to the associated struct members, i.e. through @. I think this could be possible to tackle in combination with the lookahead per parser branch, although I may be overlooking some aspect. Still, it would be quite lovely if that was the case.

This should already be the case. This:

type Entry struct {
  Key string `@String "="`
  Value string `@String`
}

Should be identical to this:

type Entry struct {
  Key string `@String`
  Value string `"=" @String`
}

Are you seeing different behaviour?

This should already be the case.

Glad to hear that it is intended to work like this already. I think I was mostly considering the recursive example you posted, where I imagined that "(" had to be moved in order to have a disambiguating token for where the start of a Func ends, or something along those lines; i.e.

We have:

type FuncType struct {
  ReturnType *Type `@@ "("`
  Args []*Type `@@ { "," @@ } ")"`
}

But want to write:

type FuncType struct {
  ReturnType *Type `@@`
  Args []*Type `"(" @@ { "," @@ } ")"`
}

Where "(" was used for disambiguation.

Similarly, and as discussed before, we have the following example. But for this one, I believe that the lookahead per branch would already resolve the issue, so you can write this in both of the following ways?

Have:

type Decl struct {
	SourceFilename string `  "source_filename" "=" @String`
	DataLayout     string `| "target" (  "datalayout" "=" @String`
	TargetTriple   string `            | "triple" "=" @String )`
}

Want:

type Decl struct {
	SourceFilename string `  "source_filename" "=" @String`
	DataLayout     string `| "target" "datalayout" "=" @String`
	TargetTriple   string `| "target" "triple" "=" @String`
}

Where "target" was factored out to resolve the need for lookahead.

ickby commented

Hello,
may I ask about the status of this issue? I currently ran into the problem with ambiguity in the grammar, where the distinction happens only at the second token. From what I read in this issue, you had some ideas about how to solve it.

Unfortunately I haven't had time to implement the solution, mostly because it is theoretically simple but implementing it is not quite so simple :(

ickby commented

From the commit it seems you have fixed the issue, many thanks! Going to try it over the next few days!

Just tried it with the grammar above

type Module struct {
	Decls []*Decl `{ @@ }`
}

type Decl struct {
	SourceFilename string `  "source_filename" "=" @String`
	DataLayout     string `| "target" "datalayout" "=" @String`
	TargetTriple   string `| "target" "triple" "=" @String`
}

And the input file

source_filename = "foo.c"
target datalayout = "bar"
target triple = "baz"

which results in the following parse error:

2018/03/08 17:16:03 <source>:3:8: expected ( "datalayout" ) not "triple"

@ickby So not quite there yet, though it looks like @alecthomas is definitely moving it in the right direction with the lookahead lexer. @alecthomas how are you doing? Want to bounce some ideas?

Watching with anticipation! Can't wait to do some serious work in participle :)

@mewmew is correct; that commit is necessary, but just adds lookahead to the lexer. The next step is to compute the lookahead tables for each node, then the final step is to use the lookahead tables at each node to make decisions. Hopefully I can get to this in the next couple of weeks. Apologies for how long this is taking, I really want to get this working myself, as it will make Participle much more useful, but life has other plans at the moment.

ickby commented

Take your time, it's open source and it should be fun :)
Glad you are working on this!

@mewmew is correct; that commit is necessary, but just adds lookahead to the lexer. The next step is to compute the lookahead tables for each node, then the final step is to use the lookahead tables at each node to make decisions. Hopefully I can get to this in the next couple of weeks. Apologies for how long this is taking, I really want to get this working myself, as it will make Participle much more useful, but life has other plans at the moment.

Out of curiosity, would it be possible to make the lookahead configurable, so as to have 2 tokens of lookahead, etc.? As participle is doing most of this at runtime, it should even be possible to do this dynamically, disambiguating based on an arbitrary number of lookahead tokens. That would be really cool, as some grammars would express more cleanly without the constraint of a single lookahead token. At the same time, for machine-generated tables produced at parser generation time this would be infeasible, as the tables would be huge. Thus a runtime parser has the unique opportunity to do this without having to generate huge tables ahead of time. Correct me if I'm mistaken?

EDIT: For clarity, I would much prefer Participle to Do the right thing™ rather than having the lookahead be configurable, if it can dynamically make these kinds of decisions.

That's exactly what it will do. The idea is that for each branching node of the grammar it will a) determine how many tokens of lookahead are needed to disambiguate and b) build a table that selects the correct branch based on peeking ahead.
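
As a rough illustration of that per-branch decision (plain Go, not Participle's internals; all names here are made up): each branch records the literal lookahead tokens that select it, and the parser peeks that many tokens ahead to pick a branch.

// branch pairs the lookahead tokens that select an alternative with the
// action that parses it.
type branch struct {
	lookahead []string    // e.g. {"target", "datalayout"} vs {"target", "triple"}
	parse     func() bool // parse the chosen alternative
}

// chooseBranch returns the first branch whose lookahead tokens all match the
// upcoming input; peek(i) returns the value of the i-th unconsumed token.
func chooseBranch(peek func(i int) string, branches []branch) (branch, bool) {
	for _, b := range branches {
		matched := true
		for i, want := range b.lookahead {
			if peek(i) != want {
				matched = false
				break
			}
		}
		if matched {
			return b, true
		}
	}
	return branch{}, false
}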

Good news @mewmew, this passes!

I just need to do some more cleanup and finish up some tests, and I'll push it out. Probably a bit later tonight.

package participle

import (
	"testing"

	"github.com/gotestyourself/gotestyourself/assert"
)

type Module struct {
	Decls []*Decl `{ @@ }`
}

type Decl struct {
	SourceFilename string `  "source_filename" "=" @String`
	DataLayout     string `| "target" "datalayout" "=" @String`
	TargetTriple   string `| "target" "triple" "=" @String`
}

func TestComputeLookahead(t *testing.T) {
	g := &Module{}

	p, err := Build(g, nil)
	assert.NilError(t, err)

	err = p.ParseString(`
		source_filename = "foo.c"
		target datalayout = "bar"
		target triple = "baz"
	`, g)
	assert.NilError(t, err)
	assert.DeepEqual(t,
		g,
		&Module{
			Decls: []*Decl{
				{SourceFilename: "foo.c"},
				{DataLayout: "bar"},
				{TargetTriple: "baz"},
			},
		})
}

That's wonderful to hear @alecthomas! Great work.

I'll stay attuned!

Cheerful regards and happy Walpurgis,
/u

It currently only supports lookahead at disjunction nodes (a | b | c), not at optional ([ a ] b) or repetition ({ a } b) nodes. This needs to be fixed, but I think I'll push it out anyway as it is already an improvement.

There are still some kinks that need to be debugged, but overall I'm pretty happy with the change.

During the course of the change, it exposed an issue with automatically stripping quoted strings in the lexer, the fix for which ended up being backwards incompatible.

I've added all the examples from this issue as tests, and they all pass. If anyone wants to try this branch out it would be greatly appreciated.

Remaining issues:

  • One of the tests doesn't pass.
  • The SQL example is broken.
  • Optionals and repetitions do not use lookahead, but should.

Note that I don't have a good solution to recursive grammars yet, if ever, but at least they are reported as errors rather than crashing.

Hi Alec!
Once again, thanks for working on this! It definitely makes participle more pleasant to use.
I've been trying to convert my grammar for LLVM IR to use participle. For simple cases it works fairly well; however, there are recursive structures (e.g. Type referring to Type, such as for pointer types). How would you suggest handling those cases?

Contrived example from:
https://gist.github.com/mewmew/a2487392d5519ef49658fd8f84d9eed5#file-ll-bnf-L820

Type
	: IntType
	| PointerType
;

IntType
	: int_type // i32
;

PointerType
	: Type "*" // i32**
;

Another question I have stumbled upon is how to extend the default lexer with further lexemes. I want to use the default lexer's String token, as that has a non-trivial regexp (or EBNF). At the same time, I want to define LocalIdent (e.g. %32 or %foo). Is there a canonical way to do so? I know how to filter unneeded lexemes (e.g. through Elide).

Cheers,
/u

I've been trying to convert my grammar for LLVM IR to use participle. For simple cases it works fairly well; however, there are recursive structures (e.g. Type referring to Type, such as for pointer types). How would you suggest handling those cases?

Yes, as my last comment (which you may have missed) mentions, recursive grammars of that form are not supported (yet). You'll have to refactor the grammar.

I'm still working on making something like the following usable, which should be possible:

type Type struct {
	Type     string    `  @("int" | "float" | "string")`
	Function *FuncType `| @@`
}

type FuncType struct {
	Return *Type   `@@`
	Args   []*Type `"(" @@ { "," @@ } ")"`
}

Another question I have stumbled upon is how to extend the default lexer with further lexemes. I want to use the default lexer's String token, as that has a non-trivial regexp (or EBNF). At the same time, I want to define LocalIdent (e.g. %32 or %foo). Is there a canonical way to do so? I know how to filter unneeded lexemes (e.g. through Elide).

The default lexer isn't really extensible as it's based on text/scanner, which is inflexible but fast. The upside is speed, the downside is lack of flexibility.

The easiest solution is to use the regex lexer, but alternatively you could provide your own implementations of the lexer.Definition and lexer.Lexer interfaces.
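
For instance, a regex-based lexer for this case might look roughly like the sketch below. The lexer.Regexp/lexer.Must helpers and the two-argument Build call match the API used elsewhere in this thread, but treat the exact signatures as assumptions; the character classes for LocalIdent and Ident are rough guesses rather than LLVM's actual identifier rules, and the unnamed whitespace group relies on the regex lexer discarding unnamed groups.

import (
	"github.com/alecthomas/participle/lexer"
)

// A sketch of a regex-based lexer that keeps quoted String literals and adds
// a LocalIdent token for identifiers such as %32 or %foo.
var llvmLexer = lexer.Must(lexer.Regexp(`(\s+)` +
	`|(?P<LocalIdent>%[a-zA-Z$._\d]+)` +
	`|(?P<Ident>[a-zA-Z$._][a-zA-Z$._\d]*)` +
	`|(?P<String>"(?:\\.|[^"])*")` +
	`|(?P<Punct>[=*(){},])`,
))

// It would then be passed to Build in place of the default lexer, e.g.
//   parser, err := participle.Build(&Module{}, llvmLexer)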

I've been trying to convert my grammar for LLVM IR to use participle. For simple cases it works fairly well; however, there are recursive structures (e.g. Type referring to Type, such as for pointer types). How would you suggest handling those cases?

I'm working on this now. I can see what's happening, and it should be solvable within the lookahead system, which is good news!

I'm working on this now. I can see what's happening, and it should be solvable within the lookahead system, which is good news!

That's great to hear! I'll keep tracking your work with anticipation.

The default lexer isn't really extensible as it's based on text/scanner, which is inflexible but fast. The upside is speed, the downside is lack of flexibility.

The easiest solution is to use the regex lexer, but alternatively you could provide your own implementations of the lexer.Definition and lexer.Lexer interfaces.

Ok, thanks for the info. Do you have an example of a regexp grammar that extends the default lexer but still supports quoted string literals? Otherwise, I'll write my own.

Ok, thanks for the info. Do you have an example of a regexp grammar that extends the default lexer but still supports quoted string literals? Otherwise, I'll write my own.

The SQL example has a custom regex based lexer that might get you started.

Thanks ;)

Lookahead is now much more robust. This doesn't solve left-recursion, but that is a different issue.