lezer-parser/lezer

Running lezer-generator on a complex grammar results in an error "Goto table too large"

poeschko opened this issue · 6 comments

We have a real-world grammar for the Wolfram Language, which is fairly complex by nature. Running it through lezer-generator fails with the error "Goto table too large" (exit code 1). The grammar file is attached. (It relies on a custom tokenizer, but I don't think that's needed to understand or reproduce this issue.)

What I know so far:

  • This used to work with lezer-generator v0.8.5 (after processing the grammar for ~90 seconds) but fails now with v0.13.1.
  • If I remove handling of "Unicode operators" (commenting out | unicodeOperation in the definition of expr, roughly as sketched below), the grammar generates fine.
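
For illustration, the relevant rule has roughly this shape (the names here are simplified placeholders; the real definitions are in the attached file):

expr {
  symbol |
  binaryOperation
  // | unicodeOperation  (commenting out this alternative makes generation succeed)
}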

These Unicode operations are rules of the form

Unicode_XXX { expr !precXXX UnicodeOp<unicodeXXX> expr }

with extra "wrappers"

UnicodeOp<token> { Op<token> }
Op<term> { term }

defined for the corresponding Unicode tokens (these wrappers help with the "post-processing" of the generated parse tree). Perhaps that just adds too much complexity?
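
For a concrete instance (with a hypothetical operator name substituted for XXX), one such rule expands through the wrappers like this:

Unicode_Circle { expr !precCircle UnicodeOp<unicodeCircle> expr }
// UnicodeOp<unicodeCircle> and Op<unicodeCircle> are instantiated as their
// own terms, so every operator of this kind adds several terms to the grammar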

I just wanted to open this issue as discussed previously on lezer-parser/generator#2 (comment). I don't really expect anyone to solve the whole problem for us, but wanted to ask what would be useful next steps...

Would it help to narrow down the specific version (> 0.8.5, <= 0.13.1) where this "broke"?

Should we try to simplify the grammar (e.g. by removing the nesting of UnicodeOp / Op), assuming this limit is expected?

Or would it make sense to work towards supporting larger "goto tables"? I could try to open a pull request increasing the "pointer size" from 16 to 32 bits, if that makes sense.

Thanks!

wl.txt

It would definitely be possible to add a 32-bit goto-table mode to the library, I guess, though I haven't thought through how many other things that would complicate. (And at some point the question is whether you really want to ship parse tables of that size to clients.)

A precise version where this started failing might be interesting—if the tables were a lot smaller before, it's possible that the grammar is triggering some kind of regression that causes the parser generator to emit more states than it needs to. (Though some bugs were fixed where it merged states that couldn't really be merged in this version range, so maybe this is just a symptom of those bugfixes.)

The grammar can still be generated with lezer-generator 0.10.5 but fails to generate with 0.11.0. So I guess some state-collapsing bugfix in 0.11.0 introduced the size issue.

The generated JS file (generated by 0.10 or 0.8) is 290 KB uncompressed and 54 KB compressed. That's on the edge, but still very useful and reasonable.

Looking at the grammar, I think I would indeed recommend collapsing all the different unicode operations into one token, and just looking up the content of the token in the document when post-processing. Having that many different tokens will blow up the goto table to an unreasonable degree (they will have to be listed in every state where they are valid).
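
A minimal sketch of that suggestion, assuming the operators can be matched by a single token (names are illustrative; the custom tokenizer the grammar already uses could equally produce this one token), with UnicodeOperation added as one alternative of expr:

UnicodeOperation { expr !unicodeOp UnicodeOperator expr }

@precedence { unicodeOp @left }

@tokens {
  // one token class covering every Unicode operator character
  UnicodeOperator { $[±⊕⊗] } // extend the set as needed
}

Post-processing can then recover the concrete operator by slicing the input between the token node's from and to offsets. The trade-off is that all the collapsed operators now share a single precedence level; any operators that genuinely need distinct precedences would have to keep their own tokens.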

There is some goto table compression done in the generator, but when I try to run the grammar you provided through lezer-generator my node process runs out of memory and dies, which has made it difficult to investigate what is going on there.

In a simpler grammar with a similar large blob of tokens that appear in an expression rule, goto table compression seems to be working fine (the part of the table that associates the tokens with reduce actions is reused between states). But there may be some complication happening in your more complex grammar that my test case doesn't trigger.
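
For reference, a test grammar of the shape described (not the actual test case) can be as small as:

@top Program { expr }

@precedence { bin @left }

expr { Atom | BinOp }

// many operator tokens, all valid in the same expression position
BinOp { expr !bin (op1 | op2 | op3) expr }

@tokens {
  Atom { $[a-z]+ }
  op1 { "⊕" }
  op2 { "⊗" }
  op3 { "⊖" }
}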

@marijnh I can reproduce the "Goto table too large" error.
I printed some numbers and found this:

Number of terms (buildFullAutomaton input): ~620
Number of states (buildFullAutomaton output): ~9000
Goto index table size: ~100k (Goto table too large)

Is this massive growth of the goto index table expected?

I also saw that version 0.10.5 behaved similarly:

Number of terms (buildFullAutomaton input): ~620
Number of states (buildFullAutomaton output): ~9000
Goto index table size: ~37k (barely made it!)

Is there a specific grammar pattern that should be avoided?

I was simplifying the grammar and found that template tokens make the goto table combinations explode.
I removed all templates of the form:

MyTemp<token> {
  token
}

The number of terms was cut roughly in half: ~900 -> 410
The size of the goto table was reduced by a factor of 10: 100k -> 11k
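
Concretely, each use of such a wrapper was replaced by the wrapped token itself, e.g. (hypothetical names):

// before: every MyTemp<...> instantiation becomes its own term
MyTemp<token> { token }
operators { MyTemp<opA> | MyTemp<opB> }

// after: reference the tokens directly
operators { opA | opB }

Since each distinct instantiation of a template is a separate term in the generated parser, dropping the trivial wrappers roughly halves the term count, which matches the numbers above.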