lezer-parser/lezer

Token parsing behavior change in generator between 1.2.0 and 1.2.1

r3c opened this issue · 2 comments

r3c commented

Hello!

I'm trying to bump @lezer/generator from 1.2.0 to the latest version, 1.2.2, and am hitting a regression in our unit tests that seems to be caused by a behavior change introduced in 1.2.1.

Here is a reproduction grammar:

@top Root {
  ConflictingToken |
  SymbolToken
}

@tokens {
  @precedence {
    ConflictingToken,
    SymbolToken
  }

  ConflictingToken {
    'conflict'
  }

  SymbolToken {
    $[a-zA-Z_]+
  }
}

  • When parsing the input string "conf", the parser emits a "SymbolToken" on both versions, as expected.
  • When parsing the input string "conflict", the parser emits a "ConflictingToken" on both versions, as expected.
  • However, the behavior differs when parsing the input string "conflicting":
    • Version 1.2.0 emits a single "SymbolToken", which is the behavior we currently rely on.
    • Version 1.2.1 emits a "ConflictingToken" followed by a "SymbolToken" (matching the trailing "ing" characters).

I'm not sure the latter behavior is intentional, since it interferes with parsing most language keywords. Inverting the precedence of the two rules won't work either, since the "conf", "conflict" and "conflicting" inputs would then all be matched as "SymbolToken". I wonder whether the change was introduced in lezer-parser/generator@b38d018; would you mind sharing your thoughts about this?

Regards,
Rémi

marijnh commented

You appear to have been relying on a bug in the way precedences were applied. Since you explicitly say ConflictingToken has higher precedence than SymbolToken, the new behavior is what the system is supposed to do.

It is almost always preferable to use @specialize to recognize keywords, rather than including them as separate tokens.
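
For illustration, here is a minimal sketch of how the reproduction grammar might look when rewritten along these lines. It is an assumption about how the advice applies to this grammar, not something tested against either generator version: the rule names are kept from the original report, and ConflictingToken becomes a regular rule wrapping a specialized SymbolToken instead of a token of its own.

```
@top Root {
  ConflictingToken |
  SymbolToken
}

// Hypothetical rewrite: ConflictingToken is no longer declared in @tokens.
// The tokenizer reads a whole SymbolToken first, and the result is only
// replaced by the specialized token when its full text is exactly "conflict".
ConflictingToken {
  @specialize<SymbolToken, "conflict">
}

@tokens {
  SymbolToken {
    $[a-zA-Z_]+
  }
}
```

Because specialization is applied to the complete token, "conf" and "conflicting" should both remain a SymbolToken, while "conflict" should be wrapped in a ConflictingToken node, and no @precedence declaration is needed.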

r3c commented

Hey @marijnh, you were quite right: our grammar file was misusing precedence to resolve overlapping symbol issues. Thanks a lot for pointing that out!