lezer-parser/lezer

Bug: parse fails when text contains many "skip" nodes

JuneKelly opened this issue · 13 comments

This bug is quite hard to summarize, so I've written up a full explanation (and a minimal reproduction) here: https://github.com/JuneKelly/cm6-fold-bug-demo


The Problem

At Overleaf, while working on a lezer parser for LaTeX, we stumbled on an apparent bug in the interaction of lezer with code-mirror. This bug manifests when we have a foldable region of the document, which contains a lot of "skip" nodes, such as Comment or Whitespace. In this case, when we click on the fold marker in the gutter, the fold only covers a small region of text, and seems to end at an erroneous syntax error.

Parsing the same document, with the same lezer parser, from a node process (feeding it the entire document as a string), does not yield any parse errors. The resulting tree is correct, with no error nodes. Thus, there seems to be a difference in behaviour between those two environments, depending on whether the parser is fed by a single string, or by (presumably) the incremental parsing machinery in CodeMirror.

Here is a demonstration in gif format:

fold-problem-demo

To re-iterate, there seems to be an underlying problem in either lezer, or codemirror (or the interaction of the two), and this problem manifests as a problem with code folding.

For example, say we have a function body containing lots of comments:

foo() {
  x = 1;
  // bar
  // baz
  // quux
  <and so on for a few hundred lines>
  return true;
}

Trying to fold the foo function would not fold the entire function body. It would instead stop at an error marker at some arbitrary point in the middle of the function.

(Note: in our specific case, this problem manifested with large LaTeX environments, like \begin{document}...\end{document}, containing lots of comment lines, which is fairly normal for realistic LaTeX code. However, the bug is evident in other similar constructs like brace-delimited blocks, or xml tags, and so on.)

Hypothesis

It seems that the parser is getting tripped up on having so many "skip" nodes, and giving up with an incomplete parse tree. The foldInside prop then cannot accurately fold the contents of the function body, leading to the observed behaviour.

(More details in the reproduction repository: https://github.com/JuneKelly/cm6-fold-bug-demo

What does syntaxTree(view.state).toString() look like when this occurs? Could it be that the editor has only parsed part of the content at this point? There's a heuristic in the folding code to not try to fold nodes that are unfinished, but that may be going wrong in your case somehow.

Thanks! I've added a window.__printTree() helper to the reproduction code.

When printed, we get this:

Program(Comment,Comment,Comment,Comment,Whitespace,Block(Word,Whitespace,OpenBrace,Whitespace,
BlockContent(Word,Whitespace,Word,Whitespace,Word,Whitespace,Word,Whitespace,Word,Whitespace,
Word,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,
Comment,Whitespace,Word,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,
Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,
Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,
Comment,Whitespace,Comment,Whitespace,Word),⚠),Whitespace,Comment,Whitespace,Comment,Whitespace,
Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,
Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,
Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Block(Word,Whitespace,Comment,Whitespace,
Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,
Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,
Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,
Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,Comment,Whitespace,⚠,
BlockContent(Word,Whitespace,Word),Whitespace,CloseBrace),Whitespace,
Block(Word,Whitespace,OpenBrace,Whitespace,BlockContent(Word,Whitespace,Word),
Whitespace,CloseBrace),Whitespace)

Or, in a cleaned up form:

Program(
  Comment,Comment,Comment,Comment,Whitespace,
  Block(
    Word,Whitespace,OpenBrace,Whitespace,
    BlockContent(Word,Whitespace,...,Comment,Whitespace,Word),
    ⚠
  ),
  Whitespace,Comment,Whitespace,Comment,Whitespace,
  ...,Comment,Whitespace,
  Block(
    Word,Whitespace,Comment,Whitespace,
    ...,Whitespace,Comment,Whitespace,
    ⚠,
    BlockContent(Word,Whitespace,Word),Whitespace,CloseBrace
  ),Whitespace,
  Block(
    Word,Whitespace,OpenBrace,Whitespace,
    BlockContent(Word,Whitespace,Word),
    Whitespace,CloseBrace
  ),
  Whitespace
)

Does the editor content conform to the grammar when this happens?

Would it be possible to set up a minimal editor that displays the problem?

Yes, the content in the editor conforms to the grammar. Running the parser directly against the same text produces a tree without any errors.

The editor setup in the linked repo (https://github.com/JuneKelly/cm6-fold-bug-demo), is a (fairly) small reproduction of the issue. I found that the problem goes away when there is less text inside the Block node, and the problem seems to be caused by having lots of skip nodes inside a foldable block, so it's hard to recreate the problem with smaller text. Quantity seems to be part of the problem :)

Oh, I somehow missed the repository link. That helped isolate this. It was a bug in incremental parsing. Attached patch should help.

Fabulous, thanks!

Hi @marijnh,

I just want to double check if the following logic is now correct. We implemented a LogQL parser with lezer: https://github.com/grafana/lezer-logql

Within that repo we have a tree-viz.html file, which might help with our grammar.

However, since this change whitespaces after certain nodes are counted to that node.
Example is {} + {} for our grammar. If you put that into the tree-viz.html you'll see, that the Selector node will be from 0 to 2 on lezer versions before 0.16.3. With the change of this PR, the node will go from 0 to 3.

Is this the expected behavior now?

Is the input valid or invalid when this happens?

The input is valid.

Missed to add, that a Selector in our case is this grammar https://github.com/grafana/lezer-logql/blob/main/src/logql.grammar#L23:

Selector {
  "{" Matchers "}" |
  "{" Matchers "," "}"
}

The parser will match any whitespace after the Selector though.

The input is valid.

But then the code changed in lezer-parser/lr@e01aecb will never run, so it seems unlikely that this is related to that patch.

Hm, I tried with 0.16.3 and the whitespaces were added. With 0.16.2 there wasn't a whitespace added.

You can change the lezer-parser versions easily in https://github.com/grafana/lezer-logql/blob/main/tools/tree-viz.html.

Sorry! The input is not valid. You are right:
image

Then this is expected behavior. The parser stops trying to parse the selector at the + token, and cuts off the node it started at that point.