r-lib/tree-sitter-r

`next`: Can multi-line nodes with missing RHS span all whitespace?

lionel- opened this issue · 7 comments

Consider:

text <- "1 +
  "

parser_parse(parser, text)
(program [(0, 0), (1, 2)]
  (binary_operator [(0, 0), (1, 0)]
    lhs: (float [(0, 0), (0, 1)])
    operator: "+" [(0, 2), (0, 3)]
    rhs: (identifier [(1, 0), (1, 0)])
  )
)

and

text <- "1 +

  "

parser_parse(parser, text)
(program [(0, 0), (2, 2)]
  (binary_operator [(0, 0), (2, 0)]
    lhs: (float [(0, 0), (0, 1)])
    operator: "+" [(0, 2), (0, 3)]
    rhs: (identifier [(2, 0), (2, 0)])
  )
)

IIUC (I'm just assuming) the empty identifier token is inserted in the tree to avoid introducing an error?

It seems to be inserted after all trailing newlines. Could it be inserted after trailing whitespace as well? This would make it easier to find spanning nodes when the cursor is in this whitespace.

It seems the empty token isn't a general principle that can be counted on though. Is it intended that the following case doesn't produce an empty token for body?

text <- "function()
  "

parser_parse(parser, text)
(program [(0, 0), (1, 2)]
  (function_definition [(0, 0), (1, 0)]
    name: "function" [(0, 0), (0, 8)]
    parameters: (parameters [(0, 8), (0, 10)]
      "(" [(0, 8), (0, 9)]
      ")" [(0, 9), (0, 10)]
    )
  )
)

In any case, it would be helpul if all nodes that greedily swallow whitespace could span the largest whitespace span as possible. For instance in this case I would expect the function node to span [(0, 0), (1, 2)].

For function_definition, the body is optional() so if there isn't one then it just doesn't appear at all, and that isn't an error

optional(field("body", $._expression))

For binary operators like +, the RHS is not optional

field("rhs", $._expression)

So, yes, the insertion of the zero width rhs: (identifier [(1, 0), (1, 0)]) is tree-sitter's error recovery in action. It detects that it "looks like" we are in a binary + node, and it determines that we need to have some R $._expression on the RHS to complete the + node, even if it isn't there. I'm not sure why it chooses identifier over some other named node that make up the _expression choice() call (

_expression: $ => choice(
), but the key is that it is zero width.

In theory we should be able to show that (identifier [(1, 0), (1, 0)]) as MISSING as well, but due to a tree-sitter bug we cannot detect that using the API at the moment.

I am mixed on our usage of optional(), as:

  • It allows us to detect partial code like function() as function_definitions without relying on tree-sitter's error recovery, which is good but not always right
  • At the high cost of making the grammar more ambiguous in many cases
  • And the cost of requiring usage of prec.right() in some cases, where if we followed the R bison grammar we'd use prec.left(). The tree-sitter maintainers have said that you shouldn't try and exactly translate a bison grammar, but it is still a useful reference.

see this for a discussion of the prec.right/left issue

tree-sitter-r/grammar.js

Lines 16 to 37 in 03e6c38

// ---------------------------------------------------------------------------------------
// NOTE ON PREC.RIGHT:
//
// A few things in this table are left associative in R's grammar, but we are forced to
// make them right associative to get the behavior we want. This includes:
// - $, @
// - ::, :::
//
// We are forced to do this because we want these nodes to have an `optional()` RHS.
// While `dplyr::` isn't parsable R code, we want it to be recognized as a `::` node with
// a known package name, as this helps us generate completions.
//
// The trailing `optional()` then means there are two interpretations of `foo::bar`:
// - [foo::][bar] == [namespace][identifier]
// - [foo::bar] == [namespace]
//
// We want to force the latter, which means that the namespace rule has to be right
// associative to "prefer matching a rule that ends later".
//
// In practice we don't think this will matter for the rules we've had to swap the order
// on, since they are typically the only things at their numeric precedence rank.
// ---------------------------------------------------------------------------------------

We may be forced to use optional() in some cases though, as we want dplyr:: to reliably be detected as a namespace_operator regardless of whether or not a RHS was provided, so we can try and provide autocomplete

I'm currently struggling with the lack of whitespace tokens in TS, and trying to use the empty token as a substitute to figure out where the cursor is. One difficulty is that the empty token might be behind the cursor as the latter is inserted after newlines but not after spaces.

Regarding optional bodies, it sounds like the lack of any identifiable node at point will make it harder to write generic algorithms. For instance if an algorithm is looking for the parent of the node at point, and there is no node at point after e.g. repeat\n <CURSOR>, then we're left with selecting repeat but that's not the right starting point.

That's why I'm considering wrapping the TS tree with an actual CST where whitespace is represented with actual tokens that have parents. I think this would make it much easier to reason about the position of a cursor. There's something elegant about being able to reproduce the original source code exactly by collecting tokens in order.

To avoid complex ad hoc code in the generation of this CST though, it would be nice if our TS parser was as consistent and as helpful as possible regarding node spans with missing children.

As of #132 we no longer have near as many strange optional() cases, so the case of the function() with no body behaves more as you'd expect (we can also finally report the recovered node as MISSING now)

library(treesitter)
library(treesitter.r) # devel version

text <- "function()
  "

parser_parse(parser, text)
#> <tree_sitter_tree>
#> 
#> ── Text ───────────────────────────────────────────────────────────────────────────────────────────
#> function()
#>   
#> 
#> ── S-Expression ───────────────────────────────────────────────────────────────────────────────────
#> (program [(0, 0), (1, 2)]
#>   (function_definition [(0, 0), (1, 0)]
#>     name: "function" [(0, 0), (0, 8)]
#>     parameters: (parameters [(0, 8), (0, 10)]
#>       open: "(" [(0, 8), (0, 9)]
#>       close: ")" [(0, 9), (0, 10)]
#>     )
#>     body: (identifier MISSING [(1, 0), (1, 0)])
#>   )
#> )

The injected missing body node is still at (1, 0) though, and does not consume all the whitespace. I'm pretty sure we are going to have to live with that?

Hopefully we can fix in the conversion from ts to rowan?

Yea I think I'm going to close this as I don't think it will be fixed in tree-sitter, but we can always revisit if we decide we can't work around it elsewhere