`next`: Can multi-line nodes with missing RHS span all whitespace?
lionel- opened this issue · 7 comments
Consider:
text <- "1 +
"
parser_parse(parser, text)
(program [(0, 0), (1, 2)]
(binary_operator [(0, 0), (1, 0)]
lhs: (float [(0, 0), (0, 1)])
operator: "+" [(0, 2), (0, 3)]
rhs: (identifier [(1, 0), (1, 0)])
)
)
and
text <- "1 +
"
parser_parse(parser, text)
(program [(0, 0), (2, 2)]
(binary_operator [(0, 0), (2, 0)]
lhs: (float [(0, 0), (0, 1)])
operator: "+" [(0, 2), (0, 3)]
rhs: (identifier [(2, 0), (2, 0)])
)
)
IIUC (I'm just assuming) the empty identifier token is inserted in the tree to avoid introducing an error?
It seems to be inserted after all trailing newlines. Could it be inserted after trailing whitespace as well? This would make it easier to find spanning nodes when the cursor is in this whitespace.
It seems the empty token isn't a general principle that can be counted on though. Is it intended that the following case doesn't produce an empty token for body
?
text <- "function()
"
parser_parse(parser, text)
(program [(0, 0), (1, 2)]
(function_definition [(0, 0), (1, 0)]
name: "function" [(0, 0), (0, 8)]
parameters: (parameters [(0, 8), (0, 10)]
"(" [(0, 8), (0, 9)]
")" [(0, 9), (0, 10)]
)
)
)
In any case, it would be helpul if all nodes that greedily swallow whitespace could span the largest whitespace span as possible. For instance in this case I would expect the function node to span [(0, 0), (1, 2)]
.
For function_definition
, the body
is optional()
so if there isn't one then it just doesn't appear at all, and that isn't an error
Line 186 in 03e6c38
For binary operators like +
, the RHS is not optional
Line 406 in 03e6c38
So, yes, the insertion of the zero width rhs: (identifier [(1, 0), (1, 0)])
is tree-sitter's error recovery in action. It detects that it "looks like" we are in a binary +
node, and it determines that we need to have some R $._expression
on the RHS to complete the +
node, even if it isn't there. I'm not sure why it chooses identifier
over some other named node that make up the _expression
choice()
call (
Line 492 in 03e6c38
In theory we should be able to show that (identifier [(1, 0), (1, 0)])
as MISSING
as well, but due to a tree-sitter bug we cannot detect that using the API at the moment.
I am mixed on our usage of optional()
, as:
- It allows us to detect partial code like
function()
asfunction_definition
s without relying on tree-sitter's error recovery, which is good but not always right - At the high cost of making the grammar more ambiguous in many cases
- And the cost of requiring usage of
prec.right()
in some cases, where if we followed the R bison grammar we'd useprec.left()
. The tree-sitter maintainers have said that you shouldn't try and exactly translate a bison grammar, but it is still a useful reference.
see this for a discussion of the prec.right/left issue
Lines 16 to 37 in 03e6c38
We may be forced to use optional()
in some cases though, as we want dplyr::
to reliably be detected as a namespace_operator
regardless of whether or not a RHS was provided, so we can try and provide autocomplete
I'm currently struggling with the lack of whitespace tokens in TS, and trying to use the empty token as a substitute to figure out where the cursor is. One difficulty is that the empty token might be behind the cursor as the latter is inserted after newlines but not after spaces.
Regarding optional bodies, it sounds like the lack of any identifiable node at point will make it harder to write generic algorithms. For instance if an algorithm is looking for the parent of the node at point, and there is no node at point after e.g. repeat\n <CURSOR>
, then we're left with selecting repeat
but that's not the right starting point.
That's why I'm considering wrapping the TS tree with an actual CST where whitespace is represented with actual tokens that have parents. I think this would make it much easier to reason about the position of a cursor. There's something elegant about being able to reproduce the original source code exactly by collecting tokens in order.
To avoid complex ad hoc code in the generation of this CST though, it would be nice if our TS parser was as consistent and as helpful as possible regarding node spans with missing children.
As of #132 we no longer have near as many strange optional()
cases, so the case of the function()
with no body behaves more as you'd expect (we can also finally report the recovered node as MISSING
now)
library(treesitter)
library(treesitter.r) # devel version
text <- "function()
"
parser_parse(parser, text)
#> <tree_sitter_tree>
#>
#> ── Text ───────────────────────────────────────────────────────────────────────────────────────────
#> function()
#>
#>
#> ── S-Expression ───────────────────────────────────────────────────────────────────────────────────
#> (program [(0, 0), (1, 2)]
#> (function_definition [(0, 0), (1, 0)]
#> name: "function" [(0, 0), (0, 8)]
#> parameters: (parameters [(0, 8), (0, 10)]
#> open: "(" [(0, 8), (0, 9)]
#> close: ")" [(0, 9), (0, 10)]
#> )
#> body: (identifier MISSING [(1, 0), (1, 0)])
#> )
#> )
The injected missing body
node is still at (1, 0)
though, and does not consume all the whitespace. I'm pretty sure we are going to have to live with that?
Hopefully we can fix in the conversion from ts to rowan?
Yea I think I'm going to close this as I don't think it will be fixed in tree-sitter, but we can always revisit if we decide we can't work around it elsewhere