different nodes for multi-clause directives?
the-mikedavis opened this issue ยท 5 comments
๐ happy new year @connorlay! Sorry for this massive wall of text I'm dropping in your issue tracker ๐
Helix is right on the cusp of having combined injections (helix-editor/helix#1378) which make injections for multi-directive expressions possible (e.g. <%= if true do %>..<% end %>
having its elixir expressions joined and handled as one combined tree-sitter node).
I noticed that there's some odd behavior with heex and eex with combined injections because the $.code
nodes don't contain a trailing newline, which ends up making non-multi-directive expressions (expressions not split across multiple directives) get parsed as errors by the elixir grammar (in which newlines are syntactically important). For example, I notice some errors popping up when you have two function calls right next to each other:
So the directives are being combined and the resulting elixir code comes out to:
csrf_meta_tag() live_title_tag assigns[:page_title] || "Foo", suffix: " ยท Foo"
Which the Elixir parser recognizes as having errors. Put in semicolons to separate the expressions, though, and it all looks great:
Why doesn't this happen with tree-sitter-embedded-template for the injected ruby (erb) and javascript (ejs)? I looked into those grammars and they're super permissive about newlines, so even take an ejs template like this one from the expressjs repo:
<% posts.forEach(function(post) { %>
<dt><%= post.title %></dt>
<dd><%= post.body %></dd>
<% }) %>
Which comes out to this javascript:
posts.forEach(function(post) { post.title post.body })
And tree-sitter-javascript says that this is valid, even though it's in fact nonsense javascript afaik. The same happens for erb/ruby. So really one approach for fixing this problem is for me to open up an issue/PR in tree-sitter-elixir that makes newline less syntactically important. (I really like how close-to-the-parser tree-sitter-elixir currently is, though, so I'd prefer to avoid doing that.)
Initially I was thinking that it might make sense for there to be a new directive in tree-sitter's query language for specifying a character used to join combined expressions (I think \n
would work in this case), but that got me curious how the HEEx and ultimately EEx compiler do this. Are those compilers inserting newlines? That led me to EEx.Tokenizer
(see this clause of tokenize/6
and token_key/2
). The tokenizer separates out directives into the records :start_expr
, :middle_expr
and :end_expr
based on the contained elixir tokens when the expression spans multiple directives, and :expr
otherwise. So my current thinking is that it'd be best for tree-sitter-eex and tree-sitter-heex (and maybe tree-sitter-surface) to do this same separation so that you would have a node type for complete expressions ($.directive
?) and a separate one for partially valid expressions that need to be combined ($.multi_clause_directive
?). That way you could write a query for the injections.scm like
([
(expression (code) @injection.content) ; for heex
(directive (code) @injection.content)
]
(#set! injection.language "elixir"))
((multi_clause_directive (code) @injection.content)
(#set! injection.combined)
(#set! injection.language "elixir"))
I'm a little worried about how difficult this could be implementation-wise. Would eex and heex (and surface?) grammars need to take a dependency on the elixir grammar to correctly determine which directives are partial expressions? Can we accomplish something reasonably close without pulling in the grammar? I'll hack around with this grammar and see if this is easier than I'm thinking ๐ต๏ธ
If I'm reading EEx.Tokenizer rules correctly from token_keys/2:
if the expression... | then it's a |
---|---|
starts with "end" and ends with "do" | middle_expr |
ends with a "do" | start_expr |
ends with a block-identifier (?) | middle_expr |
starts with an "end" and ends with a stab operator ("->") | middle_expr |
ends with a stab operator and (other more complicated rules) | start_expr/middle_expr |
"end" preceeded by /[\[\(\{]*/ |
end_expr |
otherwise | expr |
Despite my github-code-search-fu I can't seem to figure out what a block_identifier token is ๐ค
And that "end preceeded by..." rule code looks a little weird to me: why is closing_bracket?/1
checking against ~w"( [ {"a
? I can't really imagine any code that would look like that.
I think we can map that out to simpler rules without pulling in the elixir grammar given that we may not care what's a start/middle/end and just that it is/is-not an expr.
It's a multi-clause directive if the code:
- ends with "do", block-identifier, or "->"
- starts with
seq(/[\[\(\{]*/, "end")
the closing_bracket?/1 thing might be a bug, I'll investigate that separately elixir-lang/elixir@94fa4e5#diff-37ace2010a0277f8d6f8c940c500ceeecb5adf03ca2609fef69aa8b3b0f5a391L14
Hi @the-mikedavis, happy 2022 and thank you for the investigation into this issue! I agree we should avoid embedding the entire tree-sitter-elixir grammar if possible. I'll review your PRs and see if I spot any issues ๐
Ok I think I did not think this all the way through ๐
There's this case that comes up here and there on large templates:
Which would look like this to the elixir grammar because of the combined injections
I'm not sure what the best place to fix this is. On the one hand, like the javascript+jsx example above, it could make sense to fix this by having the elixir grammar allow space to be a $._terminator
, making the elixir grammar more permissive. On the other hand it might be fixable in this grammar by splitting up the new multi-clause directives into start/middle/end-expr like the EEx.Tokenizer does. That way you could write the injections like
((start_expression) @injection.content
(middle_expression)* @injection.content
(end_expression) @injection.content
(#set! injection.combined)
(#set! injection.include-children)
(#set! injection.language "elixir"))
for multi-clause directives, and you would only be combining each expression, so you wouldn't need any change in the elixir grammar.
I'll give this a try here, but if that doesn't work I'll open an issue in the elixir grammar ๐
I tried out reorganizing the expressions so you have (directive_group)
s that start with a start/middle directive, contain pretty much anything, and end with an ending directive, (branch is here) but unfortunately this query:
((directive_group (directive (partial_expression) @injection.content))
(#set! injection.combined)
(#set! injection.include-children)
(#set! injection.language "elixir"))
doesn't work the way I hoped. It combines the (partial_expression)
s from all (directive_group)
s despite them being different matches, so the problem persists.
I think this means that it either has to be solved in the elixir grammar or injections need a new #set!
property to change the behavior. I'll noodle on it some more but I don't think it's possible to fix this only with changes to the eex grammar