elixir-lang/tree-sitter-elixir

allow whitespace to be a _terminator where possible?

the-mikedavis opened this issue · 2 comments

I've been fiddling with the (H)EEx queries lately trying to find something that works really well. There are some tough cases that I think might be solveable with changes to this grammar.

The main issue is combined injections. Consider some EEx like so:

<%= Enum.map(@charges, fn charge -> %>
  <%= humanize_money(charge) %>
<% end) %>

Here lines 1 and 3 need injection.combined (like with tree-sitter-iex) so that the end is parsed as part of the same document as the fn .. ->. Line 2 can be parsed regularly (no combined injections). For the combined parts, this looks like so to tree-sitter-elixir:

Enum.map(@charges, fn charge ->  end) 

That case works pretty well with the current grammar and queries in tree-sitter-eex, but injection.combined introduces some problems. Consider this EEx:

<%= Enum.map(@charges, fn charge -> %><% end) %>
<%= Enum.map(@charges, fn charge -> %><% end) %>

With the current grammar and queries, this produces an (ERROR) on the second Enum. This happens because the EEx grammar is consuming the newline. To tree-sitter-elixir, this example looks like:

Enum.map(@charges, fn charge -> end)  Enum.map(@charges, fn charge -> end) 

How does tree-sitter-embedded-template fix this?

In this example of an ejs template from the expressjs repo,

  <% posts.forEach(function(post) { %>
    <dt><%= post.title %></dt>
    <dd><%= post.body %></dd>
  <% }) %>

All of the inner JavaScript contents are injection.combined, so this comes out like so for tree-sitter-javascript

posts.forEach(function(post) { post.title post.body })

Which tree-sitter-javascript parses as if it were valid JavaScript (which, to my knowledge, it is not). This happens in tree-sitter-ruby as well with erb templates: the grammars are intentionally more permissive around newlines and whitespace than the language actually allows.

Possible solutions

I see a few ways around this to make the EEx injections work as well as ejs/erb.

The first is to try to allow spaces to be $._terminators whenever it doesn't introduce abiguity. I tried some minor edits to the grammar and it seems like this approach is easier said than done because of things like function calls without parens. This would also sacrifice the current parity between this parser and the parser in the Elixir compiler, which seems a bit distasteful to me.

The second would be to propose a change to injections upstream in tree-sitter to try to introduce a way to limit the scope of a combined injection. Currently all injections which are marked as combined globally in a document are parsed together in one "combined" sub-document. If the EEx grammar were rewritten as in this branch to group EEx tags based on start/middle-expressions and end-expressions, and then if we could combine only the tags within each group, then we wouldn't need to care about $._terminators at all. In the example above, tags one and two would be grouped and three and four grouped, but only tags one and two woud be combined with one another, and tags three and four with one another. This is nice because it mimics the EEx.Tokenizer, but it's unclear if a change like this would be accepted upstream in tree-sitter since no other language has needed anything like it.

Thirdly there's a way to fix this particular case by parsing newlines in the eex grammar and handing them over to the elixir grammar when injecting (see here) but it's not general enough to handle some odd cases, like if the example above had all four tags on the same line.

The last resort as I see it would be to rewrite the EEx grammar to depend on the Elixir grammar, so it can have fine-grained control over how expressions within EEx tags are terminated. This seems potentially brittle and certainly more of a maintenance burden, though, so I think this should be avoided if possible.

What do you think @jonatanklosko?

see also connorlay/tree-sitter-eex#1
connects #2

This would also sacrifice the current parity between this parser and the parser in the Elixir compiler, which seems a bit distasteful to me.

There is some precedence for this in the "offlcial grammars" though:

tree-sitter/tree-sitter-rust#70

This does permit things like raw loop labels, which are technically not allowed, but I think that's ok. In general, I prefer to err on the side of performance/simplicity rather than making sure to disallow things that the spec disallows.

@the-mikedavis thanks a lot for investigating this!

Keeping newlines for the injection conceptually sounds like a good idea, but we should focus on the most problematic cases first. I'm gonna list a few to make sure I got this right.

Case 1.

<%= foo %><%= bar %>

the Elixir code that we parse is:

foo bar

Case 2.

<%=format(x)%>: <%=format(y)%>

becomes

format(x)format(y)

Case 3.

And similarly

<%=100%><%=variable%>

becomes

100variable

Realistically speaking case 1. is the most concerning, because it's very likely to appear. And by gluing the expressions together, we would always highlight one as function and the other as the argument, which is inconsistent.

It's also possible that someone doesn't use whitespace inside EEx interpolation (cases 2. and 3.), so then it doesn't matter how strict/loose we are about whitespace/terminators, we're gonna get invalid syntax.


The idea of parsing injections in groups sounds like a good direction (assuming it would be possible upstream), however then there is a question if we can group the expressions in a robust way. I'm not sure if we can cover everything with simple checks like do/->/fn, but this may be our best shot.


This seems to be a very general topic with respect to injections, so cc-ing @maxbrunsfeld for any high-level thoughts and to validate if grouping injections seem viable to implement in tree-sitter.