tree-sitter/tree-sitter-bash

Incorrect result of ts_node_first_child_for_pos in a heredoc_body

casouri opened this issue · 8 comments

In a following source:

echo <<EOF
text1 $var
text2 $(echo cmd)
EOF

The return value of ts_node_first_child_for_byte(heredoc_body, 22), where heredoc_body is the heredoc_body node, and 22 is at the beginning of "text2", is expected to be the command_substitution node representing $(echo cmd). However, the function returns a null node.

But if I evaluate ts_node_first_child_for_byte(heredoc_body, 12), where 12 is at the beginning of "text1", the return value is expected: the simple_expansion representing $var. So I don't know what could be the cause.

Here is the program I used to test this:

#include <string.h>
#include <stdio.h>
#include <tree_sitter/api.h>

TSLanguage *tree_sitter_bash();

int main() {

  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_bash());


  char *source = "echo <<EOF\n"
	"text1 $var\n"
	"text2 $(echo cmd)\n"
	"EOF\n";


  TSTree *tree = ts_parser_parse_string(parser, NULL, source, strlen(source));

  TSNode root = ts_tree_root_node(tree);
  TSNode heredoc_body = ts_node_child(root, 2);
  /* 12 = beginning of "text1".  */
  TSNode child_at_12 = ts_node_first_child_for_byte(heredoc_body, 12);
  /* 12 = beginning of "text2".  */
  TSNode child_at_22 = ts_node_first_child_for_byte(heredoc_body, 22);

  if (!ts_node_is_null(child_at_12))
	printf("Child at 12 (beg of \"text1\") is %s\n", ts_node_type(child_at_12));
  if (ts_node_is_null(child_at_22))
	printf("Child at 22 (beg of \"text2\") is null: unexpected\n");
  printf("Parse tree:\n%s\n", ts_node_string(root));
  return 0;
}

Compiled using

gcc bash-bug.c -o bash-bug -ltree-sitter -ltree-sitter-bash -L.

If I run it, I get

Child at 12 (beg of "text1") is simple_expansion
Child at 22 (beg of "text2") is null: unexpected
Parse tree:
(program (redirected_statement body: (command name: (command_name (word))) redirect: (heredoc_redirect (heredoc_start))) (heredoc_body (simple_expansion (variable_name)) (command_substitution (command name: (command_name (word)) argument: (word)))))

I'm using the latest master branch of tree-sitter-bash and tree-sitter 0.20.7. TIA!

Thanks for the detailed report @casouri.

@maxbrunsfeld

Hello, I believe I have found a similar issue in tree-sitter-clojure: sogaiu/tree-sitter-clojure#32

I have based the reproduction recipe off of the one @casouri submitted. I'm not sure what we can do about it in the grammar itself. Now that it is seen here and in the clojure grammar I am wondering if it is a bug in tree-sitter itself.

@casouri I continued investigating @dannyfreeman's sogaiu/tree-sitter-clojure#32 case and have turned up some bits that might be relevant to this issue.

I've summarized a bit here.

Update: On second thought, it might be better to disregard the above for now -- I'm not seeing a difference in debug graph output for this issue. Sorry for the false alarm.

@casouri @maxbrunsfeld @dannyfreeman

The current implementation of ts_node__first_child_for_byte is:

static inline TSNode ts_node__first_child_for_byte(
  TSNode self,
  uint32_t goal,
  bool include_anonymous
) {
  TSNode node = self;
  bool did_descend = true;

  while (did_descend) {
    did_descend = false;

    TSNode child;
    NodeChildIterator iterator = ts_node_iterate_children(&node);
    while (ts_node_child_iterator_next(&iterator, &child)) {
      if (ts_node_end_byte(child) > goal) {
        if (ts_node__is_relevant(child, include_anonymous)) {
          return child;
        } else if (ts_node_child_count(child) > 0) {
          did_descend = true;
          node = child;
          break;
        }
      }
    }
  }

  return ts_node__null();
}

IIUC, this implementation doesn't appear to handle this issue's case or that of sogaiu/tree-sitter-clojure#32 because there is some "climbing back up" that needs to happen (which doesn't).

Here's the tail end of debug graph output I get for the source text in the sample code:

issue-139-smaller

In the figure above, there comes a time when the lower heredoc_body_repeat1 is examined. After looking at its two children (simple_expansion and _heredoc_body_middle), because the second of the children -- _heredoc_body_middle -- does not have any children (note for future that did_descend does not get set to true and remains false), the inner while loop is reached a third time, but then ts_node_child_iterator_next() returns false. This ultimately leads to the whole function returning the value of ts_node__null() (as did_descend is false so the outer while loop is exited). It seems to me that rather than do that, there should be some "climbing back up" which should lead to examining command_substitution.

I tried replacing the current implementation of ts_node__first_child_for_byte with this potentially much less efficient (but possibly a bit better behaved) implementation:


static inline TSNode ts_node__first_child_for_byte(
  TSNode self,
  uint32_t goal,
  bool include_anonymous
) {
  TSNode child;
  uint32_t count = ts_node_child_count(self);

  for (uint32_t i = 0; i < count; i++) {
    child = ts_node_child(self, i);
    if (ts_node_end_byte(child) > goal) {
      if (ts_node__is_relevant(child, include_anonymous)) {
        return child;
      }
    }
  }

AFAICT, with this alternative implementation, the sample code seems to behave better (though it may not be efficient if ts_node_child_count gets called repeatedly). This alternative implementation also yielded better results for sogaiu/tree-sitter-clojure#32.

I take it that calling ts_node_child repeatedly is undesirable as it appears similar work will get performed unnecessarily in each subsequent call as subsequent children are requested.

Here's the log.html I got for reference: log.html.gz

image

The current heredoc implementation ate the words text1 and text2 in the body of the heredoc.
/cc: @amaanq

it's because it's _heredoc_body_middle which is hidden, do we want to expose that as heredoc_content?

var=VAR
cat <<EOF
    text1 $var
    text2 $(echo cmd)
EOF

image
image

Lost of the prefix words.


cat <<"EOF"
    text1 $var
    text2 $(echo cmd)
EOF

image
image

Lost of the first heredoc line.

it's because it's _heredoc_body_middle which is hidden, do we want to expose that as heredoc_content?

Yes we want, for all visible things there should be nodes in the tree.

I think the _heredoc_body_middle should be exposed like heredoc_literal.

Also the heredoc_body shouldn't be a single multiline node that may represent the whole heredoc body and instead of it should contain one or more heredoc_literal nodes for every heredoc line (without trailing \n).

Between heredoc_literal nodes there may be various variable or command expansion nodes.

/cc: @amaanq