Incorrect result of ts_node_first_child_for_byte in source node
dannyfreeman opened this issue · 10 comments
I've discovered a bug that is very similar to tree-sitter/tree-sitter-bash#139
It can be reproduced with this C file
#include <string.h>
#include <stdio.h>
#include <tree_sitter/api.h>
TSLanguage *tree_sitter_clojure();
void print_src(char *source) {
printf("\n---------\n");
char form1[11] = { 0 };
strncpy(form1, source, 10);
printf("%s", form1); // includes 1 newline
char form2[12] = { 0 };
strncpy(form2, source+10, 11);
printf("%s", form2);
char form3[13] = { 0 };
strncpy(form3, source+21, 12);
printf("%s", form3);
printf("---------\n\n");
return;
}
int main() {
TSParser *parser = ts_parser_new();
ts_parser_set_language(parser, tree_sitter_clojure());
char *source =
"(def x a)\n" // 0-9
"\n" // 10
"{def y 2}\n" // 11-21
"\n" // 22
"[def z 3]\n"; // 23-33
print_src(source);
TSTree *tree = ts_parser_parse_string(parser, NULL, source, strlen(source));
TSNode root = ts_tree_root_node(tree);
TSNode child_x = ts_node_first_child_for_byte(root, 0);
TSNode child_y = ts_node_first_child_for_byte(root, 10);
TSNode child_z = ts_node_first_child_for_byte(root, 22);
if (!ts_node_is_null(child_x))
printf("Child at 0 (beg of \"(def x a)\") is %s\n", ts_node_type(child_x));
else
printf("Child at 0 is null: unexpected\n");
if (!ts_node_is_null(child_y))
printf("Child at 10 (beg of \"{def y 2}\") is %s\n", ts_node_type(child_y));
else
printf("Child at 10 is null: unexpected\n");
if (!ts_node_is_null(child_z))
printf("Child at 22 (beg of \"[def z 3]\") is %s\n", ts_node_type(child_z));
else
printf("Child at 22 is null: unexpected");
printf("\nParse tree:\n%s\n", ts_node_string(root));
return 0;
}To build, I save this C file as bug.c in the root directory of tree-sitter-clojure.
I have the tree-sitter repo cloned as a sibling directory of tree-sitter-clojure.
In the tree-sitter repo, I run
makeThen in the tree-sitter-clojure repo I run
cc bug.c -o bug -I ../tree-sitter/lib/include src/parser.c ../tree-sitter/libtree-sitter.a && ./bug
The output is:
Child at 0 (beg of "(def x a)") is list_lit
Child at 10 is null: unexpected
Child at 22 (beg of "[def z 3]") is vec_lit
Where I would expect it to be
Child at 0 (beg of "(def x a)") is list_lit
Child at 10 (beg of "{def y 2}") is map_lit
Child at 22 (beg of "[def z 3]") is vec_lit
This bug manifests itself in emacs 29 in the treesit-end-of-defun command, which will skip to the end of the file from positions 10 or 11.
M-x goto-char RET 10 RET C-M-e
M-x goto-char RET 11 RET C-M-e
A detailed bug report for Emacs can be found in the mailing list here: https://lists.gnu.org/archive/html/bug-gnu-emacs/2022-12/msg01653.html
Thanks for the detailed report and investigation.
I was able to reproduce via Emacs as well as via your .c code.
To add slightly more detail regarding the trail in Emacs, using edebug I saw that calling treesit-end-of-defun eventually results in a call to treesit-node-first-child-for-pos here. IIUC, that function is implemented in C here. I guess that results in a call to ts_node_first_child_for_byte here, though I did not observe that part firsthand.
Since it seems to also occur in tree-sitter/tree-sitter-bash#139, perhaps it is a tree-sitter issue.
I tried modifying the .c program a bit to probe the source string more fully:
#include <string.h>
#include <stdio.h>
#include <tree_sitter/api.h>
TSLanguage *tree_sitter_clojure();
int main() {
TSParser *parser = ts_parser_new();
ts_parser_set_language(parser, tree_sitter_clojure());
char *source =
"(def x a)\n" // 0-9
"\n" // 10
"{def y 2}\n" // 11-20
"\n" // 21
"[def z 3]\n"; // 22-31
printf("---------\n");
printf("%s", source);
printf("---------\n\n");
TSTree *tree = ts_parser_parse_string(parser, NULL, source, strlen(source));
TSNode root = ts_tree_root_node(tree);
TSNode child;
for (int i = 0; i < (int)strlen(source); ++i) {
child = ts_node_first_child_for_byte(root, i);
if (!ts_node_is_null(child)) {
printf("Child at %i is %s %s\n", i, ts_node_type(child), ts_node_string(child));
} else {
printf("Child at %i is null\n", i);
}
}
printf("\nParse tree:\n%s\n", ts_node_string(root));
return 0;
}
I got the following output here:
---------
(def x a)
{def y 2}
[def z 3]
---------
Child at 0 is list_lit (list_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)))
Child at 1 is list_lit (list_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)))
Child at 2 is list_lit (list_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)))
Child at 3 is list_lit (list_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)))
Child at 4 is list_lit (list_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)))
Child at 5 is list_lit (list_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)))
Child at 6 is list_lit (list_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)))
Child at 7 is list_lit (list_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)))
Child at 8 is list_lit (list_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)))
Child at 9 is null
Child at 10 is null
Child at 11 is map_lit (map_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 12 is map_lit (map_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 13 is map_lit (map_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 14 is map_lit (map_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 15 is map_lit (map_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 16 is map_lit (map_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 17 is map_lit (map_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 18 is map_lit (map_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 19 is map_lit (map_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 20 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 21 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 22 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 23 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 24 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 25 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 26 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 27 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 28 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 29 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 30 is vec_lit (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit))
Child at 31 is null
Parse tree:
(source (list_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name))) (map_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit)) (vec_lit value: (sym_lit name: (sym_name)) value: (sym_lit name: (sym_name)) value: (num_lit)))
It seems there are at least 2 positions that yield null unexpectedly (9 and 10). Not sure how to interpret the result at index 31 -- may be that is expected.
(Note: I made a few iterations on the content above.)
That's much more helpful than my output, thanks for fixing it up. I'm not sure what to think about the result at position 31. I don't think it would cause the same kind of issue in Emacs, but for other tools I don't know.
Here is a branch with the source and a makefile: master...issue-32. Reproducing should be a matter of running make now
I tried it out and it worked fine.
Thanks!
The content below can probably be skipped, but I'm leaving it here for future reference.
I've tried a slightly simpler source file:
1
2
3
This leads to some unexpected null results via ts_node_first_child_for_byte as well.
I examined the tree produced (visible via log.html that is produced via tree-sitter parse --debug-graph).
The structure of the tree seemed odd to me.
So I repeated using:
1
2
3
4
A similar oddness seemed to be present.
I tried to manually go over the shift-reduce log to recreate the tree by hand and it seemed to not match what I was seeing as the final tree output.
Then I tried commenting out the following line in tree-sitter itself:
ts_subtree_balance(self->finished_tree, &self->tree_pool, self->language);
Using a subsequently built tree-sitter cli yielded a tree that was much less odd (though quite unbalanced perhaps unsurprisingly).
May be it's possible that there's an issue with ts_subtree_balance?
Update: Giving it some more thought, ATM, I'm thinking it's not an issue with ts_subtree_balance.
I tried an alternate implementation of ts_node__first_child_for_byte and it seemed to handle the case in this issue more appropriately -- but it is likely not satisfactorily efficient.
I posted a write-up of the some of the details in the context of tree-sitter/tree-sitter-bash#139.
Specifically for use in Emacs, it seems possible to apply Emacs' treesit.c's treesit_search_dfs (or something built on it) to do an appropriate search with a predicate in order to work-around the current situation for ts_node__first_child_for_byte.
Perhaps there are better methods than this though.
Thanks for all this really in depth research! I'll look in to work around this limitation in Emacs soon and report back with the results.
Since filing tree-sitter/tree-sitter#2012 it appears it was acknowledged (at least labeled) as a bug in tree-sitter's C-library. It also seems that Emacs folks has worked around the issue.
I wonder whether we need to track the status of the bug in this repository.
Any thoughts?
Since it's being tracked in tree-sitter upstream I'd agree it is not necessary to track here. I'll close the issue. If it ends up requiring some change on our end I can open it back up.


