Strange behavior with strings?
jrfaller opened this issue ยท 6 comments
Hi, and thanks a lot for the amazing work on this PHP parser!
I am writing a generic tree transformer for tree-sitter grammars in order to integrate them in a structural diff tool (https://github.com/GumTreeDiff/gumtree) and I was currently trying to integrate PHP thanks to your awesome grammar.
I noticed something weird for the case of strings. If I parse
<?php foo("test") ?>
I obtain (syntax is this: type: associated_text [start_pos,end_pos]
, I only retain the associated text on leafs)
program [0,21]
php_tag: <?php [0,5]
expression_statement [6,17]
function_call_expression [6,17]
name: foo [6,9]
arguments [9,17]
(: ( [9,10]
argument [10,16]
encapsed_string [10,16]
": " [10,11]
string: test [11,15]
": " [15,16]
): ) [16,17]
text_interpolation [18,20]
?>: ?> [18,20]
Which is perfect! (perhaps thinking again I am not sure to understand the text_interpolation node but this is not a problem for my usage)
However for the program
<?php foo('test') ?>
I get
php_tag: <?php [0,5]
expression_statement [6,17]
function_call_expression [6,17]
name: foo [6,9]
arguments [9,17]
(: ( [9,10]
argument [10,16]
string [10,16]
': ' [10,11]
': ' [15,16]
): ) [16,17]
text_interpolation [18,20]
?>: ?> [18,20]
Notice that in this case the value associated to the string
production is not in any child of the string
typed node, as it was the case for the encapsed_string
and it prevents me to associate the string value to the string node since I retain only the text for leaf nodes.
Is it a normal behavior? Would it be too much hassle to make it work like encapsed_string
work?
Cheers!
OK I have checked with several other grammars and it seems a common behavior in fact for instance print("string")
in python gives :
module [0,13]
expression_statement [0,12]
call [0,12]
identifier: print [0,5]
argument_list [5,12]
(: ( [5,6]
string [6,11]
": " [6,7]
": " [10,11]
): ) [11,12]
so I guess I will have to adjust the behavior of my tree transformer!
Thanks for reporting this! Would it be sufficient for your use case if we added a 'value' field to the actual string body?
I can totally live with a tree rewrite on my end, and I do not want to mess with the structure of your grammar if it does not make sense. But by looking into it, I think that maybe there is something a little strange in the current way of handling string, which make them not really homogeneous:
- there is
encapsed_string
typed nodes with opening and closing"
andstring
children - there is
string
typed nodes with opening and closing'
children, not to be confounded with thestring
typed node that can be the child of anencapsed_string
and have no child
Maybe it would be easier to process to have
encapsed_string
typed nodes with opening and closing"
andstring_value
children (or something like that)string
typed nodes with opening and closing'
andstring_value
children
Again, I don't want you to mess up the grammar, I don't have the full picture of the productions involving strings and maybe there are some additional children in some cases I did not see so if you think it does not make sense don't hesitate to close the issue ^^
Cheers!
Thank you for your detailed explanation and for being so kind and polite about it! ๐
What you're pointing out is an inconsistency I've been meaning to address, but haven't yet gotten around to. I think your proposal of adding a child node to string
and reusing this inside encapsed_string
where we're using string
directly today is a good solution.
@maxbrunsfeld Any objections to adding a child node to our current string
node? This will be a breaking change from previous releases when it comes to this particular node. However, note that we've had to make other (currently unreleased) breaking changes already, like the precedence change for .
(I'll also have to admit being the one who introduced this inconsistency when I was new to parser design/development)
Yeah, I'm fine with adding a string_value
child, or something like that.
I'm just guessing here since I'm new to tree-sitter,
I guess this is considered a correct result?
EDIT: This seems to be a difference between the first one being a string
and the second one being a encapsed_string
, so it's probably not a parser error. It's probably something missing somewhere else... Like somewhere in the Emacs wrapper for tree-sitter or maybe the theme I'm using for Emacs.