tree-sitter/tree-sitter-php

Strange behavior with strings?

jrfaller opened this issue ยท 6 comments

Hi, and thanks a lot for the amazing work on this PHP parser!

I am writing a generic tree transformer for tree-sitter grammars in order to integrate them in a structural diff tool (https://github.com/GumTreeDiff/gumtree) and I was currently trying to integrate PHP thanks to your awesome grammar.

I noticed something weird for the case of strings. If I parse

<?php foo("test") ?>

I obtain (syntax is this: type: associated_text [start_pos,end_pos], I only retain the associated text on leafs)

program [0,21]
    php_tag: <?php [0,5]
    expression_statement [6,17]
        function_call_expression [6,17]
            name: foo [6,9]
            arguments [9,17]
                (: ( [9,10]
                argument [10,16]
                    encapsed_string [10,16]
                        ": " [10,11]
                        string: test [11,15]
                        ": " [15,16]
                ): ) [16,17]
    text_interpolation [18,20]
        ?>: ?> [18,20]

Which is perfect! (perhaps thinking again I am not sure to understand the text_interpolation node but this is not a problem for my usage)

However for the program

<?php foo('test') ?>

I get

    php_tag: <?php [0,5]
    expression_statement [6,17]
        function_call_expression [6,17]
            name: foo [6,9]
            arguments [9,17]
                (: ( [9,10]
                argument [10,16]
                    string [10,16]
                        ': ' [10,11]
                        ': ' [15,16]
                ): ) [16,17]
    text_interpolation [18,20]
        ?>: ?> [18,20]

Notice that in this case the value associated to the string production is not in any child of the string typed node, as it was the case for the encapsed_string and it prevents me to associate the string value to the string node since I retain only the text for leaf nodes.

Is it a normal behavior? Would it be too much hassle to make it work like encapsed_string work?

Cheers!

OK I have checked with several other grammars and it seems a common behavior in fact for instance print("string") in python gives :

module [0,13]
    expression_statement [0,12]
        call [0,12]
            identifier: print [0,5]
            argument_list [5,12]
                (: ( [5,6]
                string [6,11]
                    ": " [6,7]
                    ": " [10,11]
                ): ) [11,12]

so I guess I will have to adjust the behavior of my tree transformer!

Thanks for reporting this! Would it be sufficient for your use case if we added a 'value' field to the actual string body?

I can totally live with a tree rewrite on my end, and I do not want to mess with the structure of your grammar if it does not make sense. But by looking into it, I think that maybe there is something a little strange in the current way of handling string, which make them not really homogeneous:

  • there is encapsed_string typed nodes with opening and closing " and stringchildren
  • there is string typed nodes with opening and closing ' children, not to be confounded with the string typed node that can be the child of an encapsed_string and have no child

Maybe it would be easier to process to have

  • encapsed_string typed nodes with opening and closing " and string_value children (or something like that)
  • string typed nodes with opening and closing ' and string_value children

Again, I don't want you to mess up the grammar, I don't have the full picture of the productions involving strings and maybe there are some additional children in some cases I did not see so if you think it does not make sense don't hesitate to close the issue ^^

Cheers!

Thank you for your detailed explanation and for being so kind and polite about it! ๐Ÿ™‚

What you're pointing out is an inconsistency I've been meaning to address, but haven't yet gotten around to. I think your proposal of adding a child node to string and reusing this inside encapsed_string where we're using string directly today is a good solution.

@maxbrunsfeld Any objections to adding a child node to our current string node? This will be a breaking change from previous releases when it comes to this particular node. However, note that we've had to make other (currently unreleased) breaking changes already, like the precedence change for .

(I'll also have to admit being the one who introduced this inconsistency when I was new to parser design/development)

Yeah, I'm fine with adding a string_value child, or something like that.

etu commented

I'm just guessing here since I'm new to tree-sitter,

I guess this is considered a correct result?
image

EDIT: This seems to be a difference between the first one being a string and the second one being a encapsed_string, so it's probably not a parser error. It's probably something missing somewhere else... Like somewhere in the Emacs wrapper for tree-sitter or maybe the theme I'm using for Emacs.