scotteh/php-dom-wrapper

children() (and thus html()) does not return (direct child) textnodes

Closed this issue · 8 comments

I was working with a few elements, trying to figure out if the element was empty, or creating a new element and copying all contents from the old element into it. For this, I used html(), e.g.

if ($e->html() == '')
    do_stuff...

and

$new_e->html($e->html());

I found that this worked fine for elements containing other elements (e.g. <a><img/></a>), but not for elements containing (only) text (e.g. <a>foo</a>). In the latter case, html() would return the empty string. Text contained inside child nodes was returned properly, it was only text directly below the element examined that is missing.

I dug into this, and found this is because html() iterates over children() and generates HTML for each child, but children() does return text nodes, only elements. I traced this to this line:

$node->findXPath('child::*')

The * selector used there, matches all elements, which does not include text nodes. Changing this line to:

$node->findXPath('child::node()')

fixes this and lets children() (and thus also html()) also return text nodes.

All this was tested using version 0.6.3, but for lack of a ready PHP7.1 installation, I was not able to test with the latest master, or prepare a (tested) pull request. However, the relevant code is identical, so I assume the same applies to master as well.

Should this be fixed? Or is this behaviour intentional?

The method you are probably looking for is ‘contents()’ and should be supported in older versions.

‘contents()’ will include all nodes while ‘children()’ will only match element nodes (no text or comment nodes)

Ah, that indeed looks like what I need, thanks! Also thanks for the superswift reply. I was just about to "fix" this in our vendored php-dom-wrapper source, but now I can use the proper solution without modifying the library.

I just looked at the docs, but it seems both methods are documented identically, so that turns this into a documentation bug. I would propose a fix, but I'm a bit confused about contents() and how it handles removed nodes (

if($node->isRemoved())
return $this->newNodeList();
). In particular, I wonder what a removed node means exactly, and why the presence of such a removed node prevents all previous nodes from appearing in the output?

Hm, trying to implement this, I realized that I'm dependent on html(), rather than children() directly, so I can't just switch to contents(). Thinking on this more: Shouldn't html() call contents() instead of children() to include text nodes in its output?

Suspect you are right :)

With isRemove If there is a reference to an object that has been removed from the DOM that check essentially prevents an error/exception.

With isRemove If there is a reference to an object that has been removed from the DOM that check essentially prevents an error/exception.

But shouldn't it be sufficient to just skip the current element (e.g. return $carry) rather than throwing away the contents collected so far? Or perhaps I'm misunderstanding how reduce works here :-)

Yep, right you are. Thanks for running your keen eyes over the code :)

Should be straight forward for you to backport these changes.

Looks good, thanks!

I ran into this issue again, but now when working with create. For example, when you do this:

$document->create("<span>XXX</span>YYY");

I get a nodelist returned with just the span, not the "YYY" text node. This is because inputAsHtml also uses * to return only elements, not all nodes:

$nodes = $doc->html($html)->find('body > *');

For children(), it is apparently intentional to only return elements (and there is an alternative - contents()), but for create() (and probably also most if not all other uses of inputAsHtml), this does not seem like the wanted behavior. Would it make sense to do something like:

--- a/src/Traits/ManipulationTrait.php
+++ b/src/Traits/ManipulationTrait.php
@@ -109,7 +109,7 @@ trait ManipulationTrait
         $class = get_class($this->document());
         $doc = new $class();
         $doc->setEncoding($this->document()->getEncoding());
-        $nodes = $doc->html($html)->find('body > *');
+        $nodes = $doc->html($html)->findXPath('//body/node()');
 
         return $nodes;
     }

Looking through the code, it seems that there are two more uses of the * xpath selector: children(), which is ok, and inputAsFirstNode(). The latter is protected and only called from the wrap functions, which probably needs elements, so I guess that's also ok.

Update: Updated proposed fix to use findXPath instead of find, since node() is an XPath construct. Should also be a bit faster as well. I tested this updated version on an older version of php-dom-wrapper, but looking at the code I expect this will work on the latest version as well.