andreskrey/readability.php

PHP Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given

Closed this issue · 23 comments

blat commented

Hi,

This URL returns a Fatal Error: https://www.marketwatch.com//story//home-prices-are-still-on-fire-case-shiller-data-show-2018-03-27

$readability = new Readability(new Configuration());

$html = file_get_contents('https://www.marketwatch.com//story//home-prices-are-still-on-fire-case-shiller-data-show-2018-03-27');

try {
    $readability->parse($html);
    echo $readability;
} catch (ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}

Result is:

PHP Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php:324
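Note that the snippet above only catches ParseException, while the failure is a TypeError, so the script dies instead of reaching the catch block. A minimal sketch of a more defensive wrapper follows, assuming you would rather skip a broken article than crash. The helper name `parseOrNull` and its by-reference `$error` parameter are hypothetical and not part of readability.php.

```php
<?php

/**
 * Run any parsing callable and turn every failure, including engine-level
 * errors such as TypeError, into a null result.
 * Hypothetical helper for illustration; not part of readability.php.
 */
function parseOrNull(callable $parse, string $html, ?string &$error = null): ?string
{
    try {
        // \Throwable covers both \Exception and \Error (TypeError extends \Error).
        return (string) $parse($html);
    } catch (\Throwable $e) {
        $error = get_class($e) . ': ' . $e->getMessage();

        return null;
    }
}
```

With the library, the callable could wrap `$readability->parse($html)` and return `$readability`. This only hides the crash; it does not fix the underlying null childNodes.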

Can you test this with the latest development version? #83 looks very similar and might actually be the same issue I fixed with a4bd07a

blat commented

@andreskrey Same issue with the latest stable version and the dev-development branch :(

Could not reproduce: v2.0.1 and development both process the text correctly. This is the result I get (using the same configuration as you):

<div id="article-body" itemprop="articleBody">
                            &#xD;
                            &#xD;
                            &#xD;
                                                        &#xD;
                            &#xD;
                            &#xD;
                                                        &#xD;
                            &#xD;
                            &#xD;
                                                               <p> <strong>The numbers:</strong> The S&amp;P/Case-Shiller national index rose a seasonally adjusted 0.5% in the three-month period ending in January, and was up 6.2% compared to a year before. The 20-city index rose a seasonally adjusted 0.8% for the month, and 6.4% for the year.</p> <p> <strong>What happened: </strong>Prices are still on fire. And the West is still the best: Seattle, Las Vegas and San Francisco all notched double-digit yearly price gains. Only one city, Washington, D.C., had a negative monthly reading. </p> <p>As David Blitzer, chairman of the index committee at S&amp;P Dow Jones Indices, noted in a release, the price gains are all about demand and lack of supply. </p> <p>“The current months-supply — how many months at the current sales rate would be needed to absorb homes currently for sale — is 3.4; the average since 2000 is 6.0 months, and the high in July 2010 was 11.9,” Blitzer wrote. “Currently, the homeowner vacancy rate is 1.6% compared to an average of 2.1% since 2000; it peaked in 2010 at 2.7%. Despite limited supplies, rising prices and higher mortgage rates, affordability is not a concern.”</p> <p>Relatively affordable housing is cold comfort to many would-be home buyers who simply can’t find anything to buy. </p> <p> <strong>Read:</strong> <a href="/story/two-thirds-of-house-hunters-have-been-searching-for-3-months-or-more-2018-02-22">Most house hunters have been searching for 3 months or more</a> </p> <p> <strong>Big picture:</strong> Economists had forecast a 0.7% monthly increase, and a 6.2% 12-month increase, for the 20-city index. <a href="/story/why-its-so-hard-to-forecast-home-prices-for-2018-and-why-that-should-worry-you-2017-12-19">As MarketWatch has reported</a>, most housing analysts have argued that the ongoing price gains in housing can’t last — and yet they have so far. 
</p> <table readabilityDataTable="1"><tbody><tr><td colspan="" id=""> <strong>Metro</strong> </td> <td colspan="" id=""> <strong>Monthly change</strong> </td> <td colspan="" id=""> <strong>12-month change</strong> </td> </tr><tr><td colspan="" id="">Atlanta</td> <td colspan="" id="">0.7%</td> <td colspan="" id="">6.5%</td> </tr><tr><td colspan="" id="">Boston</td> <td colspan="" id="">0.2%</td> <td colspan="" id="">5.3%</td> </tr><tr><td colspan="" id="">Charlotte</td> <td colspan="" id="">0.4%</td> <td colspan="" id="">6.0%</td> </tr><tr><td colspan="" id="">Chicago</td> <td colspan="" id="">0.0%</td> <td colspan="" id="">2.4%</td> </tr><tr><td colspan="" id="">Cleveland</td> <td colspan="" id="">0.0%</td> <td colspan="" id="">3.5%</td> </tr><tr><td colspan="" id="">Dallas</td> <td colspan="" id="">0.2%</td> <td colspan="" id="">6.9%</td> </tr><tr><td colspan="" id="">Denver</td> <td colspan="" id="">0.7%</td> <td colspan="" id="">7.6%</td> </tr><tr><td colspan="" id="">Detroit</td> <td colspan="" id="">0.1%</td> <td colspan="" id="">7.6%</td> </tr><tr><td colspan="" id="">Las Vegas</td> <td colspan="" id="">0.6%</td> <td colspan="" id="">11.1%</td> </tr><tr><td colspan="" id="">Los Angeles</td> <td colspan="" id="">0.6%</td> <td colspan="" id="">7.6%</td> </tr><tr><td colspan="" id="">Miami</td> <td colspan="" id="">0.6%</td> <td colspan="" id="">4.0%</td> </tr><tr><td colspan="" id="">Minneapolis</td> <td colspan="" id="">0.1%</td> <td colspan="" id="">5.9%</td> </tr><tr><td colspan="" id="">New York</td> <td colspan="" id="">0.0%</td> <td colspan="" id="">5.2%</td> </tr><tr><td colspan="" id="">Phoenix</td> <td colspan="" id="">0.3%</td> <td colspan="" id="">5.9%</td> </tr><tr><td colspan="" id="">Portland</td> <td colspan="" id="">0.4%</td> <td colspan="" id="">7.1%</td> </tr><tr><td colspan="" id="">San Diego</td> <td colspan="" id="">0.8%</td> <td colspan="" id="">7.4%</td> </tr><tr><td colspan="" id="">San Francisco</td> <td colspan="" id="">0.4%</td> <td 
colspan="" id="">10.2%</td> </tr><tr><td colspan="" id="">Seattle</td> <td colspan="" id="">0.7%</td> <td colspan="" id="">12.9%</td> </tr><tr><td colspan="" id="">Tampa</td> <td colspan="" id="">0.4%</td> <td colspan="" id="">6.7%</td> </tr><tr><td colspan="" id="">Washington</td> <td colspan="" id="">-0.4%</td> <td colspan="" id="">2.4%</td> </tr></tbody></table><p>Read: <a href="/story/mortgage-rates-edge-up-even-as-trade-war-worries-loom-ahead-2018-03-22">Mortgage rates edge up even as trade war worries loom ahead</a></p> &#xD;
&#xD;
                            &#xD;
                            &#xD;
                                  &#xD;
      &#xD;
      &#xD;
      &#xD;
      &#xD;
      &#xD;
      &#xD;
      &#xD;
                                  &#xD;
                            &#xD;
                            &#xD;
                                  &#xD;
      &#xD;
      &#xD;
      &#xD;
      &#xD;
      &#xD;
      &#xD;
      &#xD;
      


                
                

                            &#xD;
                                    
                &#xD;
                


            </div>

Which is also the same result I get with the JS version. Are you sure you're not passing other configuration or using an older version?

blat commented

Hi,

Thanks for checking!

Yes, I'm sure I'm using the dev-development version (or v2.1.0), with no additional configuration.

Here is the full stack trace:

Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in /home/mickael/test/vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php:324
Stack trace:
#0 /home/mickael/test/vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php(324): iterator_to_array(NULL)
#1 /home/mickael/test/vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php(421): andreskrey\Readability\Nodes\DOM\DOMText->getChildren(true)
#2 /home/mickael/test/vendor/andreskrey/readability.php/src/Readability.php(1283): andreskrey\Readability\Nodes\DOM\DOMText->hasSingleTagInsideElement('tr')
#3 /home/mickael/test/vendor/andreskrey/readability.php/src/Readability.php(1179): andreskrey\Readability\Readability->prepArticle(Object(andreskrey\Readability\Nodes\DOM\DOMDocument))
#4 /home/mickael/test/vendor/andreskrey/readability.php/src/Readability.php(162): andreskrey\Readability\Readability->rateNodes(Array)
#5 /home/mickael/test/test.php(14): andreskrey\Readability\Readability->parse('<!DOC in /home/mickael/test/vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php on line 324

I'm using PHP 7.3.6.

I just checked on my server with PHP 7.2.12, and it works as you describe.

blat commented

Hi again,

I tried to downgrade PHP on my laptop, still broken:

$ php -v
PHP 7.2.12 (cli) (built: Nov  6 2018 15:07:37) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies

$ php test.php
Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in /home/mickael/test/vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php:324

On my server, it works with:

DOM/XML API Version => 20031129
libXML Compiled Version => 2.9.4
libmbfl version => 1.3.2

On my laptop, it's broken with:

DOM/XML API Version => 20031129
libXML Compiled Version => 2.9.8
libmbfl version => 1.3.2

Maybe due to a change in libXML?
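To compare environments without grepping `php -i`, the libxml2 version can also be read from inside PHP. A small sketch, assuming the standard `LIBXML_DOTTED_VERSION` and `LIBXML_LOADED_VERSION` constants from the libxml extension; the helper name `libxmlAtLeast` is made up for this example.

```php
<?php

/**
 * True when PHP was compiled against libxml2 >= $minimum (e.g. "2.9.8").
 * Hypothetical helper for illustration.
 */
function libxmlAtLeast(string $minimum): bool
{
    return version_compare(LIBXML_DOTTED_VERSION, $minimum, '>=');
}

// LIBXML_DOTTED_VERSION is the compile-time version, LIBXML_LOADED_VERSION
// is the version of the library actually loaded at runtime.
printf(
    "libxml2 compiled: %s, loaded: %s\n",
    LIBXML_DOTTED_VERSION,
    LIBXML_LOADED_VERSION
);
```

Running this on each machine shows at a glance whether the compiled and loaded versions differ between the working and the broken setups.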

Huh, seems that you're using an unreleased version? Latest version in xmlsoft is 2.9.7! http://www.xmlsoft.org/news.html

Might be the reason. I'll try to compile it inside the docker container...

blat commented

I guess their website is not up to date.

2.9.8 was released last year: https://gitlab.gnome.org/GNOME/libxml2/tree/v2.9.8
and 2.9.9 this year!

blat commented

I downgraded libXML to 2.9.4 on my laptop and recompiled PHP:

$ php -i | grep libXML
libXML support => active
libXML Compiled Version => 2.9.4
libXML Loaded Version => 20904
libXML streams => enabled

$ php test.php 
Fatal error: Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in /home/mickael/test/vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php:324

The truth is out there...

blat commented

I checked on all the environments I have access to.

My workflow is simple:

$ mkdir test
$ cd test
$ composer clearcache
$ composer require andreskrey/readability.php -v dev-development

Then rsync my test.php file containing only:

<?php

require_once dirname(__FILE__) . '/vendor/autoload.php';

use andreskrey\Readability\Readability;
use andreskrey\Readability\Configuration;
use andreskrey\Readability\ParseException;

$readability = new Readability(new Configuration());

$html = file_get_contents('https://www.marketwatch.com//story//home-prices-are-still-on-fire-case-shiller-data-show-2018-03-27');

try {
    $readability->parse($html);
    echo $readability;
} catch (ParseException $e) {
    echo sprintf('Error processing text: %s', $e->getMessage());
}

Finally, just run php test.php

Result:

  • Archlinux + PHP 7.3.6 => Fatal Error
  • Debian (Stretch) + PHP 7.2.12 => OK
  • Debian (Stretch) + PHP 7.2.17 => OK
  • Archlinux + PHP 7.2.17 => Fatal Error
  • FreeBSD (11.2) + PHP 7.3.6 => Fatal Error

Gotcha, libxml 2.9.8 seems to be the problem. In fact, most of the test cases fail with 2.9.8:

root@20b5378d7e68:/app# php vendor/phpunit/phpunit/phpunit 
PHPUnit 6.5.14 by Sebastian Bergmann and contributors.

...FEFF.FFFF.FEFFF.EFFF.FFFFFFFFFF.EEFFFF.FFFFFEF.F.FFFFF.F.FFF  63 / 270 ( 23%)
FFFFF..FFFE.FF.EFF.FFFFFFFF..E.........E....E...............EE. 126 / 270 ( 46%)
.........E.........................EF...E.............E........ 189 / 270 ( 70%)
.E....E...............EE..........E.........................E.. 252 / 270 ( 93%)
..E...............                                              270 / 270 (100%)

Time: 15.6 seconds, Memory: 68.00MB

Most of the differences are just whitespace, but some tests throw the same error you're getting. With the article you provided, the problem occurs while looping over a table and trying to access childNodes. On 2.9.7, childNodes is never null... So that's the answer.

I'll see how to fix this. The problem can be easily fixed. Supporting both versions (.7 and .8) is a totally different story...

Confirmed, the issue is about whitespace. 2.9.7 seems to trim whitespace while 2.9.8 leaves it in place. When readability tries to get the childNodes of a whitespace node (which is essentially a DOMText node, and so can't have child nodes), it gets null and fails.
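To illustrate the failure mode: the stack trace shows hasSingleTagInsideElement('tr') being called on a DOMText node. A standalone sketch of that kind of check on plain DOM nodes, refusing to descend into anything that is not an element, could look like the following. This is not the library's implementation; the function name `hasSingleTagInside` is hypothetical.

```php
<?php

/**
 * True when $node is an element whose only meaningful child is a single
 * <$tag> element (whitespace-only text around it is ignored).
 * Hypothetical standalone sketch, not the readability.php method.
 */
function hasSingleTagInside(DOMNode $node, string $tag): bool
{
    // Text nodes (including whitespace-only ones) cannot contain tags,
    // so bail out before touching childNodes at all.
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return false;
    }

    $elements = [];
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $elements[] = $child;
        } elseif ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) !== '') {
            // Real text next to the tag means it is not the only content.
            return false;
        }
    }

    return count($elements) === 1
        && strtolower($elements[0]->nodeName) === strtolower($tag);
}
```

Because the node type is checked first, the result no longer depends on whether the parser kept or dropped whitespace-only text nodes between table elements.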

blat commented

Ok, so I can go to prod with the latest stable version as long as I don't upgrade libxml.
Thanks for the investigation and confirmation!

As a workaround you can apply this diff:

diff --git a/src/Nodes/NodeTrait.php b/src/Nodes/NodeTrait.php
index 9ef1fa2..9a4654b 100644
--- a/src/Nodes/NodeTrait.php
+++ b/src/Nodes/NodeTrait.php
@@ -320,12 +320,16 @@ trait NodeTrait
      * @return array
      */
     public function getChildren($filterEmptyDOMText = false)
-    {
+    {
+        if (null === $this->childNodes) {
+            return [];
+        }
+
         $ret = iterator_to_array($this->childNodes);
         if ($filterEmptyDOMText) {
             // Array values is used to discard the key order. Needs to be 0 to whatever without skipping any number
             $ret = array_values(array_filter($ret, function ($node) {
-                return $node->nodeName !== '#text' || mb_strlen(trim($node->nodeValue));
+                return $node->nodeType !== XML_TEXT_NODE || mb_strlen(trim($node->nodeValue));
             }));
         }

But I'm working on removing getChildren altogether. It's redundant and can be replaced with ->childNodes and a utility that filters empty nodes.
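A rough sketch of what such a filtering utility might look like, written against plain DOM classes. The function name `filterEmptyTextNodes` and the choice to accept null are assumptions for illustration, not the code that ended up in the library.

```php
<?php

/**
 * Return the nodes of a DOMNodeList as a 0-indexed array, dropping text
 * nodes that contain only whitespace. Accepts null so that callers can pass
 * ->childNodes of a node that cannot have children.
 * Hypothetical utility for illustration.
 */
function filterEmptyTextNodes(?DOMNodeList $nodes): array
{
    if ($nodes === null) {
        return [];
    }

    $kept = array_filter(iterator_to_array($nodes), function (DOMNode $node) {
        return $node->nodeType !== XML_TEXT_NODE || trim($node->nodeValue) !== '';
    });

    // array_values re-indexes from 0 so no keys are skipped.
    return array_values($kept);
}
```

A call site would then read `filterEmptyTextNodes($node->childNodes)`. Making the parameter nullable is a design choice: it keeps the null check in one place instead of at every caller.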

Pushed all changes to develop branch. That should be safe for you to work with. Now I have to figure out how to make the test compatible with newer versions of libxml2...

blat commented

Thanks @andreskrey for the workaround, it works!

Just to let you know, with the develop branch, I get a new error:

Fatal error: Uncaught TypeError: Argument 1 passed to andreskrey\Readability\Nodes\NodeUtility::filterTextNodes() must be an instance of DOMNodeList, null given, called in vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php on line 425 and defined in vendor/andreskrey/readability.php/src/Nodes/NodeUtility.php:169
Stack trace:
#0 vendor/andreskrey/readability.php/src/Nodes/NodeTrait.php(425): andreskrey\Readability\Nodes\NodeUtility::filterTextNodes(NULL)
#1 vendor/andreskrey/readability.php/src/Readability.php(1284): andreskrey\Readability\Nodes\DOM\DOMText->hasSingleTagInsideElement('tr')
#2 vendor/andreskrey/readability.php/src/Readability.php(1180): andreskrey\Readability\Readability->prepArticle(Object(andreskrey\Readability\Nodes\DOM\DOMDocument))
#3 vendor/andreskrey/readability.php/src/Readability.php(162): andreskrey\Readability\Readability->rateNodes(Array)
#4 test.php(14): an in vendor/andreskrey/readability.php/src/Nodes/NodeUtility.php on line 169

Same HTML?

blat commented

Same URL: https://www.marketwatch.com//story//home-prices-are-still-on-fire-case-shiller-data-show-2018-03-27

I don't know if the content has changed during the last month.

It seems I am having the same or a similar issue, using it via a Tiny Tiny RSS extension on a fresh box running Ubuntu 18.04.01 LTS with PHP 7.3.7 and LibXML 2.9.4, as far as I can see. I'm not an expert on any of this, so I'm not sure how I can help.

I'm also not yet sure which URL/domain/page is being tried... there might be several, as I see a handful of entries in the error log that look like this:

Uncaught TypeError: Argument 1 passed to iterator_to_array() must implement interface Traversable, null given in /home/plesk/vhosts/site.net/httpdocs/tt-rss/plugins/af_readability/vendor/andreskrey/Readability/Readability.php:1274
Stack trace:
#0 /home/plesk/vhosts/site.net/httpdocs/tt-rss/plugins/af_readability/vendor/andreskrey/Readability/Readability.php(1274): iterator_to_array(NULL)
#1 /home/plesk/vhosts/site.net/httpdocs/tt-rss/plugins/af_readability/vendor/andreskrey/Readability/Readability.php(1166): andreskrey\Readability\Readability->prepArticle(Object(andreskrey\Readability\Nodes\DOM\DOMDocument))
#2 /home/plesk/vhosts/site.net/httpdocs/tt-rss/plugins/af_readability/vendor/andreskrey/Readability/Readability.php(155): andreskrey\Readability\Readability->rateNodes(Array)
#3 /home/plesk/vhosts/site.net/httpdocs/tt-rss/plugins/af_readability/init.php(188): andreskrey\Readability\Readability->parse('<!DOCTYPE HTML>...')
#4 /home/plesk/vhosts/site.net/httpdocs/tt-rss/plugins/af_readability/init.php(220):

Fixed. Please pull develop and parse the html again. Let me know how it goes.

Making some progress!

root@17f612b5a67e:/app#   php vendor/phpunit/phpunit/phpunit
PHPUnit 6.5.14 by Sebastian Bergmann and contributors.

...FFFF.FFFF.FFFFF.FFFF.FFFFFFFFFF.FFFFFF.FFFFFFFF.F.FFFFF.F.FF  63 / 273 ( 23%)
FFFFFF..FFFF.FF.FFF.FFFFFFFF................................... 126 / 273 ( 46%)
......................................F........................ 189 / 273 ( 69%)
............................................................... 252 / 273 ( 92%)
..................... 

(errors are gone, only failures now)

Sort of fixed in version 2.1.0

Still need to figure out a better way to compare HTMLs. Closing.
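One possible direction for a whitespace-tolerant comparison, sketched under the assumption that two outputs should count as equal when they differ only in whitespace: parse both, collapse whitespace inside text nodes, drop whitespace-only text nodes, and compare the re-serialized markup. The function names `normalizeHtml` and `htmlEquals` are hypothetical, and this is not what the test suite actually does.

```php
<?php

/**
 * Parse HTML and re-serialize it with whitespace normalized:
 * runs of whitespace (including \r) become one space, text is trimmed,
 * and whitespace-only text nodes are removed.
 * Hypothetical helper for illustration.
 */
function normalizeHtml(string $html): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy to an array first so removing nodes cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query('//text()')) as $text) {
        $collapsed = trim(preg_replace('/\s+/', ' ', $text->data) ?? $text->data);
        if ($collapsed === '') {
            $text->parentNode->removeChild($text);
        } else {
            $text->data = $collapsed;
        }
    }

    return $doc->saveHTML();
}

/** Compare two HTML strings while ignoring whitespace-only differences. */
function htmlEquals(string $expected, string $actual): bool
{
    return normalizeHtml($expected) === normalizeHtml($actual);
}
```

Trimming every text node is deliberately lenient: it also treats `<p>a <b>b</b></p>` and `<p>a<b>b</b></p>` as equal, which may or may not be acceptable for a given test suite.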

@andreskrey Just wanted to say thank you for sorting this out! I just found time to update to your latest build, and at least the active live issue I was looking at no longer seems to crash. I will review any previous issues and open a new issue in case something is still needed, but it looks good. Thanks!

Happy for you! Glad you find the new version useful.