paquettg/php-html-parser

DOM Cleaner: mb_eregi_replace errors out with retry-limit-in-match

half0wl opened this issue · 2 comments

Reproduction:

>>> use PHPHtmlParser\Dom;
>>> $dom = new Dom;
>>> $dom->loadFromUrl("https://casper.com/gifts/?clickid=T02U6OVQYxyLUbdwUx0Mo36dUkB1HNWwiSMnwQ0");

Throws:

PHP Warning:  mb_eregi_replace(): mbregex search failure in php_mbereg_replace_exec(): retry-limit-in-match over in <stripped>/paquettg/php-html-parser/src/PHPHtmlParser/Dom/Cleaner.php on line 81
PHPHtmlParser\Exceptions\LogicalException with message 'mb_eregi_replace returned false instead of a string. Error when attempting to remove scripts 2.'

I tried ini_set("pcre.backtrack_limit", "10000000000") after some Googling on the error, but it made no difference.

I can reproduce this on pages with huge <script></script> tags, typically when they contain a giant JSON blob.

I have the exact same problem, but with a different URL. I quick-fixed it by disabling script removal from the HTML: $dom->setOptions((new Options())->setRemoveScripts(false)); I would rather have a real fix, though, especially because there's a warning that keeping script tags could have unforeseen consequences.

Any help on this issue, please, @paquettg?

OK, I've fixed it without disabling script removal by increasing the mbstring regex retry limit to 10 million. The comments in php.ini describe the directive:

; This directive specifies maximum retry count for mbstring regular expressions. It is similar
; to the pcre.backtrack_limit for PCRE.
; Default: 1000000
;mbstring.regex_retry_limit=1000000

so I've used

ini_set("mbstring.regex_retry_limit", "10000000");

and everything works fine on this front now.
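
For anyone who would rather not raise the limit for the whole request, here is a minimal sketch that raises mbstring.regex_retry_limit only around the parse and then restores the previous value. withRegexRetryLimit is a hypothetical helper name, not part of php-html-parser, and the sketch assumes the directive can be changed at runtime in your PHP build, as it could in the comment above.

```php
<?php

/**
 * Run $callback with mbstring.regex_retry_limit temporarily set to $limit,
 * then restore the previous value. Hypothetical helper for illustration;
 * it is not part of php-html-parser.
 */
function withRegexRetryLimit(int $limit, callable $callback)
{
    // ini_set() returns the old value as a string, or false on failure
    // (for example, when the directive is not available in this build).
    $previous = ini_set('mbstring.regex_retry_limit', (string) $limit);

    try {
        return $callback();
    } finally {
        // Only restore if the earlier ini_set() call actually succeeded.
        if ($previous !== false) {
            ini_set('mbstring.regex_retry_limit', $previous);
        }
    }
}

// Usage with the parser from this issue (assumes the library is installed
// and $url holds the page to load):
//
// $dom = withRegexRetryLimit(10000000, function () use ($url) {
//     $dom = new PHPHtmlParser\Dom();
//     $dom->loadFromUrl($url);
//     return $dom;
// });
```

This keeps the higher limit scoped to the one call that needs it, so other mbstring regex calls in the same request still run under the default.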