Processing time is very long on large diffs
nvahalik opened this issue · 9 comments
This is a great module. In most cases it works absolutely flawlessly.
However, for big diffs, where the combined new and old text lengths exceed 65K (in my case it was 128k worth of HTML), processing time becomes very long.
Our workaround was simply to do some pre-optimization to get the processing time down to under 30s (this was for ~9k and ~10k old/new blocks). We removed some inlined images and de-duplicated elements within the blocks of HTML.
~40k of HTML combined (old + new) took around 316 seconds, and ~60k didn't complete within 10 minutes.
Hey nvahalik,
Can you PM me a long-running HTML old and new text? I'd like to do some profiling.
Also, maybe create a feature request for ignoring inline images. I ran into the same thing some time ago.
@nvahalik Sorry we haven't had any action on this one yet - I created a new issue for a feature to ignore inline images.
As far as overall performance, that is definitely one of the top items on the priority list. While building upon the original library, performance took a back seat to accuracy and human readability, and the original use case for us was comparing small snippets of HTML, so unfortunately the lack of priority on performance shows when processing large diffs.
Aside from removing inline images, was there any other pre-processing that you ended up doing that would be beneficial to include in php-htmldiff itself?
Well, another improvement we made was to look for blocks of code that were equal and then hash and remove those from consideration. I've included the code I used below. The idea here was that the regular expressions would find and test the biggest "blocks" first and then replace those as they went down. Smaller blocks would get passed over and it would do this over the entire document. At least... that was the intention. Yeah, it's not pretty. However, on some tables it offered a huge performance gain for us.
// Initialize the hash => content map before the loop.
$thashes = [];

// Greedy .* is intentional: it matches the biggest block for each tag first.
// The /s modifier lets . span newlines so multi-line blocks are matched too.
preg_match_all('/(<(p|li|table|ol)>.*<\/\2>)/s', $oldHtml, $pmatches);
foreach ($pmatches[1] as $content) {
    if (trim($content) == '') {
        continue;
    }
    // Only hash blocks that appear verbatim in both documents.
    if (strpos($newHtml, $content) !== false) {
        $hash = '##' . md5($content) . '##';
        $thashes[$hash] = $content;
        $newHtml = str_replace($content, $hash, $newHtml);
        $oldHtml = str_replace($content, $hash, $oldHtml);
    }
}
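For completeness, after the diff has run, the placeholders would need to be expanded back in the output. A minimal sketch, assuming $thashes holds the hash-to-content map built above (restorePlaceholders is a hypothetical helper, not part of php-htmldiff):

```php
<?php
// Swap every placeholder back to its original block in the diff output.
// $thashes was built above: ['##<md5>##' => '<original html>', ...]
function restorePlaceholders(string $diffHtml, array $thashes): string
{
    // strtr() replaces all markers in a single pass over the string.
    return strtr($diffHtml, $thashes);
}
```

Because the markers are unique MD5-based tokens, a single strtr() pass is enough.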
I found splitting HTML with regex to be unreliable - it leaves huge chunks of HTML intact. What I used instead is PHP's DOMDocument. The idea behind it is the same, but it splits the HTML correctly.
$tags = ['p', 'li', 'table', 'ol'];
$thashes = [];

libxml_use_internal_errors(true); // avoid warnings on real-world HTML
$dom = new \DOMDocument();
$dom->loadHTML($oldHtml);
foreach ($tags as $tag) {
    $elements = $dom->getElementsByTagName($tag);
    foreach ($elements as $element) {
        // Note: textContent strips the tags, so only the plain text is hashed.
        $content = $element->textContent;
        if (trim($content) == '') {
            continue;
        }
        if (strpos($newHtml, $content) !== false) {
            $hash = '##' . md5($content) . '##';
            $thashes[$hash] = $content;
            $newHtml = str_replace($content, $hash, $newHtml);
            $oldHtml = str_replace($content, $hash, $oldHtml);
        }
    }
}
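One caveat with the snippet above: $element->textContent strips the tags, so two different elements carrying the same text would collide on the same hash. A variation that hashes the element's full markup via DOMDocument::saveHTML($node) sidesteps that. This is just a sketch under the same assumptions, not something php-htmldiff ships; note that DOMDocument normalizes markup on load, so it works best when both sides have been serialized the same way beforehand:

```php
<?php
// Hash whole elements (markup included) that appear verbatim in both documents.
// Returns the rewritten old/new HTML plus the hash => content map.
function hashCommonBlocks(string $oldHtml, string $newHtml): array
{
    libxml_use_internal_errors(true); // silence warnings on real-world HTML
    $thashes = [];
    $dom = new \DOMDocument();
    $dom->loadHTML($oldHtml);
    foreach (['table', 'ol', 'p', 'li'] as $tag) {
        // Copy the live NodeList first; iterating it while mutating is fragile.
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $element) {
            if (trim($element->textContent) === '') {
                continue;
            }
            $content = $dom->saveHTML($element); // full markup, tags included
            if (strpos($newHtml, $content) !== false) {
                $hash = '##' . md5($content) . '##';
                $thashes[$hash] = $content;
                $newHtml = str_replace($content, $hash, $newHtml);
                $oldHtml = str_replace($content, $hash, $oldHtml);
            }
        }
    }
    return [$oldHtml, $newHtml, $thashes];
}
```

Processing the biggest container tags (table, ol) before p and li keeps nested blocks from being hashed twice.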
@medis @nvahalik It would be great to include some of these performance improvements you have found in our library - I will be happy to review and accept contributions if people are up for it.
I will leave this issue open because it is certainly something we can improve upon, and its visibility is helpful for people evaluating if this library is a good fit for their project.
I'd like to get some of these improvements in when I can.
@jschroed91 No doubt. I haven't messed with this in a long time and I had to go back and find the code archive to revisit this, but in addition to the above, we did the following as well: replaced all inline images with hashes of their content.
As noted above, you pretty much have the code that has been running for a couple of years now.
I think if I had the chance to review it all over again, I'd probably walk the trees, hash the results from each of the walks, and then only compare the hashes that don't match. Additionally, I'd take the hashes and set them up in a flat map, and then compare the maps to find where the changes were.
But that's just me—I haven't studied this or anything. I'm just trying to apply what I'd do manually to how the code would work.
Old document, with each subtree's hash shown as a comment, and its flat map:

<div> <!--ab93ac-->
  <p>Some Text</p> <!--fa54a2-->
</div>

ab93ac
fa54a2

New document (same subtree wrapped in an extra div) and its flat map:

<div> <!--94bb32-->
  <div> <!--ab93ac-->
    <p>Some Text</p> <!--fa54a2-->
  </div>
</div>

94bb32 <-- inserted
ab93ac
fa54a2
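That tree-walk idea could be sketched roughly like this (walkAndHash is a hypothetical helper, not part of php-htmldiff): hash each element's serialized markup while walking the tree, building a flat map per document, then diff the maps to locate changed subtrees.

```php
<?php
// Walk every element of a document and build a flat map of
// md5(subtree markup) => node path. Hashes present in only one
// document's map point at the inserted/changed subtrees.
function walkAndHash(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new \DOMDocument();
    $dom->loadHTML($html);
    $map = [];
    $walk = function (\DOMNode $node) use (&$walk, $dom, &$map) {
        foreach ($node->childNodes as $child) {
            if ($child instanceof \DOMElement) {
                $map[md5($dom->saveHTML($child))] = $child->getNodePath();
                $walk($child);
            }
        }
    };
    $walk($dom);
    return $map;
}
```

Subtrees whose hashes appear in both maps are unchanged and can be skipped; array_diff_key(walkAndHash($newHtml), walkAndHash($oldHtml)) then marks the inserted or changed subtrees, which is exactly the 94bb32 case in the example above.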
I help maintain a Drupal module that uses this library. Both reported timeouts were in HtmlDiff::isOnlyWhitespace():
return $str !== '' && (mb_strlen(trim($str)) === 0);
That used to be:
return $str !== '' && (strlen(trim($str)) === 0);
The mb_* functions seem to be causing significant overhead; this was also noted in #77.
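Worth noting: since the check only asks whether the trimmed string is empty, the length call (multibyte or not) is redundant, because mb_strlen() returns 0 only for the empty string. A hedged simplification of the equivalent logic (not what the library ships):

```php
<?php
// Equivalent to: $str !== '' && (mb_strlen(trim($str)) === 0)
// mb_strlen($s) === 0 holds exactly when $s === '', so comparing
// against '' directly skips the multibyte length scan entirely.
function isOnlyWhitespace(string $str): bool
{
    return $str !== '' && trim($str) === '';
}
```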
This file is pretty big and on v0.1.5 this took 21 seconds. After upgrading to v0.1.7 the comparison took 26 minutes.
Hope this helps
See PR: #81
Changes from PR #81 are included in new release v0.1.9