caxy/php-htmldiff

Slow diff even on small text input

arvindpdmn opened this issue · 7 comments

My strings are mostly text (actually markdown) and sometimes contain a few HTML tags.

Suppose I delete a couple of words near the start of a line. On the same line towards the end, I add a couple of words. This is one of the few diff programs I know that highlights changes at word level. That's cool!

But it's slow. It takes 40+ seconds for a small file of about 100 lines and 1500 words.

Anything I'm missing such as setting a config option properly? Thx

@arvindpdmn Unfortunately there is not right now - this library was focused solely on accuracy for its initial purpose, and we have not had the time to improve its efficiency. Can you put an example of a small file you’re working with in a gist and provide a link here? I’d like to see if I can find areas to improve performance.

The files are attached. The code to do the diff is below. I had to wrap the text within a div in order to preserve the newlines. I noticed that the diff process removes the newlines.

Also, this morning I integrated this code into my Joomla framework, where the diff output is written to a file rather than sent back to the browser. Surprisingly, it doesn't take 40+ seconds to do the diff when called within the framework. It takes maybe a couple of seconds! Not sure what's going on.

<?php

require __DIR__ . '/vendor/autoload.php';

use Caxy\HtmlDiff\HtmlDiff;

$oldHtml = file_get_contents('old.txt');
$newHtml = file_get_contents('new.txt');

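// Wrap each line in its own div so line breaks survive the HTML diff.
// The pattern also accepts bare \n or \r, so files without Windows line endings still split.
$oldHtml = '<div class="filler">' . preg_replace('/\r\n|\r|\n/', '</div><div class="filler">', $oldHtml) . '</div>';
$newHtml = '<div class="filler">' . preg_replace('/\r\n|\r|\n/', '</div><div class="filler">', $newHtml) . '</div>';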


$htmlDiff = new HtmlDiff($oldHtml, $newHtml);

$content = $htmlDiff->build();

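// Undo the wrapping. The replacement is double-quoted so it is a real line break;
// single-quoted '\r\n' would insert the literal characters backslash-r backslash-n.
$content = preg_replace('/<\/div><div class="filler">/', "\r\n", $content);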
$content = preg_replace('/^<div class="filler">/', '', $content);
$content = preg_replace('/<\/div>$/', '', $content);

echo $content;
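
To narrow down why the standalone script takes 40+ seconds while the same call inside Joomla takes a couple of seconds, one option is to time the `build()` call by itself in both environments. A minimal sketch using only core PHP; `timeCall` is a hypothetical helper, not part of caxy/php-htmldiff, and it avoids PHP 7 syntax so it also runs on 5.6:

```php
<?php
// Hypothetical helper (not part of caxy/php-htmldiff): run a callable and
// return its result together with the elapsed wall-clock time in seconds.
function timeCall(callable $fn)
{
    $start = microtime(true);
    $result = $fn();
    return array($result, microtime(true) - $start);
}

// With the script above, the build step could be wrapped like this:
// list($content, $seconds) = timeCall(function () use ($htmlDiff) {
//     return $htmlDiff->build();
// });
// error_log(sprintf('build() took %.2f s on PHP %s', $seconds, PHP_VERSION));
```

Logging `PHP_VERSION` next to the timing makes it easier to see whether the two environments are actually running the same interpreter.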

new.txt
old.txt

@arvindpdmn It might be good to know that this diffing library focuses on diffing HTML content, which is why your newlines might be removed.

If you are only diffing simple flat text (I based this assumption on your attachments), then I am pretty sure that you can find a simpler diffing library that will be faster, since it does not have all the overhead of trying to keep markup intact while calculating the changes.

Also, thank you for the feedback; it might be helpful in trying to speed up this library.
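
For flat text, word-level highlighting does not strictly need an HTML-aware differ. As a rough illustration only, here is a sketch of a word-level diff built on a longest-common-subsequence table. `wordDiff` is a hypothetical function, not an API of this or any other library, and it is not how caxy/php-htmldiff works internally. The table holds one integer per pair of tokens, so it is meant for short inputs such as one changed line at a time, not whole files:

```php
<?php
// Hypothetical helper: word-level diff of two plain-text strings.
// Unchanged tokens are emitted as-is; removed and added tokens are wrapped
// in <del> and <ins>. All tokens are HTML-escaped.
function wordDiff($old, $new)
{
    // Split on whitespace but keep the whitespace tokens, so spacing survives.
    $flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;
    $a = preg_split('/(\s+)/', $old, -1, $flags);
    $b = preg_split('/(\s+)/', $new, -1, $flags);
    $n = count($a);
    $m = count($b);

    // $lcs[$i][$j] is the LCS length of the suffixes $a[$i..] and $b[$j..].
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = ($a[$i] === $b[$j])
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }

    // Walk the table from the start, emitting each token once.
    $out = '';
    $i = 0;
    $j = 0;
    while ($i < $n || $j < $m) {
        if ($i < $n && $j < $m && $a[$i] === $b[$j]) {
            $out .= htmlspecialchars($a[$i]);
            $i++;
            $j++;
        } elseif ($j < $m && ($i === $n || $lcs[$i][$j + 1] >= $lcs[$i + 1][$j])) {
            $out .= '<ins>' . htmlspecialchars($b[$j]) . '</ins>';
            $j++;
        } else {
            $out .= '<del>' . htmlspecialchars($a[$i]) . '</del>';
            $i++;
        }
    }
    return $out;
}
```

A call such as `wordDiff('the quick fox', 'the slow fox')` marks `quick` as deleted and `slow` as inserted while leaving the surrounding words untouched.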

Thx. I realized that many other libraries simply do line-level diff and don't support word-level diff. That's why I selected this one. Plus, this project has so many stars and close to half a million installs as seen on https://packagist.org/packages/caxy/php-htmldiff !

Today I deployed the code on a remote server, but thankfully on staging and not on production. The diff is slow there, as reported earlier. Will investigate.

@arvindpdmn Make sure you are running at least PHP 7.0; that really makes a huge difference.

Hmm... version on server I'm using is 5.6.32
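
Since the CLI, the web server, and a staging host can each run a different PHP build, it may help to check the version from inside the script that does the diff. A small sketch using core PHP's `version_compare`; `phpIsAtLeast` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: true when $current (default: the running PHP) is at
// least $minimum. Written without PHP 7 syntax so it also runs on 5.6.
function phpIsAtLeast($minimum, $current = PHP_VERSION)
{
    return version_compare($current, $minimum, '>=');
}

if (!phpIsAtLeast('7.0.0')) {
    // Log rather than abort: the diff still works, it is just expected to be slower.
    error_log('Running PHP ' . PHP_VERSION . '; the diff is expected to be slower than on 7.0+.');
}
```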