caxy/php-htmldiff

Some diffs take far too long even with no multi-byte

Ambient-Impact opened this issue · 10 comments

Hi there. I've been using this on a Drupal project where we need to highlight the differences between the rendered output of two nodes, and it works alright for most of our content, but a few seem to take far longer to generate, and sometimes hit the PHP max execution time limit (30 seconds on the remote server, 120 on the local dev).

I've tried to figure out exactly what might be causing such a big variation in diff times, without success:

  • Made sure that the strings provided for diffing were not triggering the use of PHP multi-byte string functions. (See #57 and #77)

  • Disabled isolated list diffing.

  • Disabled almost all Drupal input filters, especially ones that added extra data and attributes to the output.
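
For anyone repeating the first check in the list above, here is a minimal sketch of one way to test whether a string contains multi-byte characters and to see which ones they are. The function names are hypothetical and not part of caxy/php-htmldiff. It assumes UTF-8 input and the `mbstring` extension, and whether this matches the library's own detection exactly is an assumption.

```php
<?php
// Hypothetical helper, not part of caxy/php-htmldiff.
// True when byte length and character length differ, i.e. the string
// contains at least one multi-byte UTF-8 character.
function containsMultibyte(string $html): bool
{
    return strlen($html) !== mb_strlen($html, 'UTF-8');
}

// Hypothetical helper: lists the distinct non-ASCII characters in the
// string, which helps track down what is triggering the multi-byte path.
function findMultibyteChars(string $html): array
{
    // The /u flag makes the class match whole code points, not single bytes.
    $count = preg_match_all('/[^\x00-\x7F]/u', $html, $matches);
    if ($count === false || $count === 0) {
        return []; // no matches, or the input was not valid UTF-8
    }
    return array_values(array_unique($matches[0]));
}
```

Running `findMultibyteChars()` on both sides of the diff can reveal stray characters such as typographic dashes or non-breaking spaces that are easy to miss by eye.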

My last resort if this can't be resolved is likely to be to do the diffing asynchronously, but I'd prefer to avoid having to implement that if I can. Any advice?

Hey @Ambient-Impact, if you can supply the left and right sides of the diff, I can do some deeper investigation for you.

Performance of the library should improve with newer versions, since I am actively working on it, especially for content that is not tables or lists.

One of the things that makes a huge difference is running at least PHP 7.2, but I suspect you are already on a recent version of PHP.

> Hey @Ambient-Impact, if you can supply the left and right sides of the diff, I can do some deeper investigation for you.

Here they are: left.html.txt and right.html.txt.

> Performance of the library should improve with newer versions, since I am actively working on it, especially for content that is not tables or lists.

Glad to hear it!

> One of the things that makes a huge difference is running at least PHP 7.2, but I suspect you are already on a recent version of PHP.

On my local machine, I'm running PHP 7.3.15, and on our server, we're running 7.4.14, so yes, we are. ;)

Thanks for the files; they will help me debug this. If I have any findings, I'll let you know.

@SavageTiger Thanks, I appreciate it. 🤓

I have been digging into the performance issues using a profiler, and I found that a lot of time is spent parsing the text into words, so I have been working on a new parser that should be more efficient. See the WIP PR #102, which still has a bunch of broken stuff in it.

In my testing, the performance group went from 900 ms to 600 ms, but some tests still fail, so I have to figure out what I broke.

Sounds like progress to me. Let me know if there's anything I can contribute, though I'm not as familiar with this codebase as you are.

Hey @Ambient-Impact, I just released a new version that is way faster.

I have been testing with the text you provided: on my local dev machine it took 16 seconds with version 0.1.11, and it takes 7.5 seconds with version 0.1.12.

See the screenshots in the PR.

In my case it used the multi-byte functions, since the text you provided required them.
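
Timings like the ones quoted in this comment can be collected with a small harness around the diff call. The sketch below is only an illustration: `timeJob()` is a hypothetical helper, and it relies on `hrtime()`, which requires PHP 7.3 or later.

```php
<?php
// Hypothetical timing helper: runs $job once and returns [result, seconds].
function timeJob(callable $job): array
{
    $start = hrtime(true);                     // monotonic clock, in nanoseconds
    $result = $job();
    $seconds = (hrtime(true) - $start) / 1e9;  // convert nanoseconds to seconds
    return [$result, $seconds];
}

// Stand-in job so the example runs on its own; in practice the closure would
// wrap the actual diff call on the left and right HTML.
[$output, $seconds] = timeJob(function () {
    return str_repeat('x', 1000);
});
printf("Job took %.6f seconds\n", $seconds);
```

Running the same inputs through the harness under each installed library version gives a like-for-like comparison, as long as both runs happen on the same machine and PHP version.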

Can confirm that the new version is indeed faster, even with multi-byte. I'll have to experiment with converting the multi-byte stuff to HTML entities to see if that makes a significant difference. Thanks for all the work so far!
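
The entity experiment mentioned here could look roughly like the sketch below, which replaces every non-ASCII character with a decimal numeric entity before diffing. `multibyteToEntities()` is a hypothetical name. It assumes UTF-8 input and the `mbstring` extension, and whether this actually speeds up the diff has not been measured here.

```php
<?php
// Hypothetical helper: replaces every non-ASCII character with a decimal
// numeric entity so the resulting string contains only single-byte ASCII.
function multibyteToEntities(string $html): string
{
    // Convmap format: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $convmap, 'UTF-8');
}

$converted = multibyteToEntities('<p>café – déjà vu</p>');
// After conversion the byte length and character length are identical.
var_dump(strlen($converted) === mb_strlen($converted, 'UTF-8'));
```

Browsers render the entities the same way as the original characters, but the strings get longer, so any gain from avoiding the multi-byte functions has to outweigh the extra input length.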

Closing this issue, since I think this is solved, or well, as solved as I could manage :)

We ended up implementing an asynchronous/preemptively cached system, so this turned out to be less crucial, but having each job take less time is definitely still very appreciated!
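
For readers considering the same route, here is a minimal sketch of a cached diff keyed on both inputs. It is not the system described above, whose details are not given in this thread. `cachedDiff()`, the file-based cache directory, and the `$differ` callable are all assumptions made for illustration.

```php
<?php
// Hypothetical cache wrapper. $differ stands in for whatever produces the
// diff, for example a closure around the HTML diff library.
function cachedDiff(string $left, string $right, callable $differ, string $cacheDir): string
{
    // The key depends on both inputs, so editing either side yields a new entry.
    $key = hash('sha256', hash('sha256', $left) . hash('sha256', $right));
    $file = $cacheDir . DIRECTORY_SEPARATOR . $key . '.html';

    if (is_file($file)) {
        $cached = file_get_contents($file);
        if ($cached !== false) {
            return $cached; // cache hit: skip the expensive diff
        }
    }

    $result = $differ($left, $right);
    file_put_contents($file, $result, LOCK_EX); // store for later requests
    return $result;
}
```

Calling this from a background job when content is saved would warm the cache ahead of time, so a page request only pays the diff cost if the entry is missing. In a Drupal project the framework's own cache and queue services would likely be a better fit than plain files.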