caxy/php-htmldiff

Possible diff algorithm improvement

maliayas opened this issue · 3 comments

I'm using the "Override Demo 5" in the demo and I get this result:

[Image: screenshot 2016-04-10 23_34_34]

Another diff app I tried gives this result for the same HTML:

[Image: screenshot 2016-04-10 23_34_46]

Note how it handles the first paragraph. I don't know how complex it is to implement, but it's a better algorithm for this example HTML. Thanks in advance.

@maliayas Thanks for opening this issue!

This is actually an undesired side-effect of one of the newer features we implemented - isolated tag diffing. I had not actually thought about this until now, so I'm very glad you opened this issue.

Basically, the isolated tag diffing compares the contents of the italic tags (emergency escape and rescue openings) separately from the rest of the content. This fixes a lot of the issues we had with the diff output not protecting the HTML structure.
The issue here is that, in order to diff them in isolation, we replace the entire tag with a placeholder "word" before we diff the content. The diffing algorithm is not aware of the length of the string that the placeholder represents, so it sees the placeholder as a single word. In this case it finds a longer match in "shall be" than in the placeholder.
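
To make the effect concrete, here is a minimal Python sketch of the idea, not the library's actual code. It assumes a made-up placeholder name (`[[ISOLATED_I]]`), made-up sample sentences, and uses Python's `difflib.SequenceMatcher` as a stand-in for the real match finder. It only shows why a five-word italic phrase loses to the two-word run "shall be" once it is collapsed into one placeholder word.

```python
import re
from difflib import SequenceMatcher

# Hypothetical placeholder; the real library's placeholder text may differ.
PLACEHOLDER = "[[ISOLATED_I]]"

ITALIC_ELEMENT = re.compile(r"<i>.*?</i>")  # a whole <i>...</i> element
ITALIC_TAG = re.compile(r"</?i>")           # just the opening/closing tags


def words_with_placeholder(html):
    """Collapse each <i>...</i> element into one placeholder word, then split."""
    return ITALIC_ELEMENT.sub(PLACEHOLDER, html).split()


def words_plain(html):
    """Drop the <i> tags but keep the words inside them, then split."""
    return ITALIC_TAG.sub("", html).split()


def longest_match(a, b):
    """Return the longest run of words common to both word lists."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


# Made-up sample text, chosen only to reproduce the effect described above.
old = "<i>emergency escape and rescue openings</i> are required and shall be maintained"
new = "Sleeping rooms shall be provided with <i>emergency escape and rescue openings</i>"

# With the placeholder, the italic phrase counts as one word,
# so the two-word run "shall be" is the longest match.
print(longest_match(words_with_placeholder(old), words_with_placeholder(new)))

# Without the placeholder, the five-word italic phrase is the longest match.
print(longest_match(words_plain(old), words_plain(new)))
```

One possible direction, under the same assumptions, would be to weight a placeholder by the number of words it stands for when comparing candidate matches.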

So, it will take a little bit of work, but it is certainly possible. This will be one of the higher priorities to tackle.

I see. Great explanation. If fixing this will break other stuff, don't worry about this issue. I understand that perfecting a diff library may be quite complex.

Btw, the demo tool is awesome.

Looping back around here - our highest priority for this library was the accuracy of the diff, so unfortunately performance took a back seat to it. However, we do like to leave that decision up to the end users when we can: the config option setIsolatedDiffTags defines which tags are diffed in isolation, and currently the defaults include the i and em tags.

I'll see if I can update the documentation this weekend to highlight the reasoning behind choosing this as the default option.

Closing this issue.