caxy/php-htmldiff

Different results

mandron opened this issue · 3 comments

Hi!

$old = "текст tekst";
$new = "тест test"; // Every word has only one character changed

v0.1.6

// the result is (php-htmldiff.caxy.com shows the same result)
те<del class="diffdel">к</del>ст // The change is detected inside the Cyrillic word
<del class="diffmod">tekst</del><ins class="diffmod">test</ins> // but not inside the Latin word

Is it possible to get the same kind of result for the Latin word as for the Cyrillic one?

v0.1.7

// the result is
<del class="diffmod">текст tekst</del><ins class="diffmod">тест test</ins></span>

// with groupDiff=false
<del class="diffmod">текст</del><ins class="diffmod">тест</ins>
<del class="diffmod">tekst</del><ins class="diffmod">test</ins>

Hi @mandron!

So, currently this library does not support diffing down to the character level; it shows differences word by word. However, it can treat special characters (commas, periods, etc.) as their own words.
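To make the word-by-word behaviour concrete, here is a minimal sketch of that kind of tokenization. The function name `tokenizeWords()` and the default list of special characters are hypothetical and chosen for illustration; this is not the library's actual tokenizer.

```php
<?php
// Hypothetical helper for illustration; not part of php-htmldiff.
// Splits text into "words", emitting whitespace and each special
// character as a separate token.
function tokenizeWords(string $text, array $specialChars = ['.', ',', '(', ')']): array
{
    $tokens = [];
    $current = '';

    // The /u flag makes PCRE read the input as UTF-8, so each array
    // element is one character rather than one byte.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $isSpace = preg_match('/^\s$/u', $char) === 1;
        $isSpecial = in_array($char, $specialChars, true);

        if ($isSpace || $isSpecial) {
            // Flush the word collected so far, then emit the separator itself.
            if ($current !== '') {
                $tokens[] = $current;
                $current = '';
            }
            $tokens[] = $char;
        } else {
            $current .= $char;
        }
    }

    if ($current !== '') {
        $tokens[] = $current;
    }

    return $tokens;
}

print_r(tokenizeWords('Hello, world.'));
print_r(tokenizeWords('тест test'));
```

With tokens like these, a diff can only report that `tekst` was replaced by `test` as a whole, because the word is the smallest unit being compared.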

I'm guessing the reason it worked that way for the Cyrillic word in v0.1.6 is that the library was not handling encoding properly (which was fixed in #72). It was treating those characters as "special" characters and therefore treating each one as a separate word, so текст was seen as 5 separate words: т, е, к, с, т.
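As a rough illustration of how an encoding mismatch can produce this effect, the sketch below compares byte-oriented and UTF-8-aware handling of the two words from this issue. The helper names are hypothetical, and the ASCII-only letter check is an assumption made for the example; it is not taken from the library's source.

```php
<?php
// Hypothetical helpers for illustration; not part of php-htmldiff.

// Counts the units a string splits into when read as raw bytes
// versus as UTF-8 characters.
function countUnits(string $text): array
{
    return [
        'bytes' => count(str_split($text)),
        'chars' => count(preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY)),
    ];
}

// An ASCII-only "is this a word?" check (assumed for the example).
// Bytes of multi-byte UTF-8 characters never match [a-z], so a
// Cyrillic word falls through to the "not a normal word" branch.
function looksLikeAsciiWord(string $text): bool
{
    return preg_match('/^[a-z]+$/i', $text) === 1;
}

print_r(countUnits('текст')); // 10 bytes, 5 characters
print_r(countUnits('tekst')); // 5 bytes, 5 characters

var_dump(looksLikeAsciiWord('текст')); // bool(false)
var_dump(looksLikeAsciiWord('tekst')); // bool(true)
```

Code that is not encoding-aware can end up classifying every Cyrillic letter as something other than a word character, which would be consistent with the per-letter output seen in v0.1.6.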

We don't currently have plans to support diffing on a character level, but we would definitely be open to it. It might be something we'd plan for in the v1.0.0 release, depending on how much free time I have or whether someone wants to contribute it 😄 .

So, unfortunately, the library has no supported way to get results like the one you show for the Cyrillic word.

You might be able to do a hacky workaround by loading up the setSpecialCaseChars() configuration, which is an array of characters that should be treated as their own word, with all of the characters in the alphabet... but I have no idea what kind of side effects that would cause, and it's not something I would recommend, since that's not what that configuration is meant to be used for.

Popping a Status: Invalid label on this, since the v0.1.6 behaviour was not actually intended and was instead a side effect of a bug.

I do think there is value in adding support for producing html diffs down to the character-level, however it is not supported currently.
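For anyone who wants to experiment outside the library in the meantime, here is a minimal sketch of a character-level diff built on a longest-common-subsequence table. The function `charDiff()` is hypothetical and works on plain text only, with no awareness of HTML structure. The `diffdel` class matches the output quoted in this issue; the `diffins` class is assumed by analogy.

```php
<?php
// Hypothetical helper for illustration; not part of php-htmldiff.
// Plain-text, character-level diff based on an LCS table.
function charDiff(string $old, string $new): string
{
    // Split into UTF-8 characters rather than bytes.
    $a = preg_split('//u', $old, -1, PREG_SPLIT_NO_EMPTY);
    $b = preg_split('//u', $new, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($a);
    $m = count($b);

    // $lcs[$i][$j] = LCS length of $a[$i..] and $b[$j..].
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = $a[$i] === $b[$j]
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }

    // Walk the table to build a list of [operation, character] pairs.
    $ops = [];
    $i = 0;
    $j = 0;
    while ($i < $n && $j < $m) {
        if ($a[$i] === $b[$j]) {
            $ops[] = ['=', $a[$i]];
            $i++;
            $j++;
        } elseif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $ops[] = ['-', $a[$i]];
            $i++;
        } else {
            $ops[] = ['+', $b[$j]];
            $j++;
        }
    }
    while ($i < $n) {
        $ops[] = ['-', $a[$i++]];
    }
    while ($j < $m) {
        $ops[] = ['+', $b[$j++]];
    }

    // Merge runs of the same operation and wrap them in tags.
    $html = '';
    $k = 0;
    $total = count($ops);
    while ($k < $total) {
        $type = $ops[$k][0];
        $run = '';
        while ($k < $total && $ops[$k][0] === $type) {
            $run .= $ops[$k][1];
            $k++;
        }
        $run = htmlspecialchars($run, ENT_QUOTES, 'UTF-8');
        if ($type === '-') {
            $html .= '<del class="diffdel">' . $run . '</del>';
        } elseif ($type === '+') {
            $html .= '<ins class="diffins">' . $run . '</ins>';
        } else {
            $html .= $run;
        }
    }

    return $html;
}

echo charDiff('текст tekst', 'тест test'), "\n";
```

The LCS table needs memory proportional to the product of the two input lengths, so this approach only suits short strings. Diffing real HTML at character level would also have to keep tags intact, which this sketch does not attempt.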

Closing this issue based on the comments above. If there is interest in producing a character-level diff instead of the word-level diffing done now, please feel free to open an issue for a feature request.