inje/google-diff-match-patch

Minimizing the number of lines

Closed this issue · 8 comments

Consider the left text :

AAA
BBB EEE

and the right text

AAA
BBB DDD
BBB EEE

To the human eye, it's obvious that the middle was added.
However, gdmp (and other character level diff algorithms I tried) see it 
differently, spanning the 
added string ("DDD\nBBB") over two lines.
Of course it makes sense from a computational standpoint, but that should at 
least be cleaned 
up by the cleanup functions, which should try to minimize the number of lines.
What do you think ?



Original issue reported on code.google.com by dpere...@gmail.com on 20 Jun 2008 at 9:35

I agree.  Thanks for bringing this up (and pinging me by email as requested).

This would be an issue with diff_cleanupSemanticLossless which does currently 
slide 
differences to line up with whitespace, but doesn't rank line breaks as having 
a 
higher value than a space.

There would seem to be five things to attempt to slide edit boundaries to (in 
order 
of preference):
1. Start/End of entire text.
2. Blank lines.
3. Line breaks.
4. Whitespace.
5. Non alpha-numeric.

I'll get on this.  If you want you can let me know what language you are using 
(C++, 
JavaScript, Java, Python) and I'll do that one first and send you a prerelease 
copy 
before I get the others updated.  (Yay for 20% time at Google.)


Original comment by neil.fra...@gmail.com on 20 Jun 2008 at 10:34

I just managed to get what I expected by changing a single character in the 
source code :-)
I'm working with the javascript version.
In function diff_cleanupSemanticScore(), I changed :
var whitespace = /(\s)/;
var whitespace = /(\n)/;
Of course the clean up obviously no longer works with words, just lines, but 
that's a start.

Original comment by dpere...@gmail.com on 20 Jun 2008 at 10:35

Aye, a quick and dirty (and 90% effective) solution is to use the following:


  function diff_cleanupSemanticScore(one, two, three) {
    if (!one || !three) {
      // Edges are the best.
      return 10;
    }
    var score = 0;
    var whitespace = /\s/;
    if (one.charAt(one.length - 1).match(whitespace) ||
        two.charAt(0).match(whitespace)) {
      score++;
    }
    if (two.charAt(two.length - 1).match(whitespace) ||
        three.charAt(0).match(whitespace)) {
      score++;
    }
    var linebreak = /[\r\n]/;
    if (one.charAt(one.length - 1).match(linebreak ) ||
        two.charAt(0).match(linebreak )) {
      score++;
    }
    if (two.charAt(two.length - 1).match(linebreak ) ||
        three.charAt(0).match(linebreak )) {
      score++;
    }
    return score;
  }

This probably even passes the existing unit tests.

Original comment by neil.fra...@gmail.com on 20 Jun 2008 at 10:42

Hehe, you're fast ;-)

Original comment by dpere...@gmail.com on 20 Jun 2008 at 10:51

What about :
    var linebreak = /[\r*\n]/;
Instead of
    var linebreak = /[\r\n]/;


Original comment by dpere...@gmail.com on 20 Jun 2008 at 10:55

A new version of diff, match, patch has just been posted in all four languages 
which 
solves this problem thoroughly using the five-point list in comment 1.

Original comment by neil.fra...@gmail.com on 25 Jun 2008 at 12:22

  • Changed state: Fixed
Fixed?

then try:
diff_match_patch_20110217


-------------------------------left:
/**
 * init。
 *  
 */
public void init() {
    this.start();
}
-------------------------------right:
/**
 * init。
 *  
 * @throws InterruptedException 
 * @throws SQLException 
 *
 */
@SuppressWarnings("unused")
private void init() throws SQLException, InterruptedException {
    this.start();
}

Original comment by btp...@gmail.com on 24 Mar 2011 at 7:05

The current behaviour is correct.  I assume you are using the cleanupSemantic 
function?  In that case you get a diff with the following sequence:
  EQUAL, DELETION, INSERTION, EQUAL
And I assume that you are expecting that the " */" line should be preserved 
(which it currently is not).  If that were the case, then the diff would have 
the following sequence:
  EQUAL, INSERTION, EQUAL(" */"), DELETION, INSERTION, EQUAL
This is a more complicated diff sequence, just to save three characters (two of 
which match well on an earlier line).

Any change to make diff preserve this line at the expense of a more complicated 
diff sequence would also cause it to preserve coincidental matches as well -- 
three characters is a pretty small match.

Original comment by neil.fra...@gmail.com on 24 Mar 2011 at 5:45