Minimizing the number of lines
Closed this issue · 8 comments
GoogleCodeExporter commented
Consider the left text :
AAA
BBB EEE
and the right text
AAA
BBB DDD
BBB EEE
To the human eye, it's obvious that the middle was added.
However, gdmp (and other character level diff algorithms I tried) see it
differently, spanning the
added string ("DDD\nBBB") over two lines.
Of course it makes sense from a computational standpoint, but that should at
least be cleaned
up by the cleanup functions, which should try to minimize the number of lines.
What do you think ?
Original issue reported on code.google.com by dpere...@gmail.com
on 20 Jun 2008 at 9:35
GoogleCodeExporter commented
I agree. Thanks for bringing this up (and pinging me by email as requested).
This would be an issue with diff_cleanupSemanticLossless which does currently
slide
differences to line up with whitespace, but doesn't rank line breaks as having
a
higher value than a space.
There would seem to be five things to attempt to slide edit boundaries to (in
order
of preference):
1. Start/End of entire text.
2. Blank lines.
3. Line breaks.
4. Whitespace.
5. Non alpha-numeric.
I'll get on this. If you want you can let me know what language you are using
(C++,
JavaScript, Java, Python) and I'll do that one first and send you a prerelease
copy
before I get the others updated. (Yay for 20% time at Google.)
Original comment by neil.fra...@gmail.com
on 20 Jun 2008 at 10:34
GoogleCodeExporter commented
I just managed to get what I expected by changing a single character in the
source code :-)
I'm working with the javascript version.
In function diff_cleanupSemanticScore(), I changed :
var whitespace = /(\s)/;
var whitespace = /(\n)/;
Of course the clean up obviously no longer works with words, just lines, but
that's a start.
Original comment by dpere...@gmail.com
on 20 Jun 2008 at 10:35
GoogleCodeExporter commented
Aye, a quick and dirty (and 90% effective) solution is to use the following:
function diff_cleanupSemanticScore(one, two, three) {
if (!one || !three) {
// Edges are the best.
return 10;
}
var score = 0;
var whitespace = /\s/;
if (one.charAt(one.length - 1).match(whitespace) ||
two.charAt(0).match(whitespace)) {
score++;
}
if (two.charAt(two.length - 1).match(whitespace) ||
three.charAt(0).match(whitespace)) {
score++;
}
var linebreak = /[\r\n]/;
if (one.charAt(one.length - 1).match(linebreak ) ||
two.charAt(0).match(linebreak )) {
score++;
}
if (two.charAt(two.length - 1).match(linebreak ) ||
three.charAt(0).match(linebreak )) {
score++;
}
return score;
}
This probably even passes the existing unit tests.
Original comment by neil.fra...@gmail.com
on 20 Jun 2008 at 10:42
GoogleCodeExporter commented
Hehe, you're fast ;-)
Original comment by dpere...@gmail.com
on 20 Jun 2008 at 10:51
GoogleCodeExporter commented
What about :
var linebreak = /[\r*\n]/;
Instead of
var linebreak = /[\r\n]/;
Original comment by dpere...@gmail.com
on 20 Jun 2008 at 10:55
GoogleCodeExporter commented
A new version of diff, match, patch has just been posted in all four languages
which
solves this problem thoroughly using the five-point list in comment 1.
Original comment by neil.fra...@gmail.com
on 25 Jun 2008 at 12:22
- Changed state: Fixed
GoogleCodeExporter commented
Fixed?
then try:
diff_match_patch_20110217
-------------------------------left:
/**
* init。
*
*/
public void init() {
this.start();
}
-------------------------------right:
/**
* init。
*
* @throws InterruptedException
* @throws SQLException
*
*/
@SuppressWarnings("unused")
private void init() throws SQLException, InterruptedException {
this.start();
}
Original comment by btp...@gmail.com
on 24 Mar 2011 at 7:05
GoogleCodeExporter commented
The current behaviour is correct. I assume you are using the cleanupSemantic
function? In that case you get a diff with the following sequence:
EQUAL, DELETION, INSERTION, EQUAL
And I assume that you are expecting that the " */" line should be preserved
(which it currently is not). If that were the case, then the diff would have
the following sequence:
EQUAL, INSERTION, EQUAL(" */"), DELETION, INSERTION, EQUAL
This is a more complicated diff sequence, just to save three characters (two of
which match well on an earlier line).
Any change to make diff preserve this line at the expense of a more complicated
diff sequence would also cause it to preserve coincidental matches as well --
three characters is a pretty small match.
Original comment by neil.fra...@gmail.com
on 24 Mar 2011 at 5:45