slightly different results of cleanupSemantic/cleanupEfficiency for C++/Java vs. Python impl.
GoogleCodeExporter opened this issue · 0 comments
GoogleCodeExporter commented
What version of the product are you using? On what operating system?
SVN r103.
Please provide any additional information below.
During my port of diff_match_patch to another language I
did a comparison of the output of the other implementations,
esp. Java, C++, and Python. I ran diff_main on speedtest1.txt
and speedtest2.txt files, then various cleanup functions.
For functions diff_cleanupSemantic and diff_cleanupEfficiency
I noticed slight differences between C++/Java and Python
implementations.
I suppose the Python version is correct, because the output of
the other implementations contained some non-optimal Diffs
that were expected to get eliminated.
A patch is appended that contains fixes for C++ and Java
implementations, so that they create identical output as the
Python version. I've provided new Java test cases.
(I did not check the other languages.)
I have found two reasons for the differences:
1. In diff_cleanupSemantic (and also in diff_cleanupEfficiency),
there is a situation where the previous equality is
thrown away, then the pointer walks back to the preceding
equality. A "pointer.next()" follows, then the loop
starts over processing the equality. I think the pointer
should actually be positioned *behind* the equality, e.g. an
additional "pointer.next()" should be inserted after the
walking back (Note: the existing "pointer.next()" does not
step over the equality, it only moves the iterator's internal cursor.)
With the current implementation, in some cases, an equality
that has been returned to may be marked as safe, which is
not desired, because such an equality can't be reevaluated
anymore.
The patch for the C++ and Java implementation inserts
an additional pointer.next() after the while loop in
cleanupSemantic/cleanupEfficiency to make sure pointer
_steps over the equality_, so that it is positioned at the
next insertion/deletion.
Two new test cases: "Double backpass elimination".
The reason for C++ and Java being affected in nearly
the same way, is the similarity of Qt's QMutableListIterator
and Java's ListIterator.
2. The second problem has something to do with the "safe"
position handling in diff_cleanupEfficiency.
If a replacement of an equality occurs, the equality is
replaced by a deletion and an insertion. If there has been
found both an insertion and a deletion before, the position
will be marked as safe. Later there might be a condition
so that the process walks back to the safe position, to
start reevaluating Diffs. The problem here with the C++
and Java implementations is that the safe position does
not point to the deletion part of the replacement, but to
the insertion part. This means, the deletion part won't
go into the calculation of the next replacement conditions;
thus, a replacable equality may get overlooked.
See the test case "Safe backpass elimination".
Original issue reported on code.google.com by mt4...@googlemail.com
on 6 Jun 2012 at 2:33
Attachments: