jeffkaufman/icdiff

icdiff hanging on large files

ChamiLamelas opened this issue · 5 comments

Hello,

My colleague and I were interested in using icdiff inside of a university autograding software (linked here). However, we have encountered an issue where calling icdiff on somewhat large files seems to hang.

I have attached two ~1.5 MB UTF-8 files here, these files do differ. Some notes from running diff and icdiff on these files manually:

  • Standard diff runs in less than a second.
  • However, icdiff as well as python3 -m icdiff both seem to hang after displaying file1.txt and file2.txt.

Any input on this issue would be appreciated.

file1.txt
file2.txt

This is an issue in python's difflib. You can get diff.py from https://docs.python.org/3/library/difflib.html and you'll see python3 diff.py file1.txt file2.txt hangs (or just is very slow).

I think this is "expected behavior" for difflib: "The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear." --https://docs.python.org/3.5/library/difflib.html#difflib.SequenceMatcher

You could consider filing a bug on difflib, or contributing something there that improves worst-case behavior?

Is adding an option to use cdifflib something you'd consider?

https://stackoverflow.com/a/66694154/8742968

https://pypi.org/project/cdifflib/

@mattrussell2 I think that wouldn't solve the problem? Skimming that package. It looks like a re-implementation in C, but still the same algorithm?

(It's also missing some parts of difflib that icdiff depends on, but that might not be insurmountable?)

Fair enough. Yeah 4x is certainly not going to fix polynomial-time issues.

If anyone's interested, the relevant issue re: difflib is here: python/cpython#51180

What about using google's diff-match-patch?

https://github.com/google/diff-match-patch

This also takes an optional timeout parameter.

From my very quick reading of the source code here, seems like this library should be interoperable....but I'll dig into it a bit more.