icdiff hanging on large files
ChamiLamelas opened this issue · 5 comments
Hello,
My colleague and I were interested in using icdiff
inside of a university autograding software (linked here). However, we have encountered an issue where calling icdiff
on somewhat large files seems to hang.
I have attached two ~1.5 MB UTF-8 files here, these files do differ. Some notes from running diff
and icdiff
on these files manually:
- Standard
diff
runs in less than a second. - However,
icdiff
as well aspython3 -m icdiff
both seem to hang after displayingfile1.txt
andfile2.txt
.
Any input on this issue would be appreciated.
This is an issue in python's difflib
. You can get diff.py
from https://docs.python.org/3/library/difflib.html and you'll see python3 diff.py file1.txt file2.txt
hangs (or just is very slow).
I think this is "expected behavior" for difflib
: "The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear." --https://docs.python.org/3.5/library/difflib.html#difflib.SequenceMatcher
You could consider filing a bug on difflib
, or contributing something there that improves worst-case behavior?
Is adding an option to use cdifflib something you'd consider?
@mattrussell2 I think that wouldn't solve the problem? Skimming that package. It looks like a re-implementation in C, but still the same algorithm?
(It's also missing some parts of difflib that icdiff depends on, but that might not be insurmountable?)
Fair enough. Yeah 4x is certainly not going to fix polynomial-time issues.
If anyone's interested, the relevant issue re: difflib is here: python/cpython#51180
What about using google's diff-match-patch?
https://github.com/google/diff-match-patch
This also takes an optional timeout parameter.
From my very quick reading of the source code here, seems like this library should be interoperable....but I'll dig into it a bit more.