`diff`: fix poor performance
Opened this issue · 1 comments
Is your feature request related to a problem? Please describe.
diff
appears to process files multiple orders of magnitude slower than other implementations:
❯ time -- diff -- ./file-a ./file-b
Elapsed time: 3.381272 seconds
User time: 0.000000 seconds
System time: 0.000000 seconds
❯ time -- diffutils -- ./file-a ./file-b
Elapsed time: 0.005884 seconds
User time: 0.000000 seconds
System time: 0.000000 seconds
❯ du -k -- ./file-a ./file-b
1384 ./file-a
1384 ./file-b
In this case, file-a
and file-b
are identical.
Describe the solution you'd like
I think we should aim for diff
to be able to process files at 10 mebibytes per second, at the very least (on my system, the current implementation is getting less than 0.5 mebibytes per second). On my system, diffutils
from uutils does over 200 mebibytes per second.
Describe alternatives you've considered
N/A
Additional context
N/A
Indeed, the current implementation seems sub-optimal. For every line in file1 it is iterating over every line of file2. Comparing a 90Mb ascii txt file makes it crash :)
I implemented the histogram diff algorithm, probably a bit naively at this point, but it is giving decent results so far (4.5s for diffing two 90Mb files). I was just a bit baffled by the Hunk
struct and why the need for the changes
attribute. In my opinion, once we have the kind
and line start/end for both files we have everything we need. But maybe I am missing something ?
Note: out of the 4.5s, 2s are spent just loading the files (11M lines...)
Anyway, I should be able to open a draft PR soon and maybe get some more guidance.