`diff`: fix poor performance

Question

Opened this issue 2 months ago · 1 comment

Is your feature request related to a problem? Please describe.
`diff` appears to process files multiple orders of magnitude slower than other implementations (roughly 575 times slower in the timings below):

❯ time -- diff -- ./file-a ./file-b
Elapsed time: 3.381272 seconds
User time: 0.000000 seconds
System time: 0.000000 seconds

❯ time -- diffutils -- ./file-a ./file-b
Elapsed time: 0.005884 seconds
User time: 0.000000 seconds
System time: 0.000000 seconds

❯ du -k -- ./file-a ./file-b
1384    ./file-a
1384    ./file-b

In this case, file-a and file-b are identical.

Describe the solution you'd like
I think we should aim for `diff` to process files at 10 mebibytes per second at the very least. On my system, the current implementation manages less than 0.5 mebibytes per second, while diffutils from uutils does over 200 mebibytes per second.
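For reference, the throughput figures can be reproduced from the timings above with a few lines of Rust. This is only a sketch: the helper name `throughput_mib_per_sec` is made up for illustration, and it assumes the size of a single file (1384 KiB as reported by `du -k`, which measures disk usage rather than exact byte length) is the quantity being divided by the elapsed time.

```rust
/// Throughput in MiB/s for `size_kib` kibibytes processed in `elapsed_secs` seconds.
/// Hypothetical helper, used only to check the numbers quoted in this issue.
fn throughput_mib_per_sec(size_kib: f64, elapsed_secs: f64) -> f64 {
    // 1 MiB = 1024 KiB
    (size_kib / 1024.0) / elapsed_secs
}
```

With the measurements above, `throughput_mib_per_sec(1384.0, 3.381272)` comes out at about 0.4 MiB/s and `throughput_mib_per_sec(1384.0, 0.005884)` at about 230 MiB/s.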

Describe alternatives you've considered
N/A

Additional context
N/A

Answer 1 · 2024-09-30T09:27:04.000Z

Indeed, the current implementation seems sub-optimal: for every line in file1 it iterates over every line of file2, so the work grows with the product of the two line counts. Comparing a 90 MB ASCII text file makes it crash :)
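To illustrate the difference in cost, here is a minimal sketch (not the project's actual code; all function names are made up) contrasting a nested scan with a hash-based line index. The nested version does up to `a.len() * b.len()` comparisons, while the indexed version hashes each line once and then answers each lookup in roughly constant time.

```rust
use std::collections::HashMap;

/// Nested scan: for each line of `a`, walk `b` until a match is found.
/// Worst case is a.len() * b.len() string comparisons.
fn count_matches_nested(a: &[&str], b: &[&str]) -> usize {
    let mut matched = 0;
    for line_a in a {
        for line_b in b {
            if line_a == line_b {
                matched += 1;
                break;
            }
        }
    }
    matched
}

/// Build a map from line content to the positions where it occurs.
fn index_lines<'a>(lines: &[&'a str]) -> HashMap<&'a str, Vec<usize>> {
    let mut index: HashMap<&'a str, Vec<usize>> = HashMap::new();
    for (i, line) in lines.iter().enumerate() {
        index.entry(*line).or_default().push(i);
    }
    index
}

/// Same count as the nested scan, but each line of `b` is hashed once
/// and each line of `a` is looked up once.
fn count_matches_indexed(a: &[&str], b: &[&str]) -> usize {
    let index = index_lines(b);
    a.iter().filter(|line| index.contains_key(**line)).count()
}
```

An occurrence index of this kind is also a common starting point for diff algorithms that pick rarely occurring lines as anchors, since the length of each position list gives the occurrence count for free.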

I implemented the histogram diff algorithm, probably a bit naively at this point, but it is giving decent results so far (4.5 s to diff two 90 MB files). I was a bit baffled by the `Hunk` struct, though: why does it need the `changes` attribute? In my opinion, once we have the kind and the start/end lines for both files, we have everything we need. But maybe I am missing something?
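To make the point about hunks concrete, here is a rough sketch of a hunk that stores only a kind and a line range per file, and still produces normal-format output by slicing the original line arrays. The types `HunkKind` and `LineHunk` and the function `render_normal` are hypothetical names for this example, not the existing `Hunk` struct, and the ranges are assumed to be 0-based and half-open.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HunkKind {
    Add,
    Delete,
    Change,
}

/// Hypothetical minimal hunk: a kind plus one line range per file.
#[derive(Debug, Clone, PartialEq, Eq)]
struct LineHunk {
    kind: HunkKind,
    old: Range<usize>, // 0-based, half-open range in file 1
    new: Range<usize>, // 0-based, half-open range in file 2
}

/// Format a range as the 1-based line numbers used in normal diff output.
/// An empty range prints the line number just before the insertion point.
fn fmt_range(r: &Range<usize>) -> String {
    if r.is_empty() {
        r.start.to_string()
    } else if r.len() == 1 {
        (r.start + 1).to_string()
    } else {
        format!("{},{}", r.start + 1, r.end)
    }
}

/// Render one hunk in normal diff format, reading the line text
/// directly from the two inputs instead of from stored changes.
fn render_normal(h: &LineHunk, old: &[&str], new: &[&str]) -> String {
    let cmd = match h.kind {
        HunkKind::Add => 'a',
        HunkKind::Delete => 'd',
        HunkKind::Change => 'c',
    };
    let mut out = format!("{}{}{}\n", fmt_range(&h.old), cmd, fmt_range(&h.new));
    for line in &old[h.old.clone()] {
        out.push_str("< ");
        out.push_str(line);
        out.push('\n');
    }
    if h.kind == HunkKind::Change {
        out.push_str("---\n");
    }
    for line in &new[h.new.clone()] {
        out.push_str("> ");
        out.push_str(line);
        out.push('\n');
    }
    out
}
```

The trade-off is that rendering needs both inputs to stay in memory (or be re-read), so a stored list of changes may still be useful if hunks have to outlive the file buffers.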

Note: of the 4.5 s, 2 s are spent just loading the files (11M lines...)

Anyway, I should be able to open a draft PR soon and maybe get some more guidance.