Long runtime for the quick algorithm on some large data structures
sigvesn opened this issue · 9 comments
At microdata.no we use editscript with the :quick
diff algorithm to generate diff's between potentially large edn structures.
This usually works very well, but i have encountered a special case where the editscript/diff
with the :quick
algorithm is very slow.
It is hard to pinpoint exactly what it is in the data that causes this, as the compared files are about 2.4MB each, but here is a repo with two example files (which also contains code to run the diff):
https://github.com/sigvesn/editscript-quickdiff-example
- each is a vector of about 200 maps
- the map entries potentially contains a lot of data
- almost each map in the vectors are equal, except for one boolean field
:active
with is present for all maps inold
but only one map innew
diff
using either the "non-quick" algorithm (or removing the :active
field from the maps in old
) runs in ~0.2-1 second
Using :quick true
results in a runtime of ~7.5 minutes
Interesting, I will take a look.
Can the problem be connected to the fact that editscript seems to not be able to generate a diff at all, but instead returns the whole file as a replace action?
This is seen in the example output file from the diff generated by https://github.com/sigvesn/editscript-quickdiff-example/blob/main/src/main.clj#L19 at https://github.com/sigvesn/editscript-quickdiff-example/blob/main/out/diff.json
I compared this against a diff created by a different tool: https://www.npmjs.com/package/small-json-diff created for the same files: https://github.com/sigvesn/editscript-quickdiff-example/blob/main/out/diff_small_json.json . This diff shows the expected changes as i described above
I can tell you what I know so far:
- The quick algorithm is extremely slow diffing two huge strings, because you enabled
:str-diff?
. String diff is character by character, which is an extremely expensive operation. We are already using the fastest algorithm, there's not much improvement to speak of. You can try looking for alternatives, but you won't find much faster one. Notice that most diff tools does diffing line by line, it is for good reason. - The A* star algorithm chose to replace the whole thing, because the diff size is smaller that way. Our definition of diff size has been changed to reflect strictly the object count (in the past, we only count the number of operations, which may give the results you find desirable), replacing the whole thing is often smaller than having a diff data structure, with its additional operations, paths (the path is a vector, so it includes lots of objects of its own), and so on. The PATCH format in that JSON tool you show somehow does not contain path for those "remove" operations, for example, I am not sure how it could be used to patch things correctly in a general case. What if not all the maps have the same operation? how does it know which map to change without such path? If paths are included, the object count is going to be larger than just replacing the whole thing. Our diff format can be concatenated, so each diff operation must include enough context to be applied on its own. Such requirement is not present in JSON PATCH format, which is only applicable to a JSON file as a whole, not to a stream of diffs. The goal of Editscript is to help with reducing transported data size, hence our focus on real diff size. The use case is for transporting small changes applied on large data structure. This is not a tool for inspecting by human what changes have been done after some big changes. For that, you should be using other tools.
A piece of advise: unless strictly necessary, I advise against a list or a vector of maps as a data structure. The reason is that a list or a vector implies order, but order is an expensive thing to keep when diffing. If the use case does not call for order, why not use a set? EDN introduces sets for good reason.
There's not much I can do at this point. The quick algorithm compare things blindly so it attempted to compare two huge strings, which is going to cost a lot of time. My advise is to use the default and do not compare strings in your case. Because you have some huge strings. Since your use case is to sync client/server state, I really don't see why you would like to compare huge strings like that. It costs a lot more to compare strings than just sending the whole thing over the network.
The A* algorithm did the right thing. If you insist on getting what you think you should be getting, you will have to use something else. Editscript is not designed for such huge changes, we intend it to extract small changes that are much smaller than the original data. In the example you show, the changes will be bigger than the original data using our diff format. In such case, overwriting the data is the right thing to do.
Closing.
On a second thought, I can introduce a time-out for vector diff, so we can skip the computation if time is out.
Another option is to diff strings by lines instead of by characters. That should speed up things significantly.
These changes will be introduced in 0.6.0 release.
Not using string diffing seems like a good solution for us. A time-out would also be welcome.
Thank you very much for the help and advice!
0.6.1 now allows controlling string diffing, and added a timeout option.
Using :str-diff :line
option, running quick diff algorithm on your data took only 570ms on my machine.
Using :str-diff :word
instead, some string diff will time out, but the total time took 4.8s on my machine.
The new :str-diff options and timeout seems to be working great. Some small notices:
-
Using an
opts
map withdiff
no longer works in clojure 1.10 and will throw anjava.lang.IllegalArgumentException: No value supplied for key: {}
error. Might be worth to note that 1.11 is required. -
Small error in the changelog where is says
:diff-str
instead of the correct:str-diff
-
With some small values of timeout, the quick diff algorithm (and large enough data structures to trigger a timeout?) causes the error
Execution error (IllegalArgumentException) at editscript.diff.quick/diff-vec (quick.cljc:45).
Don't know how to create ISeq from: clojure.lang.Keyword
this can be reproduced by adding :algo :quick
to the timeout test
Thank you for testing. These issues are addressed in 0.6.2.