juji-io/editscript

Long runtime for the quick algorithm on some large data structures

sigvesn opened this issue · 9 comments

At microdata.no we use editscript with the :quick diff algorithm to generate diff's between potentially large edn structures.

This usually works very well, but i have encountered a special case where the editscript/diff with the :quick algorithm is very slow.

It is hard to pinpoint exactly what it is in the data that causes this, as the compared files are about 2.4MB each, but here is a repo with two example files (which also contains code to run the diff):

https://github.com/sigvesn/editscript-quickdiff-example

  • each is a vector of about 200 maps
  • the map entries potentially contains a lot of data
  • almost each map in the vectors are equal, except for one boolean field :active with is present for all maps in old but only one map in new

diff using either the "non-quick" algorithm (or removing the :active field from the maps in old) runs in ~0.2-1 second

Using :quick true results in a runtime of ~7.5 minutes

Interesting, I will take a look.

Can the problem be connected to the fact that editscript seems to not be able to generate a diff at all, but instead returns the whole file as a replace action?

This is seen in the example output file from the diff generated by https://github.com/sigvesn/editscript-quickdiff-example/blob/main/src/main.clj#L19 at https://github.com/sigvesn/editscript-quickdiff-example/blob/main/out/diff.json

I compared this against a diff created by a different tool: https://www.npmjs.com/package/small-json-diff created for the same files: https://github.com/sigvesn/editscript-quickdiff-example/blob/main/out/diff_small_json.json . This diff shows the expected changes as i described above

I can tell you what I know so far:

  • The quick algorithm is extremely slow diffing two huge strings, because you enabled :str-diff?. String diff is character by character, which is an extremely expensive operation. We are already using the fastest algorithm, there's not much improvement to speak of. You can try looking for alternatives, but you won't find much faster one. Notice that most diff tools does diffing line by line, it is for good reason.
  • The A* star algorithm chose to replace the whole thing, because the diff size is smaller that way. Our definition of diff size has been changed to reflect strictly the object count (in the past, we only count the number of operations, which may give the results you find desirable), replacing the whole thing is often smaller than having a diff data structure, with its additional operations, paths (the path is a vector, so it includes lots of objects of its own), and so on. The PATCH format in that JSON tool you show somehow does not contain path for those "remove" operations, for example, I am not sure how it could be used to patch things correctly in a general case. What if not all the maps have the same operation? how does it know which map to change without such path? If paths are included, the object count is going to be larger than just replacing the whole thing. Our diff format can be concatenated, so each diff operation must include enough context to be applied on its own. Such requirement is not present in JSON PATCH format, which is only applicable to a JSON file as a whole, not to a stream of diffs. The goal of Editscript is to help with reducing transported data size, hence our focus on real diff size. The use case is for transporting small changes applied on large data structure. This is not a tool for inspecting by human what changes have been done after some big changes. For that, you should be using other tools.

A piece of advise: unless strictly necessary, I advise against a list or a vector of maps as a data structure. The reason is that a list or a vector implies order, but order is an expensive thing to keep when diffing. If the use case does not call for order, why not use a set? EDN introduces sets for good reason.

There's not much I can do at this point. The quick algorithm compare things blindly so it attempted to compare two huge strings, which is going to cost a lot of time. My advise is to use the default and do not compare strings in your case. Because you have some huge strings. Since your use case is to sync client/server state, I really don't see why you would like to compare huge strings like that. It costs a lot more to compare strings than just sending the whole thing over the network.

The A* algorithm did the right thing. If you insist on getting what you think you should be getting, you will have to use something else. Editscript is not designed for such huge changes, we intend it to extract small changes that are much smaller than the original data. In the example you show, the changes will be bigger than the original data using our diff format. In such case, overwriting the data is the right thing to do.

Closing.

On a second thought, I can introduce a time-out for vector diff, so we can skip the computation if time is out.

Another option is to diff strings by lines instead of by characters. That should speed up things significantly.

These changes will be introduced in 0.6.0 release.

Not using string diffing seems like a good solution for us. A time-out would also be welcome.

Thank you very much for the help and advice!

0.6.1 now allows controlling string diffing, and added a timeout option.

Using :str-diff :line option, running quick diff algorithm on your data took only 570ms on my machine.
Using :str-diff :word instead, some string diff will time out, but the total time took 4.8s on my machine.

The new :str-diff options and timeout seems to be working great. Some small notices:

  • Using an opts map with diff no longer works in clojure 1.10 and will throw an java.lang.IllegalArgumentException: No value supplied for key: {} error. Might be worth to note that 1.11 is required.

  • Small error in the changelog where is says :diff-str instead of the correct :str-diff

  • With some small values of timeout, the quick diff algorithm (and large enough data structures to trigger a timeout?) causes the error

Execution error (IllegalArgumentException) at editscript.diff.quick/diff-vec (quick.cljc:45).
Don't know how to create ISeq from: clojure.lang.Keyword

this can be reproduced by adding :algo :quick to the timeout test

Thank you for testing. These issues are addressed in 0.6.2.