vmware-archive/database-stream-processor

Comparison between DBSP engine and differential dataflow

Opened this issue · 2 comments

Hi, I'm very interested in incremental computation and I have learned dd for a while. I found dbsp some days ago and it seems can do the same things as dd. What's differences between them, when should we use dbsp?

This is a complicated question because it can be interpreted in many ways and the answer really depends on your needs.

DD is an open-source library, but, as you probably know, it is the core of the commercial Materialize.com offering. I don't know much about that product, so I won't comment about it. My answer applies to the open-source side.

The DBSP paper https://github.com/vmware/database-stream-processor/blob/main/doc/vldb23/main.pdf does have a short comparison with DD, but that is really about the theoretical side. I quote from Section 9:

DBSP is also related to Differential Dataflow (DD) [39, 42] and
its theoretical foundations [3] (and recently [17, 38]). DD’s compu-
tational model is more powerful than DBSP, since it allows time
values to be part of an arbitrary lattice. In fact, DD is the only
other framework which we are aware of that can incremental-
ize recursive queries as efficiently as DBSP does. In contrast, our
model uses either “linear” times, or nested time dimensions via the
modular lifting transformer (↑). DBSP can express both incremen-
tal and non-incremental computations. Most importantly, DBSP
comes with Algorithm 4.6, a syntax-directed translation that can
convert any expressible query into an incremental version — in DD
users have to assemble incremental queries manually using incre-
mental operators. (materialize.com offers a product that automates
incrementalization for SQL queries based on DD. Differential Data-
log [45] does it for a Datalog dialect.) Unlike DD, DBSP is a modular
theory, which easily accommodates the addition of new operators:
as long as we can express a new operator as a DBSP circuit, we can
(1) define its incremental version, (2) apply the incrementalization
algorithm to obtain an efficient incremental implementation, and
(3) be confident that it composes with any other operators.

In practical terms, in our benchmarks we are slightly faster than DD (10-20%). We also provide an open-source SQL compiler on top of DBSP (https://github.com/vmware/sql-to-dbsp-compiler). We do believe that the modularity of our theory does allow easier extensions, and we have added some operators that DD does not have (e.g., Window-based aggregation).

In terms of debugging, DD has been around for a long time, so supposedly it is more tested. However, the core logic of our code is simpler; these lattice-based times are not easy to understand. Our main weakness right now may be documentation, which we are planning to improve significantly.

I will be glad to give a more precise answer if you can elaborate on your use cases.

Let me add though, that if you are interested in incremental computation in general you must read our paper even if you don't plan to use the software. (Or you can watch the video in our README for a simplified description.) It provides a new perspective on this problem, which we believe opens up many new interesting theoretical and practical directions.