TimelyDataflow/timely-dataflow

Extracting results

vorner opened this issue · 2 comments

Hello

I'm trying to grasp how the timely dataflow is meant to be used kind of end to end. The documentation concentrates on how to do and structure the computation and its good at explaining that. But I kind of still don't get how to connect it to the rest of the world.

To illustrate what I mean, let's say I have a huge source of data ‒ let's say live stream of log messages or metrics or something like that. So, the whole thing will do something like this:

  • I have bunch of worker processes across machines.
  • They either read a partition of the stream each, or one „master“ reads them and inserts them into the system and it'll distribute through the workers.
  • It performs the computation, producing let's say events of style „this service entered alert state“ and „this service left alert state“.

However, let's say I want a static web page containing all the services in alert state at the currently newest timestamp. This sounds challenging, because:

  • The information about the state of different machines is scattered through the cluster, not at one place.
  • They may be coming out of order (OK, that probably can be somehow worked around by buffering in a custom operator).

I guess I could somehow force my way through it, but all the ideas I have in mind look like huge hacks. Is there a correct way to do such things? Should it be part of the book? Because all the „exporting“ of the results is done by println ‒ that'll end up in a random worker, right?

To control the flow of outputs back to where you want it, you can use the exchange operator. If your Timely workers are accepting external clients, you would for example keep track of which requests are owned by which worker, and then exchange based on that. This ensures that all relevant outputs will be routed to the correct worker.

You can also see exchange being used here and here.

Dealing with out-of-order data is basically Timely's raison d'être!

Take a look (for example) at what Timely's sequencing primitive is doing within its sink operator.

It makes use of Timely's progress tracking, specifically the input.frontier() – a set of times that all workers have completed, explained here, in order to ensure consistent handling of outputs.

Yes, that's about what I imagined as the hack.

However, my point was more in the sense the book that should teach people how to use it should actually mention these issues and point in the right direction.