ngrunwald/datasplash

ParDo question/clarification/example

Closed this issue · 4 comments

Hello, I'm rather new to Dataflow and have tried this library out for some data processing. It's awesome and thank you! I have a question, can you clarify how/have some examples of how to use the raw pardo-fn within a Clojure program? How is this different from say just using map?

RolT commented

Hi, pardo gives you access to the whole ProcessContext and not just the element, so you can access things like timestamp, pane info, window and side input. And it lets you handle the output the way you want, you can output 0 or more elements. So you could implement filter/map/mapcat with pardo.
It can be useful for debugging, but I think having to use it in a program would mean that datasplash is incomplete.

Disclaimer: I'm not a datasplash expert.

Great, thank you for the information and insight. Do you have any examples that you would be able to share where you've used it for debugging as you mention?

RolT commented

For instance debugging windowing:

(->> (ds/make-pipeline {})
     (ds/read-json-file "myfile.ndjson" {:key-fn true})
     (ds/map (fn [e] (ds/with-timestamp (clj-time.core/now) e)))
     (ds/fixed-windows (clj-time.core/seconds 30))
     (ds/pardo (fn [c] (.output c (assoc (.element c) :timestamp (str (.timestamp c)) :pane (str (.pane c)) :window (str (.window c))))))
     (ds/write-json-file "test.ndjson")
     (ds/run-pipeline))

This is great, thank you for the example!..