nathanmarz/cascalog

Prepared operations are confusing

ipostelnik opened this issue · 9 comments

Functions declared using (prepfn) are treated by Cascalog as vanilla Clojure functions, so they behave as regular map/filter functions. If you want to make a prepared mapcat/buffer/aggregator, you have to write:

(def prepped-mapcat-op (mapcatop (prepfn [fp call] ...)))

rather than returning a mapcatfn out of prepfn, as the examples imply. In fact, there's never a reason to use any of the special *fn's in the body of a prepared function.

We should come up with a better syntax for these or make better docs.

Yeah, prepfn transforms inner calls to fn into the serializable version. If you're writing fns inline, you're correct, but if you define the internal fns in a let binding outside, like

(let [op (fn [x] ....)]
  (prepfn [fp call] op))

it won't work. So, sort of tricky.

I'm open to other ideas for syntax for sure. One idea is to define prep versions of all the macros, but that seems janky (defprepmapcatfn, crazy times!)

I think at a minimum we should clarify that prepfn is a peer of s/fn, so effectively it's just a special kind of vanilla function. Using it in other contexts requires lifting via mapcatop/bufferop/etc.

@ipostelnik a lot of prepfn impls I've seen exist to get access to counters. what do you think of #270 as a nicer way to get access to this stuff?

I really like the stats implementation that hides the guts of Hadoop counters.

We have 2 use cases for prepared functions - counters and (effectively)
simulating hash joins. We have a lot of variants of the latter that use
richer data structures and logic than what plain hash-join allows.

Nice. Custom hash joins are actually my next thing I wanted to play with. Would love some input on what you guys are doing, if you have any examples you might share.

@sritchie here is a goofy (cascalog v1) example:

(deffilterop ^:stateful user-can-enter-party?
  "A map-op which reads/parses a complex object in distributed cache to create a map of party-name->participant-set in order to filter out users in line at a party"
  ([] (-> (read-cached-party-information) make-party->user-set))

  ([party->user-set party user]
    (let [user-set (get party->user-set party)]
      (contains? user-set user)))

  ([_]))

After more thinking about this, the big problem is in the Java Cascading/Clojure bridge. We need to know at query planning time whether to emit a ClojureMapcat or a ClojureMap operation. Instead, we should use function metadata to decide how to translate the return value into tuples.
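
To make the metadata idea concrete, here is a rough sketch in plain Clojure. It does not touch the Cascading bridge or any Cascalog API: the :op-type metadata key and the names result->tuples and run-op are made up for illustration. It only shows how a single dispatch point could choose between map-style and mapcat-style tuple emission at call time.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch: :op-type is an invented metadata key,
;; not an existing Cascalog convention.
(defn result->tuples
  "Turns the return value of f into a seq of output tuples,
  based on the (assumed) :op-type metadata attached to f.
  Functions without the key are treated as :map."
  [f result]
  (case (:op-type (meta f) :map)
    :map    [[result]]             ; one output tuple per call
    :mapcat (map vector result)))  ; one output tuple per returned element

(defn run-op
  "Calls f on args and translates the result into tuples."
  [f & args]
  (result->tuples f (apply f args)))

;; Two example functions tagged with the assumed metadata key.
(def double-it
  (with-meta (fn [x] (* 2 x)) {:op-type :map}))

(def split-words
  (with-meta (fn [s] (str/split s #" ")) {:op-type :mapcat}))

(run-op double-it 21)      ; a single one-field tuple
(run-op split-words "a b") ; one tuple per word
```

With something like this, the planner would not have to commit to a ClojureMap or ClojureMapcat up front; the decision would move to wherever the function's return value is turned into tuples.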

Oh, that's interesting. Yeah, I guess we could access that metadata from within Java. Any interest in trying that out, @ipostelnik ?

I ended up writing a couple of macros modeled after XXXop and defXXXfn. The code is here: https://gist.github.com/ipostelnik/1d5566322fa1dec97b0a

I also wrote simple wrappers that lift get (as map and mapcat) and contains? into stateful ops using state loaded into a map or set.
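
The real wrappers are in the gist above. For readers who just want the shape of the idea, here is a minimal plain-Clojure sketch, assuming the state is produced by a zero-argument loader function. The names lift-get, lift-contains? and load-party->users are hypothetical, and nothing here uses the Cascalog API: the outer call stands in for the prepare step, and the returned function stands in for the per-tuple operation.

```clojure
;; Hypothetical sketch of "load state once, then look things up per tuple".
(defn lift-get
  "Returns a prepare function. Calling it loads the state once and
  returns an operate function that looks up a key in that state."
  [load-state]
  (fn prepare []
    (let [state (load-state)]          ; runs once per prepare call
      (fn operate [k] (get state k)))))

(defn lift-contains?
  "Like lift-get, but the operate function tests membership in a set."
  [load-state]
  (fn prepare []
    (let [state (load-state)]
      (fn operate [x] (contains? state x)))))

;; Stand-in for reading from the distributed cache; counts how often it runs.
(def load-count (atom 0))

(defn load-party->users []
  (swap! load-count inc)
  {"gala" #{"ann" "bob"}})

;; Prepare once, then call the operate function as often as needed.
(def party-users ((lift-get load-party->users)))
(def on-guest-list? ((lift-contains? (fn [] #{"ann" "bob"}))))

(party-users "gala")     ; looks up the set of users for the party
(on-guest-list? "carol") ; membership test against the preloaded set
```

The point of the pattern is that the expensive load happens in the prepare step, not on every tuple.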