Netflix/PigPen

Use cascading-hadoop2-mr1 by default

fs111 opened this issue · 10 comments

Cascading supports multiple backends, and we recommend the Hadoop 2 (YARN) platform for everyone these days. It would be great if pigpen-cascading could move to that. Let me know if you want a PR for that, or if something forces you to stay on Hadoop 1.x.

I'm fine with upgrading. It's using hadoop 1 because that's what existed when I originally wrote PigPen and it just got copied around. On the pig side, it's compiled as a provided dependency, so you can really use whatever you want on the cluster. Would it make sense to change the reference in pigpen-cascading to provided also? What's the best practice for cascading there - should hadoop and cascading be included in the uberjar that a user submits?

Also, I imagine that targeting Hadoop 2 isn't going to break backwards compatibility for users still running a Hadoop 1 cluster?

Cascading should be in the jar you submit; Hadoop is meant to be set to provided.

Cool - I can make that change. I'll update the pig version at the same time. Is there a specific version you'd like to see?

The thing is a little bit more complicated: Cascading has a notion of platforms on which it can run. Each platform is an abstraction of a runtime and has its own FlowConnector and so on. For Cascading 2.x, we have three platforms:

  • cascading-local: local, in-memory, no Hadoop at all
  • cascading-hadoop: Hadoop 1.x
  • cascading-hadoop2-mr1: Hadoop 2.x

In Cascading 3, we will have more platforms, notably

  • cascading-hadoop2-tez: using Tez instead of MapReduce

Others are in the works, from what I know.

The thing is that those platforms are mutually exclusive. While you can move a Cascading app onto a new platform by changing the flow connector, they cannot all be on the classpath at once. A user has to ship with exactly one.

To make a long answer short, I would move to cascading-hadoop2-mr1 (compile scope) on Hadoop 2.6 (provided scope), and if somebody wants to move to a different platform, they can do the excludes manually.
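
Purely to illustrate (Leiningen-style coordinates; the exact versions, and whichever build tool pigpen actually uses, are assumptions on my part):

```clojure
;; Illustrative sketch only: pigpen-cascading's build under this proposal.
;; Versions are placeholders, not a recommendation of specific releases.
(defproject pigpen-cascading "x.y.z"
  ;; the platform jar ships in the uberjar users submit (compile scope)
  :dependencies [[cascading/cascading-hadoop2-mr1 "2.6.1"]]
  ;; hadoop itself stays provided - the cluster supplies it at runtime
  :profiles {:provided
             {:dependencies [[org.apache.hadoop/hadoop-client "2.6.0"]]}}
  ;; Cascading artifacts are published to the conjars repository
  :repositories [["conjars" "http://conjars.org/repo"]])
```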

Does that make sense?

I think so... here's what I'm thinking - let me know if it's crazy.

In the pigpen build, I'll mark both the hadoop and cascading dependencies as provided. We'll then suggest in the tutorial that users choose the cascading version/platform they want (as a compile dep) and the hadoop version they want (as a provided dep).

Do you think that'll work? That way they can pick whatever versions of cascading and hadoop they want, and it's up to them to make sure everything ends up in the uberjar they ship to hadoop.

That only makes sense if you use nothing but Cascading APIs. While most platforms have Hadoop ties, it is perfectly possible to have a platform without Hadoop at all; our local platform is one example. In that case the idea does not work.

Maybe the safest thing to do is to ship a direct dependency on cascading-hadoop2-mr1 by default and let experts use it on other platforms if they desire. That makes it easy for people to get started, and we can take it from there with the community.

Ahhh - I think I get it. Making hadoop a provided dependency isn't great because some platforms don't use hadoop at all; on those platforms the hadoop jars won't be present, and our code can't run without them because it uses hadoop classes. It's not so much a question of which hadoop to use; it's more a question of to hadoop or not to hadoop.

I can think of two ways around this...

  1. We would have a common pigpen-cascading jar with the core pigpen-to-cascading flow conversion, and then separate jars (such as pigpen-cascading-hadoop2-mr1) that supply a FlowConnector, call (.connect ...), and add the appropriate compile dependencies. The upside is that we're not adding to jar hell for the user - they don't have to exclude hadoop if they don't want it. The downside is that we have to maintain one of these shim jars for every platform.

  2. Instead of pigpen doing the final connect, we would simply return a FlowDef and instruct the user how to wire it up. They supply their connector, they supply the properties (issue #161), and they call connect. The upside is that there's only a single pigpen-cascading jar to maintain, and it isn't dependent on anything Hadoop-related. The downside is that the user has to write some boilerplate to get going, but that's really just adding the (.connect ...) call to the tutorial here: https://github.com/Netflix/PigPen/wiki/Getting-Started-for-Cascading-Users#running-a-flow (see the sketch below).
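
For option 2, the boilerplate I have in mind would look roughly like this - the Cascading classes are real, but pigpen.cascading/flow-def is a made-up name for whatever we'd end up exposing:

```clojure
;; Rough sketch of the user-side wiring for option 2, assuming a hypothetical
;; pigpen.cascading/flow-def that returns a cascading.flow.FlowDef instead of
;; connecting it. Everything except the Cascading classes is illustrative.
(require '[pigpen.cascading :as cascading])
(import 'cascading.flow.hadoop2.Hadoop2MR1FlowConnector)

(defn run-query [query]
  (let [flow-def  (cascading/flow-def query)         ; hypothetical pigpen call
        props     (java.util.Properties.)            ; user-supplied (issue #161)
        connector (Hadoop2MR1FlowConnector. props)]  ; user picks the platform
    (.complete (.connect connector flow-def))))
```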

My preference is for option 2. I think it's simpler and easier to maintain long term. I feel like the fewer stakes we put in the ground regarding versions & libraries the better.

@pkozikow I'd love to hear your thoughts on this. Do you have any concerns with that change?

I think there might be one small hiccup to this plan though - it looks like the default taps for reading/writing strings use some Hadoop-related classes: https://github.com/Netflix/PigPen/blob/master/pigpen-cascading/src/main/clojure/pigpen/cascading/core.clj#L94

Does this imply that taps are tied to the platform being used? Would different platforms use different taps?

And of course there is option 3: stick with the current approach and hard-wire it to use hadoop2. It is an option, but I'd rather fix this the right way now, before people start using it. Once people start using it, it's a lot harder to change stuff like this.

According to Cascading's documentation, different platforms do indeed use different taps: "... using platform-specific Tap and Scheme classes that hide the platform-related I/O details from the developer". That being the case, I don't see a nice way of implementing all the built-in PigPen load/store functions without using platform-specific code under the hood when building the Cascading flow. It might be possible to do some sort of dynamic class loading at runtime to detect the platform and use the appropriate taps, but that sounds rather ugly and complicated. Besides, I would say that, at least in the short term, by and large everyone will want to use the hadoop platform. The loss of the local platform is not significant since PigPen has its own local mode anyway. That would mean shipping a direct dependency on hadoop and letting experts customize with excludes manually, like Andre suggested.
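
To make that concrete, here's what a plain "read lines of text" tap looks like on the hadoop platform versus the local one (Cascading 2.x class names; only one platform's jars would actually be on the classpath at a time, so this is just for comparison):

```clojure
;; Platform-specific Tap/Scheme pairs for reading a text file in Cascading 2.x.
;; Hadoop-specific classes like these are what the default taps in
;; pigpen.cascading.core currently rely on.
(import '(cascading.scheme.hadoop TextLine)
        '(cascading.tap.hadoop Hfs))

;; hadoop / hadoop2-mr1 platform
(def hadoop-tap (Hfs. (TextLine.) "hdfs:///tmp/input.txt"))

;; the local platform needs different classes for the same job:
;; (cascading.tap.local.FileTap. (cascading.scheme.local.TextLine.) "input.txt")
```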

Fair enough - option 3 is the easier change :)

I'm still having problems with hadoop 2, but I'll follow up on the other thread.