/PigPen

Map-Reduce for Clojure

Primary LanguageClojureApache License 2.0Apache-2.0

PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

Getting Started, Tutorials & Documentation

Getting started with Clojure and PigPen is really easy.

  • The wiki explains what PigPen does and why we made it
  • The tutorial is the best way to get Clojure and PigPen installed and start writing queries
  • The full API lists all of the operators with example usage
  • PigPen for Clojure users is great for Clojure users new to map-reduce
  • PigPen for Pig users is great for Pig users new to Clojure

Note: If you are not familiar at all with Clojure, I strongly recommend that you try a tutorial here, here, or here to understand some of the basics.

Note: PigPen is not a Clojure wrapper for writing Pig scripts you can hand edit. While entirely possible, the resulting scripts are not intended for human consumption.

Questions & Complaints

Artifacts

pigpen is available from Maven:

With Leiningen:

[com.netflix.pigpen/pigpen "0.2.5"]

With Gradle:

compile "com.netflix.pigpen:pigpen:0.2.5"

With Maven:

<dependency>
  <groupId>com.netflix.pigpen</groupId>
  <artifactId>pigpen</artifactId>
  <version>0.2.5</version>
</dependency>

Note: PigPen requires Clojure 1.5.1 or greater

Release Notes

  • 0.2.5

    • Remove dump&show and dump&show+ in favor of pigpen.oven/bake. Call bake once and pass to as many outputs as you want. This is a breaking change, but I didn't increment the version because dump&show was just a tool to be used in the REPL. No scripts should break because of this change.
    • Remove dymp-async. It appeared to be broken and was a bad idea from the start.
    • Fix self-joins. This was a rare issue as a self join (with the same key) just duplicates data in a very expensive way.
    • Clean up functional tests
    • Fix pigpen.oven/clean. When it was pruning the graph, it was also removing REGISTER commands.
  • 0.2.4

    • Fix arity checking bug (affected varargs fns)
    • Fix cases where an Algebraic fold function was falling back to the Accumulator interface, which was not supported. This affected using cogroup with fold over multiple relations.
    • Fix debug mode (broken in 0.1.5)
    • Change UDF initialization to not rely on memoization (caused stale data in REPL)
    • Enable AOT. Improves cluster perf
    • Add :partition-by option to distinct
  • 0.2.3

    • Added load-json, store-json, load-string, store-string
    • Added filter-by, and remove-by
  • 0.2.2

    • Fixed bug in pigpen.fold/vec. This would also cause fold/map and fold/filter to not work when run in the cluster.
  • 0.2.1

    • Fixed bug when using for to generate scripts
    • Fixed local mode bug with map followed by reduce or fold
  • 0.2.0

    • Added pigpen.fold - Note: this includes a breaking change in the join and cogroup syntax as follows:
    ; before
    (pig/join (foo on :f)
              (bar on :b optional)
              (fn [f b] ...))
    
    ; after
    (pig/join [(foo :on :f)
               (bar :on :b :type :optional)]
              (fn [f b] ...))

    Each of the select clauses must now be wrapped in a vector - there is no longer a varargs overload to either of these forms. Within each of the select clauses, :on is now a keyword instead of a symbol, but a symbol will still work if used. If optional or required were used, they must be updated to :type :optional and :type :required, respectively.

  • 0.1.5

    • Performance improvements
      • Implemented Pig's Accumulator interface
      • Tuned nippy
      • Reduced number of times data is serialized
  • 0.1.4

    • Fix sort bug in local mode
  • 0.1.3

    • Change Pig & Hadoop to be transitive dependencies
    • Add support for consuming user code via closure
  • 0.1.2

    • Upgrade instaparse to 1.2.14
  • 0.1.1

    • Initial Release