PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
Getting started with Clojure and PigPen is easy:
- The wiki explains what PigPen does and why we made it
- The tutorial is the best way to get Clojure and PigPen installed and start writing queries
- The full API lists all of the operators with example usage
- PigPen for Clojure users is great for Clojure users new to map-reduce
- PigPen for Pig users is great for Pig users new to Clojure
Note: If you are not at all familiar with Clojure, I strongly recommend that you try a tutorial here, here, or here to learn some of the basics.
Note: PigPen is not a Clojure wrapper for writing Pig scripts you can hand edit. Hand editing is entirely possible, but the resulting scripts are not intended for human consumption.
pigpen is available from Maven:
With Leiningen:
[com.netflix.pigpen/pigpen "0.2.5"]
With Gradle:
compile "com.netflix.pigpen:pigpen:0.2.5"
With Maven:
<dependency>
  <groupId>com.netflix.pigpen</groupId>
  <artifactId>pigpen</artifactId>
  <version>0.2.5</version>
</dependency>
Note: PigPen requires Clojure 1.5.1 or greater
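For Leiningen users, the dependency vector above goes in the :dependencies list of project.clj. The following is only a sketch: the project name and version are hypothetical placeholders, and the Clojure version shown is simply the minimum stated in the note above.

```clojure
;; project.clj (sketch; "my-pigpen-project" and "0.1.0-SNAPSHOT" are placeholders)
(defproject my-pigpen-project "0.1.0-SNAPSHOT"
  :dependencies [;; PigPen requires Clojure 1.5.1 or greater
                 [org.clojure/clojure "1.5.1"]
                 ;; coordinates copied from the Leiningen line above
                 [com.netflix.pigpen/pigpen "0.2.5"]])
```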
- 0.2.5
  - Remove dump&show and dump&show+ in favor of pigpen.oven/bake. Call bake once and pass the result to as many outputs as you want. This is a breaking change, but I didn't increment the version because dump&show was just a tool to be used in the REPL. No scripts should break because of this change.
  - Remove dump-async. It appeared to be broken and was a bad idea from the start.
  - Fix self-joins. This was a rare issue, as a self-join (with the same key) just duplicates data in a very expensive way.
  - Clean up functional tests
  - Fix pigpen.oven/clean. When it was pruning the graph, it was also removing REGISTER commands.
- 0.2.4
  - Fix arity checking bug (affected varargs fns)
  - Fix cases where an Algebraic fold function was falling back to the Accumulator interface, which was not supported. This affected using cogroup with fold over multiple relations.
  - Fix debug mode (broken in 0.1.5)
  - Change UDF initialization to not rely on memoization (caused stale data in REPL)
  - Enable AOT. Improves cluster perf
  - Add :partition-by option to distinct
- 0.2.3
  - Added load-json, store-json, load-string, store-string
  - Added filter-by and remove-by
- 0.2.2
  - Fixed bug in pigpen.fold/vec. This would also cause fold/map and fold/filter to not work when run in the cluster.
- 0.2.1
  - Fixed bug when using for to generate scripts
  - Fixed local mode bug with map followed by reduce or fold
- 0.2.0
  - Added pigpen.fold. Note: this includes a breaking change in the join and cogroup syntax, as follows:

        ; before
        (pig/join (foo on :f)
                  (bar on :b optional)
                  (fn [f b] ...))

        ; after
        (pig/join [(foo :on :f)
                   (bar :on :b :type :optional)]
                  (fn [f b] ...))

    The select clauses must now be wrapped together in a vector; there is no longer a varargs overload to either of these forms. Within each select clause, :on is now a keyword instead of a symbol, but a symbol will still work if used. If optional or required were used, they must be updated to :type :optional and :type :required, respectively.
- 0.1.5
  - Performance improvements
    - Implemented Pig's Accumulator interface
    - Tuned nippy
    - Reduced number of times data is serialized
- 0.1.4
  - Fix sort bug in local mode
- 0.1.3
  - Change Pig & Hadoop to be transitive dependencies
  - Add support for consuming user code via closure
- 0.1.2
  - Upgrade instaparse to 1.2.14
- 0.1.1
  - Initial Release
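
The join syntax introduced in 0.2.0 can be tried out in the REPL. The following is a minimal sketch, not an excerpt from the PigPen docs: it assumes pigpen.core is aliased as pig, and the relations foo and bar, their fields, and the join function are all hypothetical. pig/return is used only to build small in-memory relations, and pig/dump to run the query locally.

```clojure
(require '[pigpen.core :as pig])

;; Hypothetical sample relations built from in-memory data
(def foo (pig/return [{:f 1 :name "a"}
                      {:f 2 :name "b"}]))

(def bar (pig/return [{:b 1 :val 10}]))

;; Select clauses are wrapped in a vector, :on is a keyword,
;; and the optional side is marked with :type :optional
(def joined
  (pig/join [(foo :on :f)
             (bar :on :b :type :optional)]
            ;; b is expected to be nil when no row of bar matches,
            ;; so only nil-safe keyword lookups are used here
            (fn [f b]
              {:name (:name f)
               :val  (:val b)})))

;; Runs the query locally and prints the resulting values
(println (pig/dump joined))
```

Because bar is marked :type :optional, rows of foo with no matching row in bar should still appear in the result, which is why the join function avoids anything that would fail on a nil b.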