
Yet another thin Scala wrapper for Hadoop

Licensed under the GNU Lesser General Public License v3.0 (LGPL-3.0).

Scadoop

By Jonathan Clark

One day, the obscene verbosity of writing MapReduce applications using the Hadoop Java API caused me to fly into a fit of blinding rage. Ergo, Scadoop.

Introduction

We draw heavily from David Hall's SMR and the Python Dumbo module.

What Scadoop does differently:

  • We use implicit String-to-Text conversion (a minimal sketch follows this list).
  • You only need to declare the Writable types of your mapper and reducer once. Ever.
  • You can specify closures for your mapper and reducer; they are serialized over the wire without any mangling into configuration strings (sketched after the word count example below).
  • Scadoop uses the Java API directly, making it faster than Dumbo while requiring far less code than writing the same job in Java.
  • For I/O-bound jobs, the Java API (and therefore Scadoop) should be faster than even C++ with Hadoop Pipes.
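
For illustration, here is a minimal sketch of what such an implicit conversion can look like in plain Scala; the object and method names here are hypothetical, not necessarily the ones Scadoop defines:

import org.apache.hadoop.io.Text

object TextConversions {
  // Hypothetical implicit: lets a plain String stand in wherever Hadoop's Text is expected
  implicit def stringToText(s: String): Text = new Text(s)
}

// With the implicit in scope, no explicit wrapping is needed:
//   import TextConversions._
//   val t: Text = "hello"  // expands to new Text("hello")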

Let's have a look at the classic word count example written in Scadoop. If you have a lot of extra reading time on your hands, you can also have a look at the original Java Hadoop API version at http://wiki.apache.org/hadoop/WordCount.

// Hadoop's Writable types; ScadoopApp, Pipeline, MapContext, and ReduceContext
// come from Scadoop itself and are assumed to be in scope.
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}

@serializable object WordCountApp extends ScadoopApp {
  val prefix = "#" // magically gets passed to all mappers/reducers via closure serialization
  val one = new IntWritable(1) // don't recreate this for every record

  // Emit (prefixed token, 1) for each token on each input line; the ignored
  // LongWritable is the byte offset of the line within the input file.
  def mapper(records: Iterator[(LongWritable,Text)], c: MapContext): Iterator[(Text,IntWritable)]
    = for( (_, line) <- records; tok <- line.split(' ').toIterator) yield (new Text(prefix+tok), one)

  // Sum all counts observed for each token
  def reducer(records: Iterator[(Text,Iterator[IntWritable])], c: ReduceContext): Iterator[(Text,IntWritable)]
    = for( (key, values) <- records) yield (key, new IntWritable(values.map(_.get).sum))

  val pipeline = Pipeline.add("Word Count Job", mapper, reducer, combiner=Some(reducer _))
  exit(pipeline.runWithExitCode(inDir = args(0), outDir = args(1), tmpDir = ""))
}
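
The prefix field above reaches every remote mapper and reducer because the closure itself is serialized, not encoded into configuration strings. Below is a minimal, self-contained sketch of that round trip using plain Java object serialization (an assumption about Scadoop's wire mechanism; the helper names are hypothetical):

import java.io._

object ClosureDemo {
  // Serialize an object to bytes and read it back, mimicking the trip to a remote task
  def roundTrip[T](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    in.readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val prefix = "#"                                 // captured by the closure below
    val tag: String => String = tok => prefix + tok  // a closure over prefix
    val remote = roundTrip(tag)                      // the captured value survives the trip
    println(remote("word"))                          // prints "#word"
  }
}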

NOTE: While I have done some basic tests on this code, they have only been surface-level. A full test suite and thorough testing across many environments are still pending. You should help. :-) Let me know how things go for you. Until then, I recommend doing a bit of testing of your own before deploying Scadoop in any critical application.

Building

export HADOOP_HOME=/the/path
export SCALA_HOME=/the/path
ant

Running the Example

./run_example.sh
less example/out/part-r-*