mrjob
mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. It also works with your own Hadoop cluster.
Some important features:
- Run jobs on EMR, your own Hadoop cluster, or locally (for testing).
- Write multi-step jobs (one map-reduce step feeds into the next)
- Duplicate your production environment inside Hadoop
  - Upload your source tree and put it in your job's $PYTHONPATH
  - Run make and other setup scripts
  - Set environment variables (e.g. $TZ)
  - Easily install Python packages from tarballs (EMR only)
  - Setup handled transparently by the mrjob.conf config file
- Automatically interpret error logs from EMR
- SSH tunnel to the Hadoop job tracker on EMR
- Minimal setup
  - To run on EMR, set $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY
  - To run on your Hadoop cluster, install simplejson and make sure $HADOOP_HOME is set.
Installation
From PyPI:
    pip install mrjob
From source:
    python setup.py install
A Simple Map Reduce Job
Code for this example and more lives in mrjob/examples.
    """The classic MapReduce job: count the frequency of words.
    """
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner(self, word, counts):
            yield (word, sum(counts))

        def reducer(self, word, counts):
            yield (word, sum(counts))


    if __name__ == '__main__':
        MRWordFreqCount.run()
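To see what the mapper and reducer above compute without installing mrjob or Hadoop, here is a minimal sketch in plain Python. It is not how mrjob runs jobs. The `run_word_count` driver is a hypothetical stand-in for the grouping that Hadoop performs between the map and reduce phases, and the combiner is left out because it only pre-aggregates counts and does not change the result.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit (word, 1) for every word in the line, like the job's mapper."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum all counts collected for one word, like the job's reducer."""
    yield word, sum(counts)


def run_word_count(lines):
    """Hypothetical in-memory driver (an assumption, not part of mrjob).

    Groups mapper output by key, then feeds each group to the reducer.
    """
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)

    results = {}
    for word in sorted(grouped):
        for key, total in reducer(word, grouped[word]):
            results[key] = total
    return results


if __name__ == "__main__":
    print(run_word_count(["the quick brown fox", "The lazy dog's nap"]))
```

The real job differs mainly in that input lines arrive from files and the grouping happens across machines, but the per-key logic is the same.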
Try It Out!
    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts
Setting up EMR on Amazon
- Create an Amazon Web Services account
- Sign up for Elastic MapReduce
- Get your access and secret keys (click "Security Credentials" on your account page)
- Set the environment variables $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY accordingly
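Before launching a job with -r emr, it can help to confirm that both variables from the last step are actually visible to Python. The following is a small sketch using only the standard library. The helper name `missing_credentials` is hypothetical and is not part of mrjob.

```python
import os

# The two variables named in the setup steps above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credentials(environ=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; mrjob does its own checks.
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Set these before running with -r emr: " + ", ".join(missing))
    else:
        print("AWS credentials found in the environment.")
```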
Advanced Configuration
To run in other AWS regions, upload your source tree, run make, and use other advanced mrjob features, you'll need to set up mrjob.conf. mrjob looks for its conf file in:
- The contents of $MRJOB_CONF
- ~/.mrjob.conf
- /etc/mrjob.conf
See the mrjob.conf documentation for more information.
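The lookup order listed above can be sketched in a few lines of Python. This is an illustration only, not mrjob's own implementation: the function names are hypothetical, and it assumes that the first location that exists on disk is the one used.

```python
import os


def candidate_conf_paths(environ=None):
    """List the conf locations in the order this README gives them."""
    if environ is None:
        environ = os.environ
    paths = []
    # $MRJOB_CONF is only a candidate when it is set to something.
    if environ.get("MRJOB_CONF"):
        paths.append(environ["MRJOB_CONF"])
    paths.append(os.path.expanduser("~/.mrjob.conf"))
    paths.append("/etc/mrjob.conf")
    return paths


def find_conf(environ=None, exists=os.path.exists):
    """Return the first candidate that exists, or None.

    Assumption: the first existing file wins. The `exists` parameter is
    injectable so the lookup can be exercised without touching the disk.
    """
    for path in candidate_conf_paths(environ):
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    print(find_conf())
```

Running it prints the path that this sketch would pick on your machine, which is a quick way to check which conf file is likely in effect.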
Links
- source: <http://github.com/Yelp/mrjob>
- documentation: <http://packages.python.org/mrjob/>
- discussion group: <http://groups.google.com/group/mrjob>
- Hadoop MapReduce: <http://hadoop.apache.org/mapreduce/>
- Elastic MapReduce: <http://aws.amazon.com/documentation/elasticmapreduce/>
- PyCon 2011 mrjob overview: <http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-mrjob-distributed-computing-for-everyone-4898987/>
Thanks to Greg Killion (blind-works.net) for the logo.