/httpmr

Automatically exported from code.google.com/p/httpmr

Primary LanguagePython

= HTTP Map/Reduce: A scalable data processing framework for people with web clusters. =

_Status: Beta_

HTTPMR is an implementation of Google's famous Map/Reduce data processing model on clusters of HTTP servers.

HTTPMR tries to make only the following assumptions about the computing environment:

  * Machines can be accessed only via HTTP requests.
  * Requests are assigned randomly to a set of machines.
  * Requests have timeouts on the order of several seconds.
  * There is a storage system that is accessible by code receiving HTTP requests.
  * The data being processed can be broken up into many, many small records, each having a unique identifier.
  * The storage system can accept >, <= range restrict operations on the data's unique identifiers.
  * Jobs are controlled by a web spidering system (such as wget).

Driven primarily by the needs of users of Google AppEngine (http://appengine.google.com/) for a robust data processing system, HTTMR will hopefully be written in a general-enough way to work in many web clusters.  Bringing HTTMR up in a new environment should require only implementing a few interfaces to the data storage system.

= Example: =

{{{
import wsgiref
from google.appengine.ext import webapp
from httpmr import appengine
from httpmr import base
from wsgiref import handlers
from google.appengine.ext import db


class Document(db.Model):
  title = db.StringProperty(required=True)
  contents = db.TextProperty(required=True)


class DocumentIndex(db.Model):
  token = db.StringProperty(required=True)
  document_titles = db.StringListProperty()


class TokenMapper(base.Mapper):
  
  def Map(self, document_title, document):
    for token in list(set(document.contents.split(" "))):
      if token:
        yield token, document_title


class TokenReducer(base.Reducer):
  
  def Reduce(self, token, document_titles):
    yield None, DocumentIndex(token=token,
                              document_titles=document_titles)


class ConstructDocumentIndexMapReduce(appengine.AppEngineMaster):
  
  def __init__(self):
    self.QuickInit("construct_token_index",
                   mapper=TokenMapper(),
                   reducer=TokenReducer(),
                   source=appengine.AppEngineSource(Document.all(),
                                                    "title"),
                   sink=appengine.AppEngineSink())


def main():
  application = webapp.WSGIApplication([('/construct_document_index',
                                         ConstructDocumentIndexMapReduce)],
                                       debug=True)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
  main()
}}}

A specific URL is then mapped to HandleMapReduce, and you're off to the races!