django-hadoop: A Python repository from dadasoz

Hadoop integration for Django (through the Oozie REST API or local job execution). This app lets you run MapReduce jobs from Django views.

Installation

Install this Django app as usual (urls.py, settings.py, etc.).
Prepare common Hadoop-related settings in your project's settings.py:

HADOOP_MAIN         = 'node'
NAMENODE            = 'hdfs://%s:8020' % HADOOP_MAIN          # Hadoop namenode
JOB_USER            = 'oozie'                                 # Hadoop user for jobs & HDFS stuff
JOB_MANAGER_CLASS   = 'your_app.your_module.CustomJobManager' # your JobManager subclass

Choose job runner

a. Oozie job runner (submits MR jobs through Oozie) [RECOMMENDED]. Settings:

OOZIE_SERVER        = 'http://%s:11000' % HADOOP_MAIN         # Oozie RESTful server
HDFS_APP_DIR        = '/user/%s/your-app-in-hdfs' % JOB_USER  # Oozie application dir in HDFS
HDFS_APP_NAME       = 'YourHadoopApp.jar'                     # Oozie application name (in HDFS)

Upload the Oozie job files (*.jar, workflow.xml) to HDFS. Add a Site in the Django admin with an IP address or domain reachable from the host running Oozie, then set SITE_ID in your project settings file.

b. Local job runner (submits MR jobs locally through a pipe). Settings:

HADOOP_HOME         = '/usr/lib/hadoop-0.20'                  # path to Hadoop client home 
JOB_JAR_PATH        = '/home/%s/YourHadoopApp.jar' % JOB_USER # path to jar on the local FS
HADOOP_JOB_CMD      = '%s/bin/hadoop jar %s' % (HADOOP_HOME,  # Hadoop command for running the job
                                                JOB_JAR_PATH)

Install a Hadoop client for reading from HDFS (required in both cases to read job results).
[OPTIONAL] Add a crontab entry that periodically runs python manage.py process_jobs. This command starts new (and failed) jobs from the database.
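
For the local runner, it can help to see what the command setting expands to. The sketch below assumes the jar path is meant to be formatted with JOB_USER (as HDFS_APP_DIR is) and uses shlex.split purely for illustration; how django-hadoop itself launches the command is not shown in this README.

```python
import shlex

JOB_USER = 'oozie'
HADOOP_HOME = '/usr/lib/hadoop-0.20'
# Assumption: the jar lives in the job user's home directory.
JOB_JAR_PATH = '/home/%s/YourHadoopApp.jar' % JOB_USER
HADOOP_JOB_CMD = '%s/bin/hadoop jar %s' % (HADOOP_HOME, JOB_JAR_PATH)

# Split the command string into an argument list, the form that
# subprocess.Popen(argv, stdout=subprocess.PIPE) would accept.
argv = shlex.split(HADOOP_JOB_CMD)
print(argv)
```

If the printed list still contains a literal %s, one of the settings was not formatted.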

JobManager customization

JobManager is customized through inheritance:

class JobManager(object):
    _job_runner = RestJobRunner             # override with your custom runner (optional)
    _job_model = CommonJob                  # override with your custom model (optional)
    _job_result_parser = DummyResultParser  # override with your custom result parser (required)

Processing results

A result parser can be subclassed from results.JobResultParser.

class CustomJobManager(JobManager):
    _job_result_parser = CustomResultParser # your result parser implementation

Implement the parse_results method and process self._result_dict however you need.
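
A minimal sketch of such a parser follows. The JobResultParser class here is a stand-in so the example runs on its own; its constructor signature and the shape of _result_dict (word counts stored as strings) are assumptions, not the library's actual behaviour.

```python
class JobResultParser(object):
    """Stand-in for results.JobResultParser (hypothetical, illustration only)."""

    def __init__(self, result_dict):
        self._result_dict = result_dict

    def parse_results(self):
        raise NotImplementedError


class CustomResultParser(JobResultParser):
    """Converts raw string counts to integers and adds a total."""

    def parse_results(self):
        # Assumption: values are numeric strings keyed by word.
        counts = dict((key, int(value))
                      for key, value in self._result_dict.items())
        return {'counts': counts, 'total': sum(counts.values())}


parser = CustomResultParser({'spam': '3', 'eggs': '4'})
summary = parser.parse_results()
print(summary['total'])
```

In a real project the parser would subclass the library's own base class and would more likely save the parsed values to a model than return them.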

Changing runner behaviour

A job runner can be subclassed from one of:

RestJobRunner implements Oozie job runner.
LocalJobRunner implements local job runner.
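
As an illustration of changing runner behaviour, the sketch below wraps run_job in a simple retry loop. The LocalJobRunner defined here is a stub so the example is self-contained; the max_attempts attribute and the failures_left key are hypothetical names, and the real runner's constructor may differ.

```python
class LocalJobRunner(object):
    """Stand-in base class; the real one ships with django-hadoop."""

    def __init__(self, job):
        self.job = job

    def run_job(self):
        # Stub behaviour: fail until the job has no simulated failures left.
        if self.job['failures_left'] > 0:
            self.job['failures_left'] -= 1
            return False
        return True


class RetryingJobRunner(LocalJobRunner):
    """Retries a failed job up to a fixed number of attempts."""

    max_attempts = 3  # hypothetical tuning knob

    def run_job(self):
        for _ in range(self.max_attempts):
            if super(RetryingJobRunner, self).run_job():
                return True
        return False


runner = RetryingJobRunner({'failures_left': 2})
succeeded = runner.run_job()
print(succeeded)
```

A runner like this would be plugged in by setting _job_runner on your JobManager subclass.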

Example

        job = CustomJobManager.get_model().create()           # create model instance
        rest_job_runner = CustomJobManager.get_runner()(job)  # create job runner instance
        succeeded = rest_job_runner.run_job()                 # start a job

You can get the job model, runner, and result parser via the JobManager class methods:
get_model()
get_runner()
get_result_parser()
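
The README does not show the bodies of these methods. The sketch below assumes they simply return the corresponding class attributes, which is enough to show how an override in a subclass is picked up; all classes here are empty stand-ins.

```python
class RestJobRunner(object):
    pass


class CommonJob(object):
    pass


class DummyResultParser(object):
    pass


class JobManager(object):
    """Stand-in manager: accessors return the class-level attributes."""
    _job_runner = RestJobRunner
    _job_model = CommonJob
    _job_result_parser = DummyResultParser

    @classmethod
    def get_model(cls):
        return cls._job_model

    @classmethod
    def get_runner(cls):
        return cls._job_runner

    @classmethod
    def get_result_parser(cls):
        return cls._job_result_parser


class CustomResultParser(DummyResultParser):
    pass


class CustomJobManager(JobManager):
    _job_result_parser = CustomResultParser  # only the parser is overridden


print(CustomJobManager.get_result_parser().__name__)
print(CustomJobManager.get_runner().__name__)
```

Because the accessors are class methods, the subclass's attribute wins, while anything not overridden falls back to the base class default.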

Task state can be read as JSON from the task view. To update a task's status manually, send a GET request to 'hadoop-notification-view', passing the hadoop_job_id and status variables. Oozie calls this view automatically whenever the status changes.
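
A manual status update is a GET request carrying those two variables. The sketch below only builds the URL; the base path, job id, and helper name are made up for illustration (in a project you would resolve the path with Django's reverse() on 'hadoop-notification-view'), and it uses Python 3's urllib.parse rather than the Python 2 module of the Django 1.4 era.

```python
from urllib.parse import urlencode


def build_notification_url(view_url, hadoop_job_id, status):
    """Append the two query variables the notification view expects."""
    query = urlencode([('hadoop_job_id', hadoop_job_id), ('status', status)])
    return '%s?%s' % (view_url, query)


# Hypothetical host, path, and job id.
url = build_notification_url('http://example.com/hadoop/notify/',
                             '0000001-oozie-W', 'SUCCEEDED')
print(url)
```

Fetching the resulting URL with any HTTP client has the same effect as the callback Oozie sends.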

Admin

All your model fields are exposed in the admin with the help of ExposeAllFieldsMixin. You can register your own ModelAdmin if you don't like this behaviour.

Tested with Hadoop 0.20.2-cdh3u5 and Django 1.4 (1.75).

P.S. There are a lot of things still to do. Let me know if you want a feature.
