/ops-hadoop

Primary LanguageJavaApache License 2.0Apache-2.0

OPS

Build Status GitHub

OPtimized Shuffle is a distributed shuffle management system which focuses on optimizing the shuffle phase of DAG computing frameworks. The name OPS is originated from the ancient Roman religion, she was a fertility deity and earth goddess of Sabine origin.

System Overview

Architecture

Figure 1: OPS Architecture

Internal View

Figure 2: Internal View Comparison between Legacy Hadoop and Hadoop with OPS

LifeCycle

Figure 3: Lifecyle of OPS

Structure

Figure 4: OPS Structure

ETCD File Structure

  • ops/
    • jobs/
      • job-${jobId}: JobConf
    • mapTaskFirstAlloc/
      • mapTaskFirstAlloc-${jobId}: MapTaskAlloc
    • mapTaskSecondAlloc/
      • mapTaskSecondAlloc-${jobId}: MapTaskAlloc
    • reduceTaskAlloc/
      • reduceTaskAlloc-${jobId}: ReduceTaskAlloc
    • shuffle/
      • reduceNum/
        • reduceNum-${nodeIp}-${jobId}-${reduceId}: num
      • mapCompleted/
        • mapCompleted-${nodeIp}-${jobId}-${mapId}: MapReport
      • shuffleCompleted/
        • shuffleCompleted-${dstNodeIp}-${jobId}-${num}-${mapId}: path
      • indexRecords/
        • indexRecords-${nodeIp}-${jobId}-${mapId}: CollectionConf
    • tasks/
      • reduceTasks/
        • reduceTask-${nodeIp}-${jobId}-${reduceId}: ReduceConf

Compatibility

In order to use OPS, the modification on DAG computing frameworks is inevitable. For now, we only implement OPS on Hadoop MapReduce. Our customized Hadoop is available in here. We believe that the costs of enabling OPS on other DAG computing frameworks are also very low.

Build

$ mvn clean install

Run

OPS: Optimized Shuffle

Usage:
  ops.sh [command]

Commands:
  master         OpsMaster
  worker         OpsWorker

Options:
  -h, --help     Show usage

Use ops.sh [command] --help for more information about a command.

License

Apache License 2.0