/MapReduce

In C#. Master-Worker. From scratch. No Hadoop. Done. Depend on DFS.

Primary LanguageC#MIT LicenseMIT

MapReduce

  • Distributed systems.
  • Object-oriented programming.
  • Educational only.

How to run

dotnet run -p src/MapReduce.Sample

How to use

Principle

map():

part of object -> list<(key, value)>
return list<(key, value)>

combine():

hash<key, list<value>>
foreach ((key,value) in list<(key, value)>)
{
    hash<key, list<value>>[key].Add(value)
}
return hash<key, list<value>>

partition():

hash<partitionIndex, hash<key, list<value>>>

reduce():

hash<key, valueAggregated>
foreach ((key,values) in hash<key, list<value>>)
{
    foreach (value in values)
    {
        hash<key, valueAggregated>[key] += value
    }
}
// foreach (key,value) in other list<(key, value)>
// omitted
return hash<key, valueAggregated>

  • each intermediate file is a partition.
  • ith reducer take every ith partition in each mapper's local disk.

Master Data Structure

  • class master
    • List<MapTask>
    • List<ReduceTask>
    • List<Worker>
  • relative data structures
    • enum state { idle, in-progress, completed }
      • idle:
        • task waiting to be scheduled.
        • the task is not done yet.
    • class MapTask { state, CompletedFile, ... }
    • class ReduceTask { state, CompletedFile, ... }
    • class CompletedFile { location, size }

Failure

  • worker failure
    • master pings worker.
      • no response in amount of time -> worker failed.
  • master failure
    • exception on user code.
    • master writes data structures in checkpoints periodically.
    • master gives the same task to a different worker.

Use Cases

  • When map worker completes a map task
    1. worker ---{file names}--> master.
    2. master saves file names to data structure.
  • When reduce worker completes a reduce task
    1. rename temp output file to final output file.
  • Task processing
    • worker
      1. The workers talk to the master via RPC.
      2. worker ask the master for a task
      3. worker read the task's input from one or more files,
      4. worker executes the task,
      5. worker writes the task's output to one or more files.

Partitioning

Assignment

Your job is to implement a distributed MapReduce, consisting of two programs, the master and the worker. There will be just one master process, and one or more worker processes executing in parallel. In a real system the workers would run on a bunch of different machines, but for this lab you'll run them all on a single machine. The workers will talk to the master via RPC. Each worker process will ask the master for a task, read the task's input from one or more files, execute the task, and write the task's output to one or more files. The master should notice if a worker hasn't completed its task in a reasonable amount of time (for this lab, use ten seconds), and give the same task to a different worker.