mapreduce-design-patterns

Hadoop ecosystem recipes for ETL, common data transformations, & iterative algorithms

As with an RDBMS, MapReduce deals in tuples of data: mappers emit <key, value> pairs, and each reducer call receives a <key, multiValue> group, i.e. one key together with all of its values.
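
In the Hadoop Java API these tuple shapes appear directly in the Mapper and Reducer type parameters. A minimal sketch, with illustrative class names that are not part of this repo:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: one input record in, zero or more <key, value> pairs out.
class PairEmitter extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
            throws IOException, InterruptedException {
        ctx.write(new Text(record.toString()), new IntWritable(1)); // emit <key, value>
    }
}

// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: each reduce() call receives one key
// plus an Iterable over ALL of that key's values (the "multiValue" part of the tuple).
class GroupConsumer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) total += v.get();
        ctx.write(key, new IntWritable(total)); // one output tuple per key group
    }
}
```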

As with dplyr in R, the basic data transformation operations are listed below (a small job combining several of them is sketched after the list):

  • Filter
  • Sort (done via the shuffle stage)
  • Aggregate (count, sum, average, etc.)
  • Remap / Rename / Re-order
  • Intersect / Join (combining two datasets on a shared key)
  • Group By (done in the reducer, once the shuffle step has routed every value for a given key to the same reducer node)
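
A minimal sketch of how several of these operations compose in one job: the mapper filters malformed rows and remaps each record to a <category, amount> pair, the shuffle sorts and groups like keys, and the reducer aggregates an average per category. The CSV layout ("userId,category,amount"), class names, and paths are assumptions for illustration, not code from this repo:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageAmountPerCategory {

    // Filter + Remap: drop bad rows, re-key each record as <category, amount>.
    public static class FilterAndRemapMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");   // assumed layout: userId,category,amount
            if (fields.length != 3) return;                  // Filter: drop malformed rows
            try {
                double amount = Double.parseDouble(fields[2]);
                ctx.write(new Text(fields[1]), new DoubleWritable(amount)); // Remap to <category, amount>
            } catch (NumberFormatException e) {
                // Filter: drop rows whose amount is not numeric
            }
        }
    }

    // Group By + Aggregate: the shuffle has already grouped all amounts for one category.
    public static class AverageReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text category, Iterable<DoubleWritable> amounts, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable a : amounts) {
                sum += a.get();
                count++;
            }
            if (count > 0) {
                ctx.write(category, new DoubleWritable(sum / count)); // Aggregate: average per category
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "average amount per category");
        job.setJarByClass(AverageAmountPerCategory.class);
        job.setMapperClass(FilterAndRemapMapper.class);
        job.setReducerClass(AverageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Submit it with a standard `hadoop jar` invocation (jar name and paths illustrative): `hadoop jar recipes.jar AverageAmountPerCategory /input/transactions /output/avg-by-category`.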