Spark_MR_design_patterns

Implementation of MapReduce design patterns in Spark (PySpark)

Primary language: Python

Summarization pattern

  • Min, max and count
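
A minimal sketch of the min/max/count summarization in plain Python, assuming records are `(key, value)` pairs. The names `to_summary`, `merge`, and `summarize`, and the sample data, are hypothetical and not taken from this repository. In PySpark the same two functions could plausibly be used as `rdd.mapValues(to_summary).reduceByKey(merge)`.

```python
def to_summary(value):
    """Map step: turn one value into a (min, max, count) triple."""
    return (value, value, 1)


def merge(a, b):
    """Reduce step: combine two (min, max, count) triples.

    The operation is associative and commutative, so it can also act
    as a combiner on partial results.
    """
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])


def summarize(records):
    """Local stand-in for mapValues(to_summary).reduceByKey(merge)."""
    result = {}
    for key, value in records:
        triple = to_summary(value)
        result[key] = merge(result[key], triple) if key in result else triple
    return result


# Hypothetical sample: (user_id, hour of day a post was created)
records = [("u1", 9), ("u2", 14), ("u1", 22), ("u1", 3), ("u2", 8)]
summary = summarize(records)
```

Because `merge` works on partial triples, the same function serves both as combiner and reducer, which is what keeps the shuffle small.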

Filter pattern

  • Bloom filter
  • Top 10
  • Distinct
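
A minimal sketch of the Bloom filter pattern in plain Python, assuming the goal is to keep only records whose key belongs to a small known set. The `BloomFilter` class, its default sizes, and the sample user IDs are illustrative assumptions, not the repository's implementation. In PySpark the trained filter could be shared with `sc.broadcast(...)` and applied inside `rdd.filter(...)`.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


# Train the filter on a hypothetical set of user IDs of interest.
hot_users = BloomFilter()
for user_id in ["u1", "u7", "u42"]:
    hot_users.add(user_id)

# Filter step: keep records whose key may be in the set.
records = [("u1", "post a"), ("u3", "post b"), ("u42", "post c")]
kept = [r for r in records if hot_users.might_contain(r[0])]
```

Every record whose key was added is guaranteed to pass. A record with an unknown key passes only on a false positive, so a downstream exact check is needed when false positives are not acceptable.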

Data organization pattern

  • Structured to hierarchical
  • Partitioning
  • Binning
  • Shuffling
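
A minimal sketch of binning in plain Python, assuming each record is a post with a list of tags and that a post may land in more than one bin. The bin names, the record layout, and the function names are hypothetical. In PySpark the map step could plausibly be expressed as `rdd.flatMap(bins_for)` followed by a per-bin `filter` or a `groupByKey()`.

```python
from collections import defaultdict

# Hypothetical bins: one per tag of interest.
BIN_TAGS = ("algorithms", "complexity-theory", "graphs")


def bins_for(record):
    """Map step: emit one (bin, record) pair for every bin the record matches."""
    return [(tag, record) for tag in record["tags"] if tag in BIN_TAGS]


def bin_records(records):
    """Local stand-in for flatMap(bins_for) followed by grouping by bin."""
    bins = defaultdict(list)
    for record in records:
        for bin_name, rec in bins_for(record):
            bins[bin_name].append(rec["id"])
    return dict(bins)


posts = [
    {"id": 1, "tags": ["algorithms", "graphs"]},
    {"id": 2, "tags": ["automata"]},
    {"id": 3, "tags": ["graphs"]},
]
binned = bin_records(posts)
```

Unlike partitioning, where each record goes to exactly one partition, this sketch lets post 1 appear in two bins and drops post 2, which matches no bin.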

Join pattern

  • Map-side join
  • Reduce-side join
  • Replicated join
  • Composite join
  • Cartesian join
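
A minimal sketch of a reduce-side inner join in plain Python, assuming two inputs keyed by user ID. The function name and sample data are hypothetical. In PySpark the equivalent result shape, `(key, (left_value, right_value))`, is what `users_rdd.join(comments_rdd)` returns on pair RDDs.

```python
from collections import defaultdict


def reduce_side_join(users, comments):
    """Inner join of two (key, value) collections.

    Map step: tag each value with its source by placing it in the left or
    right list. Reduce step: for each key, emit the cross product of the
    left and right values.
    """
    grouped = defaultdict(lambda: ([], []))
    for user_id, name in users:
        grouped[user_id][0].append(name)
    for user_id, text in comments:
        grouped[user_id][1].append(text)

    joined = []
    for user_id in sorted(grouped):
        left, right = grouped[user_id]
        for name in left:
            for text in right:
                joined.append((user_id, (name, text)))
    return joined


users = [("u1", "alice"), ("u2", "bob")]
comments = [("u1", "nice"), ("u1", "thanks"), ("u3", "orphan")]
joined = reduce_side_join(users, comments)
```

Keys present on only one side (`u2`, `u3`) produce no output here. Emitting them with a placeholder on the missing side would turn this into an outer join.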

Dataset: Computer Science Stack Exchange (cs.stackexchange.com) dataset

Reference: Donald Miner and Adam Shook, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems