Apache Spark with WASP

WASP is a workload-aware task scheduler and partitioner for in-memory MapReduce frameworks. Our scheduler consists of:

An analytical model that predicts the spark.default.parallelism and SPARK_WORKER_CORES parameters
Runtime monitoring of CPU utilization, data spills, and GCs
A scheduler that maximizes CPU utilization while minimizing the overhead of data spills and GCs

What is WASP

WASP jointly optimizes N_partitions and N_threads at runtime. The two parameters are defined as follows:

N_partitions: how many data partitions are created from a single RDD (spark.default.parallelism)
N_threads: how many threads are allocated to a single executor (SPARK_WORKER_CORES)
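
To make the two parameters concrete, the sketch below derives what a given pair implies for a single stage: the size of each partition, the number of tasks that can run at once, and the number of scheduling waves. It is a back-of-the-envelope illustration, not WASP's prediction model. The function name, the assumption of evenly split input, and the example cluster numbers are all hypothetical.

```python
import math

def describe_setting(input_bytes, n_partitions, n_executors, n_threads):
    """Derive simple quantities implied by N_partitions and N_threads."""
    # Each partition becomes one task; assuming evenly split input,
    # this is the amount of data one task processes.
    partition_bytes = input_bytes / n_partitions
    # Tasks that can run at the same time across all executors.
    slots = n_executors * n_threads
    # Number of scheduling waves needed to run every task of one stage.
    waves = math.ceil(n_partitions / slots)
    return {"partition_bytes": partition_bytes, "slots": slots, "waves": waves}

# Hypothetical cluster: 64 GiB input, 256 partitions, 4 executors, 8 threads each.
info = describe_setting(64 * 2**30, 256, 4, 8)
print(info)
```

Raising N_partitions shrinks each task's share of the data, while raising N_threads increases how many of those tasks share one executor's memory at the same time, which is why the two are tuned together.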

Spark often suffers performance degradation when N_partitions and N_threads are set suboptimally. Users usually set these two parameters empirically (e.g., typical guidelines suggest 2-3 tasks per CPU core), which yields suboptimal performance because of excessive memory pressure or underutilized concurrency. WASP first predicts N_partitions and N_threads with analytical models. It then monitors memory pressure and concurrency at runtime and dynamically tunes both parameters. As a result, WASP achieves shorter execution time and higher resource utilization than unoptimized Spark.
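
The predict-then-tune idea can be sketched as a simple feedback step. This is a simplified illustration under stated assumptions, not WASP's actual policy, which is described in the paper cited below. The `Metrics` fields, the thresholds (`gc_limit`, `cpu_target`), and the adjustment sizes are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    cpu_util: float     # average CPU utilization in [0, 1]
    spill_bytes: int    # bytes spilled to disk during the last interval
    gc_fraction: float  # fraction of task time spent in GC, in [0, 1]

def tune(n_partitions, n_threads, m, max_threads,
         gc_limit=0.10, cpu_target=0.85):
    """Return an adjusted (n_partitions, n_threads) pair for the next interval."""
    # Assumed signal: any spill or a high GC share indicates memory pressure.
    memory_pressure = m.spill_bytes > 0 or m.gc_fraction > gc_limit
    if memory_pressure:
        # Smaller partitions and fewer concurrent tasks both reduce
        # the working set held in executor memory at one time.
        return n_partitions * 2, max(1, n_threads - 1)
    if m.cpu_util < cpu_target and n_threads < max_threads:
        # Memory is healthy but cores are idle: raise concurrency.
        return n_partitions, n_threads + 1
    # Neither problem observed: keep the current setting.
    return n_partitions, n_threads

# Start from a (hypothetical) predicted setting and react to one observation.
print(tune(200, 8, Metrics(cpu_util=0.9, spill_bytes=1 << 20, gc_fraction=0.05), 16))
```

In this sketch memory pressure takes priority over idle cores, on the assumption that spills and GC pauses cost more than a temporarily unused thread.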

How to Operate?

Add three options to the HiBench configuration file (or another configuration file)
- spark.input.size: estimated input data size in HDFS (or another DFS)
- spark.total.executor.number: total number of executors in your cluster
- spark.total.core.number: total number of cores in one executor
Supported Spark versions
- 1.6.1, 2.0.1
Supported benchmarks
- WordCount, Bayes, Kmeans, TeraSort, Sort, PageRank (HiBench v5.0)
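
As a rough illustration of the three options listed above, the helper below renders them as whitespace-separated `key value` lines, the layout used by Spark-style properties files. The helper name and the example values are hypothetical, and the unit expected for spark.input.size is not stated here, so the number shown is only a placeholder.

```python
def wasp_options(input_size, n_executors, cores_per_executor):
    """Render the three WASP options as 'key value' lines."""
    options = {
        "spark.input.size": input_size,              # placeholder value, unit not specified
        "spark.total.executor.number": n_executors,
        "spark.total.core.number": cores_per_executor,
    }
    return "\n".join(f"{key} {value}" for key, value in options.items())

# Hypothetical cluster: 4 executors with 8 cores each.
print(wasp_options(32000000000, 4, 8))
```

The printed lines can be appended to the configuration file you already use for the benchmark run.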

Demo Video

Citation

Please cite the following paper if you use WASP:

Jointly Optimizing Task Granularity and Concurrency for In-Memory MapReduce Frameworks. Jonghyun Bae, Hakbeom Jang, Wenjing Jin, Jun Heo, Jaeyoung Jang, Joo-Young Hwang, Sangyeun Cho and Jae W. Lee. Proceedings of the 2017 IEEE International Conference on Big Data (Big Data).

@INPROCEEDINGS{8257921,
  author={Bae, Jonghyun and Jang, Hakbeom and Jin, Wenjing and Heo, Jun and Jang, Jaeyoung and Hwang, Joo-Young and Cho, Sangyeun and Lee, Jae W.},
  booktitle={Proceedings of the 2017 IEEE International Conference on Big Data (Big Data)},
  title={Jointly optimizing task granularity and concurrency for in-memory mapreduce frameworks},
  year={2017},
  volume={},
  number={},
  pages={130-140},
  doi={10.1109/BigData.2017.8257921}}

License