TeraSort is a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data (or any other amount of data you want) on a given cluster. It is originally written to measure MapReduce performance of an Apache™ Hadoop® cluster. In this project, the code is re-written in Scala to measure the performance of Spark cluster. It is a benchmark that combines testing the Storage layer(HDFS) and Computation layers(YARN / Spark) of an Hadoop cluster.
A full TeraSort benchmark run consists of the following three steps:
- Generating the input data via TeraGen.
- Running the actual TeraSort on the input data.
- Validating the sorted output data via TeraValidate.
You do not need to re-generate input data before every TeraSort run (step 2). So you can skip step 1 (TeraGen) for later TeraSort runs if you are satisfied with the generated data.
$ sbt assembly
For each spark job metrics will be printed in the logs, you can redirect the logs to store in file and use it to compare. You may provide the additional configurations like spark.executor.memory, spark.driver.memory, spark.executor.instances etc.
Usage:
$ spark-submit --class com.lbattini.spark.terasort.TeraGen --deploy-mode cluster --master yarn spark-terasort-0.1.jar [output-size] [output-directory]
Example:
$ spark-submit --class com.lbattini.spark.terasort.TeraGen --deploy-mode cluster --master yarn spark-terasort-0.1.jar 1G /benchmarks/terasort/tera_input
Usage:
$ spark-submit --class com.lbattini.spark.terasort.TeraSort --deploy-mode cluster --master yarn spark-terasort-0.1.jar [input-directory] [output-directory]
Example:
$ spark-submit --class com.lbattini.spark.terasort.TeraSort --deploy-mode cluster --master yarn spark-terasort-0.1.jar /benchmarks/terasort/tera_input /benchmarks/terasort/tera_output
Usage:
$ spark-submit --class com.lbattini.spark.terasort.TeraValidate --deploy-mode cluster --master yarn spark-terasort-0.1.jar [input-directory]
Example:
$ spark-submit --class com.lbattini.spark.terasort.TeraValidate --deploy-mode cluster --master yarn spark-terasort-0.1.jar /benchmarks/terasort/tera_output /benchmarks/terasort/tera_validate
Replace --master yarn with --master "k8s master" and provide below configurations: spark.kubernets.container.image spark.kubernetes.driver.pod.name
This code is built based on https://github.com/ehiggs/spark/tree/terasort
, Ewan Higgs' terasort example. Thank you for great example.