/spark-on-aks-benchmark

Testing of Spark on AKS

Spark on Azure Kubernetes Service

Contents

File/folder          Description
.github              GitHub-specific configuration
.gitignore           Defines what to ignore at commit time
aks-spark-chart      Helm charts
benchmark            Benchmark test code
docs                 Project documentation
env                  Terraform to build the environment
results              Benchmark result images
spark                Spark Docker containers and config
CODE_OF_CONDUCT.md   Code of Conduct for this project
CONTRIBUTING.md      Guidelines for contributing to the sample
CHANGELOG.md         List of changes to the sample
LICENSE              The license for the sample
README.md            This README file
SECURITY.md          This SECURITY file
SUPPORT.md           The SUPPORT policy for this project

Prerequisites

This project requires the user to have access to the following:

  • An Azure Active Directory (AAD) tenant and the ability to create AAD applications
  • An Azure Subscription

This project also requires a development environment with the following tools installed

TPC-DS Benchmark toolkit

TPC-DS is an industry-standard benchmark developed by the Transaction Processing Performance Council (TPC). It is used to measure the performance of decision support solutions. The benchmark specification and provided tools may be accessed at www.tpc.org.

This project implements a derivative of the TPC-DS benchmark, executed using the Databricks SQL perf libraries. In this derivative benchmark, we evaluated and measured the performance of Spark SQL on Azure Kubernetes Service (AKS). Our tests were limited to the q64-v2.4, q70-v2.4, and q82-v2.4 queries.

Running the sample

Follow the steps described in the quick start guide to set up and run the benchmark.

Environment setup

Kubernetes Node pools

The benchmark was executed on two node sizes (Standard_DS13_v2 and Standard_L8s_v2), in three node pool configurations that differ by OS disk type.

Node size          Node count   OS disk size   OS disk type
Standard_DS13_v2   5            256            Ephemeral
Standard_DS13_v2   5            256            Premium
Standard_L8s_v2    5            256            NVMe

Spark parameters

The following sparkConfig was used for this benchmark.

sparkConfig                     Value
spark.driver.cores              4
spark.driver.memory             16000m
spark.driver.memoryOverhead     2000m
spark.executor.cores            4
spark.executor.memory           16000m
spark.executor.memoryOverhead   2000m

Serializer         Value                                        Default
spark.serializer   org.apache.spark.serializer.KryoSerializer   Java serialization
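
As a rough illustration of how these values fit together, the sketch below holds the sparkConfig in a plain Python dictionary, renders it as `--conf key=value` arguments in the form `spark-submit` accepts, and adds memory and memoryOverhead to estimate the memory each pod asks for. The helper names (`parse_mib`, `pod_memory_mib`, `to_conf_args`) are hypothetical and not part of this repository. The sketch assumes that the `m` suffix denotes mebibytes, as in Spark's memory-size strings, and that a pod's memory request is memory plus memoryOverhead.

```python
# Illustrative only: the helper names below are hypothetical, not from this repository.
SPARK_CONFIG = {
    "spark.driver.cores": "4",
    "spark.driver.memory": "16000m",
    "spark.driver.memoryOverhead": "2000m",
    "spark.executor.cores": "4",
    "spark.executor.memory": "16000m",
    "spark.executor.memoryOverhead": "2000m",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def parse_mib(value: str) -> int:
    """Convert a size such as '16000m' to MiB (only the 'm' suffix is handled)."""
    if not value.endswith("m"):
        raise ValueError(f"unsupported memory size: {value!r}")
    return int(value[:-1])


def pod_memory_mib(config: dict, role: str) -> int:
    """Assumed pod memory request: spark.<role>.memory + spark.<role>.memoryOverhead."""
    heap = parse_mib(config[f"spark.{role}.memory"])
    overhead = parse_mib(config[f"spark.{role}.memoryOverhead"])
    return heap + overhead


def to_conf_args(config: dict) -> list:
    """Render the settings as spark-submit style '--conf key=value' arguments."""
    args = []
    for key, value in config.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print("driver pod memory (MiB):", pod_memory_mib(SPARK_CONFIG, "driver"))
    print("executor pod memory (MiB):", pod_memory_mib(SPARK_CONFIG, "executor"))
    print(" ".join(to_conf_args(SPARK_CONFIG)))
```

With the values in the table, the driver and each executor both come to 18000 MiB under these assumptions.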

Additional parameters are documented in this SparkApplication yaml.

Results

Please note that these are unaudited results and as such are not comparable with any officially published TPC-DS results.

Each query was executed 10 times, and the median execution time was recorded.
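
The median-of-10 reduction described above can be sketched as follows. The timings in `SAMPLE_RUNS` are hypothetical placeholders, not measurements from this benchmark, and `summarize` is an illustrative helper rather than code from this repository. Note that with an even number of runs, `statistics.median` averages the two middle values.

```python
from statistics import median

# Hypothetical timings in seconds, NOT measured results from this benchmark.
SAMPLE_RUNS = {
    "q64-v2.4": [10.0, 12.0, 11.0, 13.0, 9.0, 14.0, 10.5, 11.5, 12.5, 13.5],
}


def summarize(runs: dict, expected_iterations: int = 10) -> dict:
    """Return the median execution time per query, checking the iteration count."""
    summary = {}
    for query, timings in runs.items():
        if len(timings) != expected_iterations:
            raise ValueError(
                f"{query}: expected {expected_iterations} runs, got {len(timings)}"
            )
        summary[query] = median(timings)
    return summary


if __name__ == "__main__":
    for query, seconds in summarize(SAMPLE_RUNS).items():
        print(f"{query}: median {seconds} s")
```

The median is less sensitive than the mean to a single unusually slow or fast run, which is one plausible reason to prefer it for repeated query timings.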

  • Execution time (in seconds) of q64 with Ephemeral, Premium, and NVMe disks on D-series and L-series VMs

[Figure: q64 results]

  • Execution time (in seconds) of q82 and q70 with Ephemeral vs. Premium OS disks

[Figure: q64 results]

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Credits