Spark on Azure Kubernetes Service

File/folder	Description
`.github`	GitHub-specific configuration
`.gitignore`	Defines what to ignore at commit time.
`aks-spark-chart`	Helm Charts
`benchmark`	Benchmark test code
`docs`	Project documentation
`env`	Terraform to build environment
`results`	Benchmark results images
`spark`	Spark Docker containers and config
`CODE_OF_CONDUCT.md`	Code of Conduct for this project
`CONTRIBUTING.md`	Guidelines for contributing to the sample.
`CHANGELOG.md`	List of changes to the sample.
`LICENSE`	The license for the sample.
`README.md`	This README file.
`SECURITY.md`	The security policy for this project.
`SUPPORT.md`	The support policy for this project.

Prerequisites

This project requires the user to have access to the following:

An Azure Active Directory (AAD) tenant and the ability to create AAD applications
An Azure Subscription

This project also requires a development environment with the following tools installed

TPC-DS Benchmark toolkit

TPC-DS is an industry-standard benchmark developed by the Transaction Processing Performance Council (TPC). It is used to measure the performance of decision support solutions. The benchmark specification and provided tools may be accessed at www.tpc.org.

This project implements a derivative of the TPC-DS benchmark, executed using the Databricks spark-sql-perf libraries. In this derivative benchmark, we evaluated and measured the performance of Spark SQL on Azure Kubernetes Service (AKS). Our tests were limited to the q64-v2.4, q70-v2.4, and q82-v2.4 queries.

Running the sample

Follow the steps described in the quick start guide to set up and run the benchmark.

Environment setup

Kubernetes Node pools

The benchmark was executed on two node sizes, using the following node pool configurations.

Node size	Node count	OS disk size (GB)	OS disk type
Standard_DS13_v2	5	256	Ephemeral
Standard_DS13_v2	5	256	Premium
Standard_L8s_v2	5	256	NVMe

Spark parameters

The following sparkConfig was used for this benchmark.

sparkConfig	Value
spark.driver.cores	4
spark.driver.memory	16000m
spark.driver.memoryOverhead	2000m
spark.executor.cores	4
spark.executor.memory	16000m
spark.executor.memoryOverhead	2000m

Serializer	Value	Default
spark.serializer	org.apache.spark.serializer.KryoSerializer	Java serialization
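
As a rough illustration of what these settings mean for pod sizing: on Kubernetes, Spark requests memory for each driver or executor pod as the heap setting plus the memory overhead. The sketch below uses only the values from the tables above. The helper names (`parse_mib`, `pod_memory_mib`, `to_conf_flags`) are hypothetical and are not part of this repository, which passes its settings through a SparkApplication YAML rather than `spark-submit`.

```python
# Values copied from the sparkConfig and serializer tables above.
SPARK_CONFIG = {
    "spark.driver.cores": "4",
    "spark.driver.memory": "16000m",
    "spark.driver.memoryOverhead": "2000m",
    "spark.executor.cores": "4",
    "spark.executor.memory": "16000m",
    "spark.executor.memoryOverhead": "2000m",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def parse_mib(value: str) -> int:
    """Parse a Spark memory string with an 'm' (MiB) suffix, e.g. '16000m'."""
    if not value.endswith("m"):
        raise ValueError(f"expected a MiB value such as '16000m', got {value!r}")
    return int(value[:-1])


def pod_memory_mib(config: dict, role: str) -> int:
    """Memory requested for one pod of the given role: heap plus overhead."""
    heap = parse_mib(config[f"spark.{role}.memory"])
    overhead = parse_mib(config[f"spark.{role}.memoryOverhead"])
    return heap + overhead


def to_conf_flags(config: dict) -> list:
    """Render the settings as `--conf key=value` arguments (spark-submit form)."""
    flags = []
    for key, value in config.items():
        flags += ["--conf", f"{key}={value}"]
    return flags
```

With the values above, `pod_memory_mib(SPARK_CONFIG, "executor")` returns 18000, so each executor pod asks for 18000 MiB of memory and 4 cores. The actual number of pods that fit on a node also depends on what AKS reserves for system components, which this sketch does not model.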

Additional parameters are documented in this SparkApplication YAML file.

Results

Please note that these are unaudited results and as such are not comparable with any officially published TPC-DS results.

Each query was executed 10 times, and the median execution time was recorded.
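
The median-of-10 measurement can be sketched as follows. This is a minimal illustration of the technique, not the harness used for the published numbers, which were collected through the Databricks libraries. The function name `median_runtime` and the `run_query` callable are hypothetical stand-ins for whatever submits a query.

```python
import statistics
import time
from typing import Callable, List


def median_runtime(run_query: Callable[[], None], iterations: int = 10) -> float:
    """Run `run_query` `iterations` times; return the median wall-clock time in seconds."""
    timings: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    # With an even number of runs, statistics.median averages the two middle values.
    return statistics.median(timings)


if __name__ == "__main__":
    # Placeholder workload only; a real run would submit a query such as q64-v2.4.
    print(f"median: {median_runtime(lambda: sum(range(100_000))):.4f} s")
```

Using the median rather than the mean keeps a single slow iteration, for example a cold start, from skewing the reported time.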

Execution time (in seconds) of q64 with Ephemeral, Premium and NVMe disk on D and L series VMs

Execution time (in seconds) of q82 and q70 with Ephemeral vs. Premium OS disk

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Credits

Many thanks to @juan-lee and @alexeldeib for reviewing the AKS and NVMe setup.
Thanks to @alokjain-01 for looking into Spark parameters.
