Databricks Spark Observability Demo

Monitoring and profiling Spark applications in Databricks with Prometheus, Grafana and Pyroscope

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

Dive deep into performance details and uncover what the Spark execution plan typically doesn't show.

Built With

Databricks, Prometheus, Grafana, Pyroscope, Spark

Getting Started

This project demonstrates how to monitor and profile Spark applications in Databricks using Prometheus, Grafana and Pyroscope. The approach applies to any Spark application running on Databricks, including batch, streaming, and interactive workloads, as well as ephemeral Jobs.

In addition to the Prometheus, Pyroscope and Grafana stack, this project creates a small single-node Spark cluster and a set of init scripts that configure it to push metrics to the Prometheus Pushgateway and profiling data to Pyroscope.
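
To make the push model concrete, here is a minimal sketch of what pushing a single sample to the Pushgateway over HTTP looks like. It is not the project's init script: the job name `spark_demo`, the `instance` label, the metric `demo_active_tasks` and the `localhost:9091` address are all illustrative assumptions. The actual push is left commented out so the snippet runs without a Pushgateway.

```shell
# Build a Pushgateway URL whose grouping key is job + instance.
pushgateway_url() {
  # $1 = host:port, $2 = job name, $3 = instance label value
  printf 'http://%s/metrics/job/%s/instance/%s' "$1" "$2" "$3"
}

# Render one gauge sample in the Prometheus text exposition format.
render_gauge() {
  # $1 = metric name, $2 = value
  printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2"
}

# All values below are assumptions for illustration only.
url="$(pushgateway_url "localhost:9091" "spark_demo" "driver")"
payload="$(render_gauge "demo_active_tasks" 3)"

echo "$url"
echo "$payload"

# With a Pushgateway running, the sample could be pushed like this
# (the body must end with a newline, hence printf '%s\n'):
# printf '%s\n' "$payload" | curl --silent --show-error --data-binary @- "$url"
```

Grouping by job and instance means each driver or executor overwrites only its own group of metrics on every push, rather than clobbering samples pushed by other nodes.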

High-level architecture

       ┌─────────┐                                                                         
       │ Grafana │                                                                         
       └────▲────┘                                                                         
            │                           ┌────────────────┐                                 
            │                           │   Databricks   │                                 
   ┌────────┴────────┐                  │  Spark Cluster │                                 
   │   Prometheus    │                  │                │                                 
   └────────▲────────┘                  │                │                                 
            │                           │ ┌────────────┐ │                                 
            │                       ┌───┼─┤ Driver     ├─┼───┐                             
            │                       │   │ └────────────┘ │   │                             
            │                       │   │                │   │                             
┌───────────┴────────────┐  Metrics ▼   │ ┌────────────┐ │   ▼   APM Traces   ┌───────────┐
│ Prometheus Pushgateway │◄─────────────┼─┤ Executor   ├─┼───────────────────►│ Pyroscope │
└────────────────────────┘          ▲   │ └────────────┘ │   ▲                └───────────┘
                                    │   │                │   │                             
                                    │   │ ┌────────────┐ │   │                             
                                    └───┼─┤ Executor   ├─┼───┘                             
                                        │ └────────────┘ │                                 
                                        │                │                                 
                                        └────────────────┘                                 

Prerequisites

This demo uses Terraform to create all necessary resources in your Databricks workspace. You will need Terraform version 1.4.0 or later installed on your machine.

You'll also need a VM with network connectivity to the Databricks workspace. Preferably, create this VM in the same virtual network as the Databricks workspace, or in a peered network.

Databricks

You will need a Databricks account to run the demo. If you don't have one already, you can sign up for a free account at https://databricks.com/try-databricks.

Tooling

To send metrics to Prometheus and profiling data to Pyroscope, both services need to be set up and running. For convenience, the demo ships a complete Docker Compose setup, which you can find in the docker directory. The included Terraform configuration does not create these resources, so you need to start them yourself.

It can be started with the following command:

docker compose up
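
Once the containers are up, a quick readiness probe can confirm the stack is reachable before pointing the cluster at it. The sketch below assumes the usual default ports (9090 for Prometheus, 9091 for the Pushgateway, 3000 for Grafana) and `localhost` as the host; the Compose file in the repository may map them differently. Probing only happens when `RUN_CHECKS=1` is set, so the snippet is safe to run without the stack.

```shell
STACK_HOST="${STACK_HOST:-localhost}"   # assumption: services run locally

# Readiness endpoints of each service (default ports assumed).
ENDPOINTS="http://${STACK_HOST}:9090/-/ready
http://${STACK_HOST}:9091/-/ready
http://${STACK_HOST}:3000/api/health"

check_endpoint() {
  # Succeeds if the URL answers with a 2xx status within 5 seconds.
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}

if [ "${RUN_CHECKS:-0}" = "1" ]; then
  for url in $ENDPOINTS; do
    if check_endpoint "$url"; then
      echo "ready:       $url"
    else
      echo "unreachable: $url"
    fi
  done
else
  echo "Set RUN_CHECKS=1 to probe:"
  echo "$ENDPOINTS"
fi
```

When running the check from the VM or from another machine, set `STACK_HOST` to the address the Databricks cluster will use to reach the services.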

Setup

You will need a Databricks Personal Access Token (PAT) to run the demo. Once you have the token, either create a profile with the Databricks CLI or configure the Terraform provider explicitly (using the PAT or any other supported form of authentication).
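
As a sketch of the profile-based option, the snippet below writes a profile to a throwaway config file and points the Databricks tooling at it through the `DATABRICKS_CONFIG_FILE` and `DATABRICKS_CONFIG_PROFILE` environment variables. The profile name `observability-demo`, the workspace URL and the token value are placeholders, and the file path is chosen only to avoid touching an existing `~/.databrickscfg`.

```shell
CFG_FILE="${CFG_FILE:-./databrickscfg.demo}"   # hypothetical path
PROFILE="observability-demo"                   # hypothetical profile name

# Write a minimal profile; replace host and token with real values.
cat > "$CFG_FILE" <<EOF
[${PROFILE}]
host = https://example.cloud.databricks.com
token = dapiREPLACE_ME
EOF
chmod 600 "$CFG_FILE"   # the file holds a secret

# The Databricks CLI and the Terraform provider read these variables.
export DATABRICKS_CONFIG_FILE="$CFG_FILE"
export DATABRICKS_CONFIG_PROFILE="$PROFILE"

echo "Using profile '$DATABRICKS_CONFIG_PROFILE' from $DATABRICKS_CONFIG_FILE"
```

Keeping the token in a profile rather than in the Terraform files makes it less likely to end up in version control.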

Usage

The Terraform setup has only two variables that need to be set. You can provide them through environment variables (or through a variables file), replacing the placeholders with your actual values:

export TF_VAR_prometheus_pushgateway_host={pushgateway_host}:9091
export TF_VAR_pyroscope_host={pyroscope_host}:4040
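
For the file-based alternative, one option is to write the same two values into a `*.auto.tfvars` file, which Terraform loads automatically from the working directory. In this sketch the file name `demo.auto.tfvars` and the `10.0.0.4` address are placeholders; the variable names are derived from the `TF_VAR_` names above.

```shell
MONITORING_HOST="${MONITORING_HOST:-10.0.0.4}"   # placeholder VM address
TFVARS_FILE="${TFVARS_FILE:-demo.auto.tfvars}"   # hypothetical file name

# Terraform automatically loads *.auto.tfvars from the working directory.
cat > "$TFVARS_FILE" <<EOF
prometheus_pushgateway_host = "${MONITORING_HOST}:9091"
pyroscope_host              = "${MONITORING_HOST}:4040"
EOF

cat "$TFVARS_FILE"

# Then, from the directory containing the Terraform configuration:
# terraform init
# terraform apply
```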

Prometheus Demo

Once everything is configured, you'll be able to see all relevant metrics in Grafana. If you use tagging, you can also filter by cluster, job, and other tags.

The example below shows how to configure the basic dashboard to show job metrics over time.

[Screenshot: Prometheus Demo]

Pyroscope Demo

If everything is set up correctly, you should end up with a profile like the one below. The example profiles a Spark application that is bottlenecked by reading LZW-compressed files and by using regular expressions to process the data.

[Screenshot: Pyroscope Demo]

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Project Link: https://github.com/rayalex/spark-databricks-observability-demo

Thanks to

Special thanks to these, as without them this would not be possible:
