CI/CD with Databricks

Continuous Integration and Continuous Deployment (CI/CD) is a process used to deliver software applications to end-users in a faster, efficient, and reliable manner. CI/CD automates the building, testing, and deployment process, which reduces the time between committing the code changes and releasing it to production.

Databricks is a cloud-based data engineering and analytics platform that provides a collaborative environment for building, managing, and deploying data pipelines, machine learning models, and analytics applications. It also provides an API to programmatically interact with the Databricks workspace.

In this repository, we'll discuss how to set up CI/CD pipeline with Databricks using GitHub Actions, Databricks CLI, and Databricks REST API.

Prerequisites

  • A GitHub account.
  • A Databricks account with an active workspace.
  • Databricks CLI installed on your local machine.
  • An Azure Blob storage account to store build artifacts (Optional).

Setup

Create a Databricks token

  • Go to your Databricks workspace.
  • Click on the user icon in the top-right corner and select "User Settings".
  • Click on the "Access Tokens" tab.
  • Click on the "Generate New Token" button.
  • Give a name to the token, select the "Manage" permission, and click on the "Generate" button.
  • Copy the generated token.

Setup github configuration with Databricks

  • Navigate to Github --> Settings --> Developer Settings --> Personal access tokens
  • Generate new token (classic) --> Enter password --> Enter note
  • Select repo and workflow scope --> Submit --> Copy token
  • Login to Databricks --> User Settings --> Git Integration
  • Git provider (Github) --> Git provider username or email (your git credentials) --> Token

Configure github actions (CICD)

Configure secrets

  • Go to Settings--> Secrets and Variables
    • Actions
    • DATABRICKS_HOST_PRD: https://< databricks_workspace_name >.com/
    • DATABRICKS_TOKEN_PRD: < personal access token generated in databricks >

Action time!

  • Go to Databricks Repos and clone your repository.
  • Checkout branch- feature/< username >
  • There are few code level changes required before one raises pull request.
  • Update the bronze unit test- test_load_data_into_bronze and set expected_num_files = 2
  • Update the silver unit tests and add a new assertion. You are free to write any test case.
  • Review the integration_suite test which uses Files in Repos feature and fill in the missing elements.
    • src/main/python/gold/gold_layer_etl.py
    • src/main/tests/integration_suite/test_integration_gold_layer_etl
  • Review the dbx deployment file
    • Open cicd_with_databricks/deployment/deploy-job.yaml and update notebook_path variable to your Databricks Repos location.
  • Review files under .github/workflow/ to understand the CICD plan.
  • Create a pull request and view the CICD unit testing job that spins up in github → actions.
  • Once it succeeds, merge the pull request into develop branch and view the CICD integration testing job that spins up.
  • Once integration tests are completed on develop branch, raise a PR from develop branch into main. View the CICD job that spins up, runs unit & integration tests.
  • Once it succeeds, merge the pull request into develop branch and view the CICD job that creates Databricks workflow jobs and launches them.