dbt with Databricks

In this tutorial, we create transformations with dbt on data stored in Databricks. The full article is linked in the Reference section below.

0. Prereqs

Make sure you have a Databricks account and a cluster up and running.

1. Local dbt setup

# 1. Clone the repo

# 2. Install Pipenv
$ pip3 install --user pipenv

# 3. Update the Python version in the Pipfile to match your system, then run
$ pipenv --python <YOUR_LOCAL_PYTHON_VERSION>

# 4. Install deps
$ pipenv install

# 5. Activate pipenv
$ pipenv shell

# 6. Confirm your Python version
(dbt-databricks) $ python --version

# 7. Confirm your dbt version
(dbt-databricks) $ dbt --version
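
For reference, a minimal Pipfile for this setup might look like the sketch below; the dbt-databricks requirement and the Python version are illustrative, not pinned by the repo.

# Pipfile -- illustrative; check the repo's Pipfile for the real contents
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
dbt-databricks = "*"

[requires]
python_version = "3.9"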

2. Prepare seed data in Databricks

To create some real transformations, we need to provide seed (raw) data to dbt. We'll manually create a few raw tables in Databricks using the scripts in the databricks_scripts directory. You can run these scripts directly in a Databricks notebook attached to an active cluster.
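
As an illustration, a seed script run in a notebook SQL cell might look like the following; the table name and columns here are hypothetical, so use the actual scripts from databricks_scripts.

-- Hypothetical seed table; the real scripts in databricks_scripts
-- define the actual names and columns.
CREATE TABLE IF NOT EXISTS zzz_game_scores (
  game_id INT,
  home_team STRING,
  visitor_team STRING,
  home_score INT,
  visitor_score INT
);

INSERT INTO zzz_game_scores VALUES
  (1, 'Amsterdam', 'San Francisco', 3, 2);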

3. Create dbt models

Once the seed data has been created in Databricks, we can start creating dbt models -- zzz_game_details.sql and zzz_win_loss_records.sql. You can then execute these models:

# 1. cd into the dbt project
(dbt-databricks) $ cd my_dbt_demo

# 2. Run the models
(dbt-databricks) $ dbt run --select models/zzz_game_details.sql models/zzz_win_loss_records.sql

This creates two new tables in your Databricks workspace, in the schema configured in your dbt profile.
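
For orientation, a dbt model selects from upstream tables and lets dbt materialize the result. The sketch below is illustrative, not the repo's actual zzz_game_details.sql; the raw table is referenced directly by name because it was created manually in Databricks (declaring it as a dbt source would be the more idiomatic option).

-- models/zzz_game_details.sql -- hypothetical sketch
{{ config(materialized='table') }}

select
  game_id,
  home_team,
  visitor_team,
  case
    when home_score > visitor_score then home_team
    else visitor_team
  end as winner
from zzz_game_scores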

4. Generate dbt docs

(dbt-databricks) $ dbt docs generate
(dbt-databricks) $ dbt docs serve

# This serves the autogenerated docs at http://localhost:8080 and opens them in a browser tab.

5. Test models

(dbt-databricks) $ dbt test
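
dbt discovers tests (and the column descriptions shown in the generated docs) from YAML files in the models directory. A minimal sketch, with model and column names assumed from the models above rather than taken from the repo:

# models/schema.yml -- hypothetical; the repo's own YAML defines
# the real descriptions and tests.
version: 2

models:
  - name: zzz_game_details
    description: "One row per game with the derived winner."
    columns:
      - name: game_id
        description: "Unique game identifier."
        tests:
          - unique
          - not_null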

(optional) Create a new dbt project from scratch

# 1. From the virtual env, run
$ dbt init my_new_dbt_project

# 2. Fill in your Databricks credentials when prompted

# 3. cd into your project and test connection
$ cd my_new_dbt_project
$ dbt debug

# 4. Create tables
$ dbt run
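
The credentials from step 2 are written to ~/.dbt/profiles.yml. A minimal sketch for the dbt-databricks adapter follows; the host, HTTP path, and token values are placeholders you must replace with your own workspace details.

# ~/.dbt/profiles.yml -- values below are placeholders
my_new_dbt_project:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: default
      host: <YOUR_WORKSPACE>.cloud.databricks.com
      http_path: <YOUR_CLUSTER_OR_WAREHOUSE_HTTP_PATH>
      token: <YOUR_PERSONAL_ACCESS_TOKEN>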

Reference

dbt with Databricks