Welcome to the Inherited Project Refactoring workshop of Coalesce 2022!
You've just started working at a new job and they've been using dbt to transform their data (YES!). However, once you've made it into their project, you realize that you can't make heads or tails of their data flows. Let's be honest - it's a messy project and you desperately want to form a plan for cleaning it up.
dbt's already got some great material out there regarding refactoring (if you want to start with hands-on-code refactoring, you can start with the refactoring course), but the road to get from where you are to where you need to be can be hard navigate unless you've traveled it numerous times. This refactoring workshop focuses on the roadmap for refactoring, and is the perfect start to getting that extra practice in!
Part of Christine and Lauren's day-to-day is taming those crazy projects - they've created this workshop to teach you the art of planning, show you some tips and tricks, and give you some leveling-up advice for wrangling those DAGs!
Skillset
This workshop assumes that you're familiar with dbt. At a minimum you should know how to:
- Generate documentation
- Run commands and compile code
- Create branches
Resources
-
A Repository
Ideally, with the files and folders contained in this workshop. To make a copy, fork this repository. -
dbt
Using dbt Cloud vs. dbt Core doesn't matter. You'll specifically want to know how to:- install packages
- generate and view documentation
- use selection syntax
- upgrade your dbt version, if needed (This project uses v1.3)
To setup dbt:
-
Some Data
This project is written on top of BigQuery and uses the publicly available TPC-H data set. A truncated version of the data set has been included in this project as CSV files, located in the _resources folder.If you don't have some data or a warehouse yet, don't worry - the setup will guide you through setting up a free BigQuery account and loading the data for this project. Here are some resources to reference, just in case:
- Instructions for setting up a free BigQuery account
- Instructions for loading CSV files into BigQuery
- Starter instructions for accessing the TPC-H dataset yourself
Note:
We don't suggest seeding the CSV files. Though they are truncated, they still contain a significant amount of rows.
-
Set up your dbt Project
Important
If you don't set up the BigQuery account and want to use another warehouse:- You'll need a warehouse - the warehouse is an essential connection in dbt.
- You'll need to load the data to your selected warehouse using another method.
- You'll need to make changes to the repository code you forked so the syntax works with your warehouse.
-
Load the data
Download the files from the
_resources/tpch_dataset
. If you are working locally, the files will be within the repository location on your computer.-
If you set up a BigQuery account during setup, load the data:
- In the BigQuery UI's
Explorer
pane, click the three dots next to your project name - Click
Create dataset
. - For
Dataset ID
, typeraw_tpch
. - Click
Create dataset
- You should now see your dataset listed under your project name. Click the three dots next to the dataset.
- Click
Create table
- Choose
Upload
as the Create table from option. - Click
Browse
underSelect file
- Upload each file you downloaded from the
_resources/tpch_dataset
folder:- For the table name, use the file name without the extension. Some file names have
_100mb
appended. Omit this. - Make sure to check
Auto detect
under Schema
- For the table name, use the file name without the extension. Some file names have
- In the BigQuery UI's
-
If you didn't setup BigQuery, load the data from the
_resources/tpch_dataset
into your warehouse.
You will need to update the_sources.yml
file with the location of your data.
-
-
Run
dbt deps
to install dependencies. -
Confirm your setup:
Try running the following commands:$ dbt run $ dbt test
or alternatively:
$ dbt build
Whoa... whoa there! You can't just go slinging at the DAG like that. Here's a walkthrough to get you trained up!
- Learn more about dbt in the docs
- Check out Discourse for commonly asked questions and answers
- Join the dbt community to learn from other analytics engineers
- Find dbt events near you
- Check out the blog for the latest news on dbt's development and best practices