Maximizing Confidence in Your Data Model Changes with dbt and PipeRider
This project was created to accompany the PipeRider + dbt workshop on improving your code review for dbt projects.
This workshop project will run you through the following steps:
PipeRider Walkthrough
- Initialize PipeRider inside a dbt project
- Run PipeRider to create a data report
- Compare data reports
- Use a compare recipe
- Define dbt metrics and view in report
- Compare dbt metrics
Prequisites
- Ideally, you have completed the Week 4 module on Analytics Engineering of the DataTalksClub Data Engineering Zoomcamp
- A basic understanding of dbt
- Install, or update to, DuckDB 0.7.0
Workshop Steps
1. Initial setup
-
Fork this repo
-
Clone your forked repo
git clone <your-repo-url> cd taxi_rides_ny_duckdb
-
Download the DuckDB database file
wget https://dtc-workshop.s3.ap-northeast-1.amazonaws.com/nyc_taxi.duckdb
-
Set up a new venv
python -m venv ./venv source ./venv/bin/activate
-
Update pip and install the neccessary dbt packages and PipeRider
pip install -U pip pip install dbt-core dbt-duckdb 'piperider[duckdb]'
-
Create a new branch to work on
git switch -c data-modeling
-
Install dbt deps and build dbt models
dbt deps dbt build
-
Initialize PipeRider
piperider init
-
Check PipeRider settings
piperider diagnose
2. Run PipeRider and data model changes
-
Run PipeRider
piperider run
PipeRider will profile the database and output the path to your data report, e.g.
Generating reports from: /project/path/.piperider/outputs/latest/run.json Report generated in /project/path/.piperider/outputs/latest/index.html
View the HTML report to see the full statistical report of your data source.
-
Make data model changes (move statistics to their own model)
a. Create a new model
models/core/dm_monthly_zone_statistics.sql
{{ config(materialized='table') }} with trips_data as ( select * from {{ ref('fact_trips') }} ) select -- Reveneue grouping pickup_zone as revenue_zone, date_trunc('month', pickup_datetime) as revenue_month, --Note: For BQ use instead: date_trunc(pickup_datetime, month) as revenue_month, service_type, -- Additional calculations count(tripid) as total_monthly_trips, avg(passenger_count) as avg_montly_passenger_count, avg(trip_distance) as avg_montly_trip_distance from trips_data group by 1,2,3
b. Comment out lines 26-28 of
models/core/dm_monthly_zone_revenue.sql
-- Additional calculations -- count(tripid) as total_monthly_trips, -- avg(passenger_count) as avg_montly_passenger_count, -- avg(trip_distance) as avg_montly_trip_distance
-
Rebuild the dbt models
dbt build
-
Run PipeRider again to generate the second data report with the new models
piperider run
-
Use the
compare-reports
function to compare the data profile reportspiperider compare-reports --last
The
compare-reports
outputs two files:- Comparison report: An HTML report comparing the two data profiles
- Comparison summary: A Markdown file with a summary of changes.
The comparison summary markdown is used to insert into a pull request (PR) comment in a later step.
-
Commit your changes and push your branch
git add . git commit -m "Added statistics model, updated revenue model" git push origin datamodeling
-
Create a pull request.
a. Visit your repo on GitHub and clck
Compare & pull request
b. Copy the contents of the comparison summary Markdown file into your pull request comment box
c. Click
preview
to see how the comparison looksd. Click
Create pull request
to submit your changes
3. PipeRider Compare Recipe
In the above example we used the compare-reports
command. PipeRider also has a separate compare
command that uses the concept of compare 'recipes'. Recipes are a powerful way to define the specifics of how the compare will run, such as:
- The branches to compare
- The datasource/target to compare
- The dbt commands to run prior to the compare
When PipeRider is initialized a default compare recipe is created. For our project this looks like:
base:
branch: main
dbt:
commands:
- dbt deps
- dbt build
piperider:
command: piperider run
target:
branch: data-modeling
dbt:
commands:
- dbt deps
- dbt build
piperider:
command: piperider run
Run the following command to run the above recipe:
piperider compare
As per the recipe, PipeRider will automatically do the following:
- Check out the
main
branch - Build the models
- Run PipeRider
- Check out the
data-modeling
branch - Build the models
- Run PipeRider
- Compare the data reports of
main
anddata-modeling
- Output the compare report and summary
4. dbt-defined Metrics
PipeRider also supports profiling dbt-defined metrics. PipeRider will query dbt metrics and include them in the HTML report.
-
Edit
models/core/schema.yml
and add the following code:metrics: - name: average_distance label: Average Distance model: ref('fact_trips') description: "The average trip distance" calculation_method: average expression: trip_distance timestamp: pickup_datetime time_grains: [month, quarter, year] tags: - piperider
Important: Don't forget the
piperider
tag, this is how PipeRider is able to find and query your project metricsThis defines a new metric on the
fact_trips
table that calculates the averagetrip_distance
distance at thetime_grains
of month, quarter, and year. -
Run dbt compile
dbt compile
Note: As we’re only adding metrics, it is not necessary to build the models again with
dbt build
. -
Run PipeRider to generate a new report.
piperider run
-
Check the PipeRider report and click the
metrics
tab to view the metrics graph.
5. Filter and compare dbt metrics
PipeRider also supports comparing metrics between runs. The comarison is visualized in the Comparison report and included in the comparison summary Markdown.
-
Edit
models/core/schema.yml
again and add the followingfilter
to the metric definition:filters: - field: pickup_borough operator: '=' value: "'Manhattan'" - field: dropoff_borough operator: '=' value: "'Manhattan'"
The filter will modify the metric to only apply to rows that meet the defined conditions - In this case, that the pickup and dropoff borough should be Manhattan.
Your modified metric definition should now look like this:
metrics: - name: average_distance label: Average Distance model: ref('fact_trips') description: "The average trip distance" calculation_method: average expression: trip_distance timestamp: pickup_datetime time_grains: [month, quarter, year] filters: - field: pickup_borough operator: '=' value: "'Manhattan'" - field: dropoff_borough operator: '=' value: "'Manhattan'" tags: - piperider
-
Compile your dbt project again.
dbt compile
-
Run PipeRider again to generate a report with the new, filttered, metrics.
piperider run
-
Lastly, run the PipeRider
compare-reports
command to create a comparison report that will include the two, differently defined, metrics.piperider compare-reports --last
-
View the newly generated comparison report to see how the metrics compare.
-
The comparison summary also contains a summary of the metric differences between reports.
PipeRider resources:
- Learn more about PipeRider in the docs
- Visit the PipeRider homepage
- Join the PipeRider Discord for help and discussion
- Read the PipeRider blog for articles about PipeRider