taxi_rides_ny_duckdb: A repository from bruceewue

Maximizing Confidence in Your Data Model Changes with dbt and PipeRider

This project was created to accompany the PipeRider + dbt workshop on improving your code review for dbt projects.

This workshop project will run you through the following steps:

PipeRider Walkthrough

Initialize PipeRider inside a dbt project
Run PipeRider to create a data report
Compare data reports
Use a compare recipe
Define dbt metrics and view in report
Compare dbt metrics

Prequisites

Ideally, you have completed the Week 4 module on Analytics Engineering of the DataTalksClub Data Engineering Zoomcamp
A basic understanding of dbt
Install, or update to, DuckDB 0.7.0

Workshop Steps

1. Initial setup

Fork this repo

Clone your forked repo

git clone <your-repo-url>
cd taxi_rides_ny_duckdb

Download the DuckDB database file

wget https://dtc-workshop.s3.ap-northeast-1.amazonaws.com/nyc_taxi.duckdb

Set up a new venv

python -m venv ./venv
source ./venv/bin/activate

Update pip and install the neccessary dbt packages and PipeRider

pip install -U pip
pip install dbt-core dbt-duckdb 'piperider[duckdb]'

Create a new branch to work on
```
git switch -c data-modeling
```
Install dbt deps and build dbt models
```
dbt deps
dbt build
```
Initialize PipeRider
```
piperider init
```
Check PipeRider settings
```
piperider diagnose
```

2. Run PipeRider and data model changes

Run PipeRider
```
piperider run
```
PipeRider will profile the database and output the path to your data report, e.g.
```
Generating reports from: /project/path/.piperider/outputs/latest/run.json
Report generated in /project/path/.piperider/outputs/latest/index.html
```
View the HTML report to see the full statistical report of your data source.

Make data model changes (move statistics to their own model)

a. Create a new model models/core/dm_monthly_zone_statistics.sql

{{ config(materialized='table') }}

with trips_data as (
select * from {{ ref('fact_trips') }}
)
select
-- Reveneue grouping
pickup_zone as revenue_zone,
date_trunc('month', pickup_datetime) as revenue_month,
--Note: For BQ use instead: date_trunc(pickup_datetime, month) as revenue_month,

service_type,

-- Additional calculations
count(tripid) as total_monthly_trips,
avg(passenger_count) as avg_montly_passenger_count,
avg(trip_distance) as avg_montly_trip_distance

from trips_data
group by 1,2,3

b. Comment out lines 26-28 of models/core/dm_monthly_zone_revenue.sql

-- Additional calculations
-- count(tripid) as total_monthly_trips,
-- avg(passenger_count) as avg_montly_passenger_count,
-- avg(trip_distance) as avg_montly_trip_distance

Rebuild the dbt models
```
dbt build
```
Run PipeRider again to generate the second data report with the new models
```
piperider run
```
Use the compare-reports function to compare the data profile reports
```
piperider compare-reports --last
```
The compare-reports outputs two files:
- Comparison report: An HTML report comparing the two data profiles
- Comparison summary: A Markdown file with a summary of changes.
The comparison summary markdown is used to insert into a pull request (PR) comment in a later step.

Commit your changes and push your branch

git add .
git commit -m "Added statistics model, updated revenue model"
git push origin datamodeling

Create a pull request.

a. Visit your repo on GitHub and clck Compare & pull request

b. Copy the contents of the comparison summary Markdown file into your pull request comment box

c. Click preview to see how the comparison looks

d. Click Create pull request to submit your changes

3. PipeRider Compare Recipe

In the above example we used the compare-reports command. PipeRider also has a separate compare command that uses the concept of compare 'recipes'. Recipes are a powerful way to define the specifics of how the compare will run, such as:

The branches to compare
The datasource/target to compare
The dbt commands to run prior to the compare

When PipeRider is initialized a default compare recipe is created. For our project this looks like:

base:
  branch: main
  dbt:
    commands:
    - dbt deps
    - dbt build
  piperider:
    command: piperider run
target:
  branch: data-modeling
  dbt:
    commands:
    - dbt deps
    - dbt build
  piperider:
    command: piperider run

Run the following command to run the above recipe:

piperider compare

As per the recipe, PipeRider will automatically do the following:

Check out the main branch
Build the models
Run PipeRider
Check out the data-modeling branch
Build the models
Run PipeRider
Compare the data reports of main and data-modeling
Output the compare report and summary

4. dbt-defined Metrics

PipeRider also supports profiling dbt-defined metrics. PipeRider will query dbt metrics and include them in the HTML report.

Edit models/core/schema.yml and add the following code:

metrics:
  - name: average_distance
    label: Average Distance
    model: ref('fact_trips')
    description: "The average trip distance"

    calculation_method: average
    expression: trip_distance

    timestamp: pickup_datetime
    time_grains: [month, quarter, year]

    tags:
    - piperider

Important: Don't forget the piperider tag, this is how PipeRider is able to find and query your project metrics

This defines a new metric on the fact_trips table that calculates the average trip_distance distance at the time_grains of month, quarter, and year.

Run dbt compile
```
dbt compile
```
Note: As we’re only adding metrics, it is not necessary to build the models again with dbt build.
Run PipeRider to generate a new report.
```
piperider run
```
Check the PipeRider report and click the metrics tab to view the metrics graph.

5. Filter and compare dbt metrics

PipeRider also supports comparing metrics between runs. The comarison is visualized in the Comparison report and included in the comparison summary Markdown.

Edit models/core/schema.yml again and add the following filter to the metric definition:

filters:
  - field: pickup_borough
    operator: '='
    value: "'Manhattan'"
  - field: dropoff_borough
    operator: '='
    value: "'Manhattan'"

The filter will modify the metric to only apply to rows that meet the defined conditions - In this case, that the pickup and dropoff borough should be Manhattan.

Your modified metric definition should now look like this:

metrics:
  - name: average_distance
    label: Average Distance
    model: ref('fact_trips')
    description: "The average trip distance"

    calculation_method: average
    expression: trip_distance

    timestamp: pickup_datetime
    time_grains: [month, quarter, year]

    filters:
      - field: pickup_borough
        operator: '='
        value: "'Manhattan'"
      - field: dropoff_borough
        operator: '='
        value: "'Manhattan'"

    tags:
    - piperider

Compile your dbt project again.
```
dbt compile
```
Run PipeRider again to generate a report with the new, filttered, metrics.
```
piperider run
```
Lastly, run the PipeRider compare-reports command to create a comparison report that will include the two, differently defined, metrics.
```
piperider compare-reports --last
```
View the newly generated comparison report to see how the metrics compare.
The comparison summary also contains a summary of the metric differences between reports.

PipeRider resources:

Learn more about PipeRider in the docs
Visit the PipeRider homepage
Join the PipeRider Discord for help and discussion
Read the PipeRider blog for articles about PipeRider

bruceewue/taxi_rides_ny_duckdb