/snapshot-slowly-changing-dimensions-dbt-example

Example of how to use Snapshots in DBT

GNU General Public License v3.0GPL-3.0

snapshot-slowly-changing-dimensions-dbt-example

This repo shows how to maintain "type-2 Slowly Changing Dimensions" snapshots of a mutable table using DBT. What this is used for is to provide a "look back in time" view of a table which is updated in place (i.e. new data replaces old data).

The data warehouse I'm using for this example is google BigQuery.

# install dependencies #
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install dbt-core dbt-bigquery
-- create bigquery dataset --
CREATE SCHEMA `your-gcp-project-id.dbt_snapshot_example`
OPTIONS (
  location = 'europe-west2'
)
;

'timestamp' snapshot strategy

The timestamp strategy uses an updated_at field to determine if a row has changed. If the configured updated_at column for a row is more recent than the last time the snapshot ran, then dbt will invalidate the old record and record the new one. If the timestamps are unchanged, then dbt will not take any action.

First, I create a table called users, and insert a single user with status 'active'

CREATE TABLE `your-gcp-project-id.dbt_snapshot_example.users` (
user_name STRING,
user_status STRING,
row_updated_at TIMESTAMP
);

INSERT INTO `your-gcp-project-id.dbt_snapshot_example.users`
(user_name, user_status, row_updated_at)
VALUES ('joe', 'active', CURRENT_TIMESTAMP())
;

SELECT  *
FROM    `your-gcp-project-id.dbt_snapshot_example.users`
;
user_name user_status row_updated_at
joe active 2024-09-01 12:43:36 UTC

I take a snapshot of the users table - the first time the dbt snapshot command is run, it just creates a copy of the users table (with some additional columns):

dbt snapshot --select users_snapshot_timestamp_strategy

Here is how the snapshot table looks:

SELECT  *
FROM    `your-gcp-project-id.dbt_snapshot_example.users_snapshot_timestamp_strategy`
;
user_name user_status row_updated_at dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
joe active 2024-09-01 12:43:36 UTC 7434bb97e2fb34e2c0b481959d70d155 2024-09-01 12:43:36 UTC 2024-09-01 12:43:36 UTC

Now, I change the status of the user 'joe' in the users table to 'dormant' by overwriting the value in the user_status column.

UPDATE  `your-gcp-project-id.dbt_snapshot_example.users`
SET     user_status = 'dormant'
      , row_updated_at = CURRENT_TIMESTAMP()
WHERE   user_name = 'joe'
;

SELECT  *
FROM    `your-gcp-project-id.dbt_snapshot_example.users`
;
user_name user_status row_updated_at
joe dormant 2024-09-01 12:57:10 UTC

Running dbt snapshot again appends a new row to the snapshot table, recording this row change:

dbt snapshot --select users_snapshot_timestamp_strategy
SELECT  *
FROM    `your-gcp-project-id.dbt_snapshot_example.users_snapshot_timestamp_strategy`
;
user_name user_status row_updated_at dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
joe dormant 2024-09-01 12:57:10 UTC 27d047306121e541ed7a8a82b7d7758e 2024-09-01 12:57:10 UTC 2024-09-01 12:57:10 UTC null
joe active 2024-09-01 12:43:36 UTC 7434bb97e2fb34e2c0b481959d70d155 2024-09-01 12:43:36 UTC 2024-09-01 12:43:36 UTC 2024-09-01 12:57:10 UTC

'check' snapshot strategy

The check strategy is useful for tables which do not have a reliable updated_at column. This strategy works by comparing a list of columns between their current and historical values. If any of these columns have changed, then dbt will invalidate the old record and record the new one. If the column values are identical, then dbt will not take any action.

DROP TABLE IF EXISTS `your-gcp-project-id.dbt_snapshot_example.users`
;

CREATE TABLE `your-gcp-project-id.dbt_snapshot_example.users` (
user_name STRING,
user_status STRING,
);

INSERT INTO `your-gcp-project-id.dbt_snapshot_example.users`
(user_name, user_status)
VALUES ('joe', 'active')
;

SELECT  *
FROM    `your-gcp-project-id.dbt_snapshot_example.users`
;
user_name user_status
joe active
dbt snapshot --select users_snapshot_check_strategy
SELECT  *
FROM    `your-gcp-project-id.dbt_snapshot_example.users_snapshot_check_strategy`
;
user_name user_status dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
joe active 5dcf080ad5fc313f0781899e7d9ec442 2024-09-01 13:44:51 UTC 2024-09-01 13:44:51 UTC null
UPDATE `your-gcp-project-id.dbt_snapshot_example.users`
SET   user_status = 'dormant'
WHERE user_name = 'joe'
;

SELECT * FROM `your-gcp-project-id.dbt_snapshot_example.users`
;
user_name user_status
joe dormant
dbt snapshot --select users_snapshot_check_strategy
SELECT  *
FROM    `your-gcp-project-id.dbt_snapshot_example.users_snapshot_check_strategy`
;
user_name user_status dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
joe active 5dcf080ad5fc313f0781899e7d9ec442 2024-09-01 13:44:51 UTC 2024-09-01 13:44:51 UTC 2024-09-01 13:47:43 UTC
joe dormant 3928a06a330872968ed3aa305625b730 2024-09-01 13:47:43 UTC 2024-09-01 13:47:43 UTC null