dbt-labs/dbt-core

[Feature] No more jinja block for snapshots - new snapshot design ideas

graciegoheen opened this issue · 5 comments

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Current State

To configure a snapshot currently, you must nest your configuration and SQL within a snapshot jinja block like so:

{% snapshot orders_snapshot %}

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

Why? (you might ask)

The story begins…

Snapshots are a really ancient dbt feature -- implemented as dbt archive in #183 and first released in 0.5.1, just two days shy of dbt's 6 month anniversary.

There were no snapshot blocks or snapshots/*.sql files in those early days.

Instead, they were originally declared within dbt_project.yml like this:

archive:
    - source_schema: synced_production_data  # schema to look for tables in (declared below)
      target_schema: dbt_archive             # where to archive the data to
      tables:
        - source_table: some_other_table
          target_table: some_other_table_archive
          updated_at: "updatedAt"
          unique_key: "id"

A glow up

#1175 and #1361 allowed snapshots to escape YAML Land and become:

select statements, defined within a snapshot block in a .sql file

{% snapshot your_snapshot_name_here %}

{{ config(
    target_database='<optional database name>',
    target_schema='<schema name>',
    target_table='<table name>',
    strategy='check',
    unique_key='id',
    check_cols=['object_status', 'object_name'],
) }}

-- your select statement here

{% endsnapshot %}

At the time the thought was, “we should/will reimplement all the resources like this” (so that you could define multiple “model blocks” in a single file).

Turns out that defining multiple resources in one file makes

  • parsing slower
  • IDE interactive compile/preview much harder

and so in the lead-up to v1.0 this rework wasn't a priority, and we eventually decided it wasn't really even desirable.

Future State Ideas

...but what is desirable?

Option 1: Snapshots are just yml configs, they contain no logic (like exposures, sources, tests, etc.)

# snapshots/my_snapshots.yml
snapshots:
  - name: orders_snapshot
    config:
      schema: snapshots
      database: analytics
    source: source('jaffle_shop', 'orders') # could also be a model
    unique_key: id
    strategy: timestamp
    updated_at: updated_at
  • transformation is just select * from {{ source('jaffle_shop', 'orders') }}
    • best practice is “staging” layer on top of your snapshots for cleanup (could even be an ephemeral model if you don’t want clutter in your warehouse)
    • exceptions where you’d want to do light transformation on your source, then snapshot (or bake logic into snapshot):
      • types of logic: filters, deduplication, surrogate → unique key, etc.
      • best practice → snapshot an ephemeral model that contains the logic
      • this method also helps with development, because you can run the underlying query directly rather than running the snapshot every time
  • “snapshots are a config” on a model (like tests, etc.)
  • [potential] snapshots could be defined in yml in the models folder (like sources) to allow for better organization (currently siloed in the snapshots folder)
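The "snapshot an ephemeral model" pattern from the bullets above might look something like this. This is just a sketch: the file name and the dedup logic are hypothetical, and `qualify` is warehouse-specific (Snowflake, BigQuery, etc.):

```sql
-- models/stg_orders_for_snapshot.sql (hypothetical file name)
-- Ephemeral model that holds the light transformation logic,
-- so the snapshot itself can stay a plain select.
{{ config(materialized='ephemeral') }}

select *
from {{ source('jaffle_shop', 'orders') }}
-- e.g. deduplicate on the unique key, keeping the latest record
qualify row_number() over (partition by id order by updated_at desc) = 1
```

Under Option 1 the snapshot's yml would then point at `ref('stg_orders_for_snapshot')` instead of the raw source, and during development you can compile/preview the ephemeral model's query without actually running the snapshot.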

Option 2: Snapshots are just models, they are a materialization (like incremental, view, table, etc.)

# models/my_snapshot.sql

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at',
      materialized='snapshot'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}
  • if snapshots are just models, would they now be included in dbt run? vs. dbt snapshot?
  • how would we handle snapshot-paths? would you be able to put them in your models folder? or only snapshots folder?
  • what would happen to their resource_type - model or snapshot (incremental is model)? if we went with model, that would be an id change, adjustment to selector syntax for dbt build, DBT_EXCLUDE_RESOURCE_TYPE wouldn’t work for excluding snapshots, etc.
  • this would be the only (?) model materialization where dbt injects its own meta-fields into the final table
  • what would a migration path from one materialization to another look like? table -> snapshot? incremental -> snapshot? etc?
  • materialized='snapshot' or materialized='scd'

Option 3: Snapshots are just sql files in the snapshots folder, but they don’t use jinja blocks (one .sql file per snapshot)

# snapshots/my_snapshot.sql

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}
  • snapshot name is same as .sql file name, just like models

Which is best?

  • I think the choice between the options above comes down to:
    • how strongly do we believe in the “snapshots should only be select *” best practice
    • how much do we care about migration paths
    • what other things would we need to change (commands, configs, etc.)
    • what’s the technical lift for each of these (can we map the new → old)
|  | Option 1: yml configs | Option 2: models | Option 3: sql files |
| --- | --- | --- | --- |
| enforce best practice that a snapshot is just `select *` | X |  |  |
| can have SQL logic being snapshotted | X (snapshot an ephemeral model with the logic) | X | X |
| easy migration path for users (does this matter if we keep prior art for backwards compatibility?) |  | X | X |
| mapping new to old (define multiple snapshots in 1 file, etc.) | X |  |  |
| minimal consequences of other things we’d need to change / consider (commands, configs, etc.) | X |  | X |
| define your snapshots in your models directory | X | X |  |

Notes

  • we will keep the prior art, for backwards compatibility
  • should snapshots respect the generate_schema_name and generate_database_name macros?
    • remove target_database and target_schema
  • should clone work for snapshots?
  • how do we provide a way to test your snapshot configuration before production-izing your code?
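For reference on the generate_schema_name question above: dbt's built-in default looks roughly like this (paraphrased from the dbt docs on custom schemas), and the question is whether a snapshot's `schema` config should flow through it rather than being taken literally from `target_schema`:

```sql
-- dbt's default generate_schema_name macro (roughly, per the dbt docs);
-- a project can override it by defining its own version in macros/
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```

Today a snapshot's `target_schema` is used as-is rather than routed through this macro, which is presumably the motivation for the "remove target_database and target_schema" sub-bullet.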

Related issues

#4761
#9033

Personally I like option 2! As far as the questions on that one, here are my thoughts:

  • if snapshots are just models, would they now be included in dbt run? vs. dbt snapshot?
    No, they still function like snapshots because the nature of snapshots needs to stay intentional. With an incremental (as long as you set it up correctly), you are only adding the delta or updating, which will happen at some point or another anyways. However, the use of a snapshot is literally just like a picture! Folks tend to want the "clean" version many more times than they want the blurry shots (i.e, once a day).

  • how would we handle snapshot-paths? would you be able to put them in your models folder? or only snapshots folder?
    I have had customers who had a desire to configure snapshot models within the models folder. I personally don't like the thought of too much flexibility on that - i.e, you could have snapshot sql files right next to your every-day any-time models. However, I do like the idea of having them next to the models in a defined way like models > snapshots or even models > subfolder > snapshots. Is it possible for this to be flexible yet restrictive to a folder called 'snapshots', somewhat like we do with tests > generic?

  • what would happen to their resource_type - model or snapshot (incremental is model)? if we went with model, that would be an id change, adjustment to selector syntax for dbt build, DBT_EXCLUDE_RESOURCE_TYPE wouldn’t work for excluding snapshots, etc.
    I think it would still be a snapshot because of what I said for bullet 1. Snapshots are actually "building" a source of data that otherwise wouldn't exist, where with an incremental (even though you can configure it like a snapshot), that use case is generally more forgiving as the data we recommend already has history that can always be referred back to. I think it makes sense that we stay somewhat protective over why we don't define it as any old model.