dbt-labs/dbt-core

[CT-2701] Snapshot dbt_updated_at not updated

Opened this issue · 8 comments

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

This was brought up by a customer and I am kinda surprised I have never heard this before.

Our docs show

two fields in the existing row should be updated when the updated_at timestamp is changed signifying a change of some value, the dbt_valid_to and the dbt_updated_at

image

Looks like only one update is happening and the dbt_updated_at does not get updated.

Screen Shot 2023-06-14 at 2 06 07 PM

Expected Behavior

According to the docs, the older row should have two fields updated:
dbt_valid_to
dbt_updated_at

Steps To Reproduce

  • create snapshot with timestamp check
  • add record 1 as id, '2023-01-01' as updated_at
  • run snapshot
  • modify record 1 as id, '2023-01-02' as updated_at
  • run snapshot
  • check the records in the output and see if dbt_updated_at has been modified in the original row

Relevant log output

No response

Environment

- OS:
- Python:
- dbt: tested 1.1 and 1.4

Which database adapter are you using with dbt?

No response

Additional Context

No response

Thanks for opening this @patkearns10!

See below for the reproducible example I'm using (with dbt-postgres) that shows the same thing that you are reporting.

But the thing I'm wondering: is this actually a problem that needs to be solved in the implementation of snapshots that use the timestamp strategy? Or would updating the documentation suffice?

Reprex

snapshots/my_snapshot.sql

{% snapshot my_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at',
    )
}}

{% if var("my_data_version", 1) == 1 %}

    select 1 as id, '2023-01-01'::timestamp as updated_at

{% else %}

    select 1 as id, '2023-01-02'::timestamp as updated_at

{% endif %}

{% endsnapshot %}

Take the 1st snapshot:

dbt snapshot --vars '{"my_data_version": 1}'
dbt show --inline "select * from {{ ref('my_snapshot') }}"
id updated_at dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
1 2023-01-01 00:00:00 5aec37f35c393a7ef... 2023-01-01 00:00:00 2023-01-01 00:00:00

Take a 2nd snapshot:

dbt snapshot --vars '{"my_data_version": 2}'
dbt show --inline "select * from {{ ref('my_snapshot') }}"
id updated_at dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
1 2023-01-01 00:00:00 5aec37f35c393a7ef... 2023-01-01 00:00:00 2023-01-01 00:00:00 2023-01-02 00:00:00
1 2023-01-02 00:00:00 8ff9567d22a84a2c9... 2023-01-02 00:00:00 2023-01-02 00:00:00

Here's the default implementation, which we can see only updates the dbt_valid_to (but not dbt_updated_at) for rows that are no longer valid:

when matched
and DBT_INTERNAL_DEST.dbt_valid_to is null
and DBT_INTERNAL_SOURCE.dbt_change_type in ('update', 'delete')
then update
set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to

Here's the associated implementation for dbt-postgres:

update {{ target }}
set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to
from {{ source }} as DBT_INTERNAL_SOURCE
where DBT_INTERNAL_SOURCE.dbt_scd_id::text = {{ target }}.dbt_scd_id::text
and DBT_INTERNAL_SOURCE.dbt_change_type::text in ('update'::text, 'delete'::text)
and {{ target }}.dbt_valid_to is null;

If we do nothing, we should update the documentation.
If we want it to be, in my opinion, correct, then we should update the code and not update the documentation.

In my opinion it does not make sense that the dbt_valid_to would be greater timestamp than dbt_updated_at. The record was updated, so it should have a dbt_updated_at at that time.

I do understand this is changing something that has been around for 4 years, so there is probably some concern there.

Agreed that it's been implemented this way for 4+ years and it might do more harm than good to change it. At the very least, we'd need to understand all the implications extremely thoroughly before considering making such a change.

What row-level data point would a user want that they can't get right now?

Right now, it looks like dbt_updated_at basically means "created at" (when using the check strategy) and "earliest valid time" otherwise. If a user wants the latest time at which the row was "modified at", then this might do the trick:

coalesce(`dbt_valid_to`, `dbt_valid_from`)

If you are always looking for a "system time" (a la SQL:2011), then that is not possible with the timestamp strategy, because that information is not always* captured and stored -- it is only captured and stored for the check strategy.

* When using invalidate_hard_deletes and a key is hard-deleted, dbt_valid_to will be a "system time" rather than an "application time".

I'm leaning heavily towards leaving the implementation as-is and updating the documentation.

We have a listing of gotchas related to snapshots listed here, and the situation you are bringing up looks like it would another good entry for that list.

@patkearns10 Thanks again for noticing this behavior and posting an elucidating write-up 🤩 Since there would be more risk than reward to change the behavior, I'm going to close this as wont_fix. I've opened a PR to update the docs in dbt-labs/docs.getdbt.com#3540

An option in user space

If a user needs/wants the data to be populated differently, they can override the key macro in their dbt project like this (if they are using Postgres):

{% macro postgres__snapshot_merge_sql(target, source, insert_cols) -%}
    {%- set insert_cols_csv = insert_cols | join(', ') -%}

    update {{ target }}
    set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to
    from {{ source }} as DBT_INTERNAL_SOURCE
    where DBT_INTERNAL_SOURCE.dbt_scd_id::text = {{ target }}.dbt_scd_id::text
      and DBT_INTERNAL_SOURCE.dbt_change_type::text in ('update'::text, 'delete'::text)
      and {{ target }}.dbt_valid_to is null;

    insert into {{ target }} ({{ insert_cols_csv }})
    select {% for column in insert_cols -%}
        DBT_INTERNAL_SOURCE.{{ column }} {%- if not loop.last %}, {%- endif %}
    {%- endfor %}
    from {{ source }} as DBT_INTERNAL_SOURCE
    where DBT_INTERNAL_SOURCE.dbt_change_type::text = 'insert'::text;
{% endmacro %}

The logic to use would vary by adapter. If an adapter doesn't provide their own override of snapshot_merge_sql, then it use the default logic here, and that would be the code to copy-paste-modify into their own dbt project.

Right on, I understand the sentiment about not wanting to update the code.

Your workaround is precisely how I tested as well:
Screen Shot 2023-06-16 at 10 37 31 AM

hi @patkearns10 and @dbeatty10,

thanks for this post, i had the same issue and fix this that way.

however by overriding the macro snapshot_merge_sql in he dbt project, it looks like this affect all existing strategies (timestamp, check, custom strategy created).

Is it possible to change the default behaviour of the macro only for one strategy (for example the custom strategy created) and not the others in he dbt project ?

Thanks in advance

@ntsteph it sounds like you've created a custom snapshot strategy, and you're wondering if you can change the default behavior of snapshot_merge_sql for just that custom snapshot strategy?

If so, you'd need to override the definition of the snapshot materialization for your adapter (here for dbt-postgres). So although it's possible, there's no "easy" way to override the logic for snapshot_merge_sql for a single snapshot strategy (but not the others).

From snapshot discussion:

Problem 7: dbt_updated_at is a "system time" for the check strategy (a la SQL:2011), but it is an "application time" for the timestamp strategy.

  • 💡 Feature idea 9: What if dbt either added a new column named dbt_modified_at or actually updated dbt_updated_at at the same time as dbt_valid_to?
  • 💡 Feature idea 10: What if dbt treated dbt_modified_at/dbt_updated_at as a "system time" for both snapshot strategies (since dbt_valid_from & dbt_valid_to represent the "application time" for both strategies)?
  • 💡 Feature idea 11: To complement dbt_modified_at, what if dbt added an immutable dbt_created_at column (representing the "system time" that the row was inserted)?