[CT-2701] Snapshot dbt_updated_at not updated
Opened this issue · 8 comments
Is this a new bug in dbt-core?
- I believe this is a new bug in dbt-core
- I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
This was brought up by a customer, and I'm kinda surprised I have never heard of this before.
Two fields in the existing row should be updated when the `updated_at` timestamp changes (signifying a change of some value): `dbt_valid_to` and `dbt_updated_at`.
It looks like only one update is happening, and `dbt_updated_at` does not get updated.
Expected Behavior
According to the docs, the older row should have two fields updated:
- `dbt_valid_to`
- `dbt_updated_at`
Steps To Reproduce
- create a snapshot with the `timestamp` strategy
- add a record: `1 as id, '2023-01-01' as updated_at`
- run the snapshot
- modify the record: `1 as id, '2023-01-02' as updated_at`
- run the snapshot
- check the records in the output and see if `dbt_updated_at` has been modified in the original row
Relevant log output
No response
Environment
- OS:
- Python:
- dbt: tested 1.1 and 1.4
Which database adapter are you using with dbt?
No response
Additional Context
No response
Thanks for opening this @patkearns10!
See below for the reproducible example I'm using (with dbt-postgres) that shows the same thing that you are reporting.
But the thing I'm wondering: is this actually a problem that needs to be solved in the implementation of snapshots that use the `timestamp` strategy? Or would updating the documentation suffice?
Reprex
snapshots/my_snapshot.sql
{% snapshot my_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at',
    )
}}

{% if var("my_data_version", 1) == 1 %}
select 1 as id, '2023-01-01'::timestamp as updated_at
{% else %}
select 1 as id, '2023-01-02'::timestamp as updated_at
{% endif %}

{% endsnapshot %}
Take the 1st snapshot:
dbt snapshot --vars '{"my_data_version": 1}'
dbt show --inline "select * from {{ ref('my_snapshot') }}"
| id | updated_at | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
|---|---|---|---|---|---|
| 1 | 2023-01-01 00:00:00 | 5aec37f35c393a7ef... | 2023-01-01 00:00:00 | 2023-01-01 00:00:00 | |
Take a 2nd snapshot:
dbt snapshot --vars '{"my_data_version": 2}'
dbt show --inline "select * from {{ ref('my_snapshot') }}"
| id | updated_at | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
|---|---|---|---|---|---|
| 1 | 2023-01-01 00:00:00 | 5aec37f35c393a7ef... | 2023-01-01 00:00:00 | 2023-01-01 00:00:00 | 2023-01-02 00:00:00 |
| 1 | 2023-01-02 00:00:00 | 8ff9567d22a84a2c9... | 2023-01-02 00:00:00 | 2023-01-02 00:00:00 | |
Here's the default implementation, which we can see only updates `dbt_valid_to` (but not `dbt_updated_at`) for rows that are no longer valid:
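Roughly, the default `snapshot_merge_sql` is a `merge` whose matched branch sets only `dbt_valid_to` (a sketch of the stock macro at the time of writing, not the exact source):

```sql
{% macro default__snapshot_merge_sql(target, source, insert_cols) -%}
    {%- set insert_cols_csv = insert_cols | join(', ') -%}

    merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_SOURCE.dbt_scd_id = DBT_INTERNAL_DEST.dbt_scd_id

    -- rows that changed (or were hard-deleted): only dbt_valid_to is set
    when matched
     and DBT_INTERNAL_DEST.dbt_valid_to is null
     and DBT_INTERNAL_SOURCE.dbt_change_type in ('update', 'delete')
        then update
        set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to

    -- brand-new rows are inserted as-is
    when not matched
     and DBT_INTERNAL_SOURCE.dbt_change_type = 'insert'
        then insert ({{ insert_cols_csv }})
        values ({{ insert_cols_csv }})

{% endmacro %}
```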
Here's the associated implementation for dbt-postgres:
If we do nothing to the code, we should update the documentation.
If we want the behavior to be (in my opinion) correct, then we should update the code and leave the documentation as-is.
In my opinion it does not make sense that `dbt_valid_to` would be a greater timestamp than `dbt_updated_at`. The record was updated, so it should have a `dbt_updated_at` reflecting that time.
I do understand this is changing something that has been around for 4 years, so there is probably some concern there.
Agreed that it's been implemented this way for 4+ years and it might do more harm than good to change it. At the very least, we'd need to understand all the implications extremely thoroughly before considering making such a change.
What row-level data point would a user want that they can't get right now?
Right now, it looks like `dbt_updated_at` basically means "created at" (when using the `check` strategy) and "earliest valid time" otherwise. If a user wants the latest time at which the row was "modified at", then this might do the trick:
coalesce(`dbt_valid_to`, `dbt_valid_from`)
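For example, against the reprex snapshot above, a downstream model could compute that like so (a sketch; only the `coalesce` expression comes from the suggestion above, and the model name is made up):

```sql
-- models/my_snapshot_modified_at.sql (hypothetical downstream model)
select
    id,
    dbt_valid_from,
    dbt_valid_to,
    -- latest time at which this snapshot row was touched
    coalesce(dbt_valid_to, dbt_valid_from) as row_modified_at
from {{ ref('my_snapshot') }}
```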
If you are always looking for a "system time" (a la SQL:2011), then that is not possible with the `timestamp` strategy, because that information is not always* captured and stored -- it is only captured and stored for the `check` strategy.

\* When using `invalidate_hard_deletes` and a key is hard-deleted, `dbt_valid_to` will be a "system time" rather than an "application time".
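For reference, that hard-delete handling is opt-in via the snapshot config; a minimal sketch extending the reprex config above:

```sql
{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at',
        -- opt in to closing out rows whose keys disappear from the source
        invalidate_hard_deletes=True,
    )
}}
```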
I'm leaning heavily towards leaving the implementation as-is and updating the documentation.
We have a list of gotchas related to snapshots here, and the situation you are bringing up looks like it would be another good entry for that list.
@patkearns10 Thanks again for noticing this behavior and posting an elucidating write-up 🤩 Since there would be more risk than reward to change the behavior, I'm going to close this as `wont_fix`. I've opened a PR to update the docs in dbt-labs/docs.getdbt.com#3540
An option in user space
If a user needs/wants the data to be populated differently, they can override the relevant macro in their dbt project, like this (if they are using Postgres -- this is the stock dbt-postgres macro with one added `dbt_updated_at` assignment):
{% macro postgres__snapshot_merge_sql(target, source, insert_cols) -%}
    {%- set insert_cols_csv = insert_cols | join(', ') -%}

    update {{ target }}
    set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to,
        -- the one change vs. the stock dbt-postgres macro: also stamp
        -- dbt_updated_at with the time the row was invalidated
        dbt_updated_at = DBT_INTERNAL_SOURCE.dbt_valid_to
    from {{ source }} as DBT_INTERNAL_SOURCE
    where DBT_INTERNAL_SOURCE.dbt_scd_id::text = {{ target }}.dbt_scd_id::text
      and DBT_INTERNAL_SOURCE.dbt_change_type::text in ('update'::text, 'delete'::text)
      and {{ target }}.dbt_valid_to is null;

    insert into {{ target }} ({{ insert_cols_csv }})
    select {% for column in insert_cols -%}
        DBT_INTERNAL_SOURCE.{{ column }} {%- if not loop.last %}, {%- endif %}
    {%- endfor %}
    from {{ source }} as DBT_INTERNAL_SOURCE
    where DBT_INTERNAL_SOURCE.dbt_change_type::text = 'insert'::text;
{% endmacro %}
The logic to use would vary by adapter. If an adapter doesn't provide its own override of `snapshot_merge_sql`, then it uses the default logic here, and that would be the code to copy-paste-modify into their own dbt project.
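For an adapter that falls back to that default (merge-based) macro, the copy-paste-modify version would look roughly like this; the extra `dbt_updated_at` assignment is the only intended change, and the rest is a sketch of the stock macro, so compare it against the version shipped with your dbt release:

```sql
{% macro default__snapshot_merge_sql(target, source, insert_cols) -%}
    {%- set insert_cols_csv = insert_cols | join(', ') -%}

    merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_SOURCE.dbt_scd_id = DBT_INTERNAL_DEST.dbt_scd_id

    when matched
     and DBT_INTERNAL_DEST.dbt_valid_to is null
     and DBT_INTERNAL_SOURCE.dbt_change_type in ('update', 'delete')
        then update
        set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to,
            -- the one intended change: also stamp the invalidation time
            dbt_updated_at = DBT_INTERNAL_SOURCE.dbt_valid_to

    when not matched
     and DBT_INTERNAL_SOURCE.dbt_change_type = 'insert'
        then insert ({{ insert_cols_csv }})
        values ({{ insert_cols_csv }})

{% endmacro %}
```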
Hi @patkearns10 and @dbeatty10,
Thanks for this post. I had the same issue and fixed it that way.
However, by overriding the `snapshot_merge_sql` macro in the dbt project, it looks like this affects all existing strategies (timestamp, check, and any custom strategy created).
Is it possible to change the default behaviour of the macro only for one strategy (for example, the custom strategy created) and not the others in the dbt project?
Thanks in advance
@ntsteph it sounds like you've created a custom snapshot strategy, and you're wondering if you can change the default behavior of `snapshot_merge_sql` for just that custom snapshot strategy?
If so, you'd need to override the definition of the snapshot materialization for your adapter (here for dbt-postgres). So although it's possible, there's no "easy" way to override the logic for `snapshot_merge_sql` for a single snapshot strategy (but not the others).
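That said, one lighter-weight sketch (not part of the suggestion above, and assuming both that `config.get('strategy')` resolves inside this macro during a snapshot run and that `timestamp_with_updated_at` is the hypothetical name of the custom strategy) is to keep a single macro override and branch on the configured strategy:

```sql
{% macro postgres__snapshot_merge_sql(target, source, insert_cols) -%}
    {%- set insert_cols_csv = insert_cols | join(', ') -%}
    {# Hypothetical custom strategy name -- adjust to your own. #}
    {%- set stamp_updated_at = config.get('strategy') == 'timestamp_with_updated_at' -%}

    update {{ target }}
    set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to
        {%- if stamp_updated_at %}
        , dbt_updated_at = DBT_INTERNAL_SOURCE.dbt_valid_to
        {%- endif %}
    from {{ source }} as DBT_INTERNAL_SOURCE
    where DBT_INTERNAL_SOURCE.dbt_scd_id::text = {{ target }}.dbt_scd_id::text
      and DBT_INTERNAL_SOURCE.dbt_change_type::text in ('update'::text, 'delete'::text)
      and {{ target }}.dbt_valid_to is null;

    insert into {{ target }} ({{ insert_cols_csv }})
    select {% for column in insert_cols -%}
        DBT_INTERNAL_SOURCE.{{ column }} {%- if not loop.last %}, {%- endif %}
    {%- endfor %}
    from {{ source }} as DBT_INTERNAL_SOURCE
    where DBT_INTERNAL_SOURCE.dbt_change_type::text = 'insert'::text;
{% endmacro %}
```

Snapshots using the other strategies then render the stock update/insert statements unchanged.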
Problem 7: `dbt_updated_at` is a "system time" for the `check` strategy (a la SQL:2011), but it is an "application time" for the `timestamp` strategy.
- 💡 Feature idea 9: What if dbt either added a new column named `dbt_modified_at` or actually updated `dbt_updated_at` at the same time as `dbt_valid_to`?
- 💡 Feature idea 10: What if dbt treated `dbt_modified_at`/`dbt_updated_at` as a "system time" for both snapshot strategies (since `dbt_valid_from` & `dbt_valid_to` represent the "application time" for both strategies)?
- 💡 Feature idea 11: To complement `dbt_modified_at`, what if dbt added an immutable `dbt_created_at` column (representing the "system time" that the row was inserted)?