dbt-labs/dbt-core

[CT-2869] Add new snapshot records when source removes a field

owen-mc-git opened this issue · 2 comments

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Summary,
Using DBT snapshots, if a field is removed from source no new records are created in the snapshot table to show that the prior values no longer exists in source. This is to request that this functionality is added/supported.
This functionality exists when a new field is added but not the reverse.

Expand:
When using DBT snapshots, if a new field added in the source, it will also be added to the snapshot table and a new record is created for each existing record in the table to show the new field added/updated at this new point in time.

However if a field is removed from source no new records are created in the snapshot table to show that the prior value no longer exists in source. When a net new record is added we do see a null value appear in the removed field location.
Not having a new record created for data that is removed gives a false sense that the source still has these values.

Example:

Source orig COL_1 COL_2 COL_3
1 abc1 def2 ghi3
2 abc2 def3 ghi4
3 abc3 def4 ghi5
Snapshot orig COL_1 COL_2 COL_3 period_start period_end
1 abc1 def2 ghi3 23-Jul-2023 null
2 abc2 def3 ghi4 23-Jul-2023 null
3 abc3 def4 ghi5 23-Jul-2023 null

source adding a new column on the 24th

Source Add column COL_1 COL_2 COL_3 COL4
1 abc1 def2 ghi3 jkl4
2 abc2 def3 ghi4 jkl5
3 abc3 def4 ghi5 jkl6
Snapshot changes COL_1 COL_2 COL_3 COL_4 period_start period_end
1 abc1 def2 ghi3 null 23-Jul-2023 24-Jul-2023
2 abc2 def3 ghi4 null 23-Jul-2023 24-Jul-2023
3 abc3 def4 ghi5 null 23-Jul-2023 24-Jul-2023
1 abc1 def2 ghi3 jkl4 24-Jul-2023 null
2 abc2 def3 ghi4 jkl5 24-Jul-2023 null
3 abc3 def4 ghi5 jkl6 24-Jul-2023 null

source removing a column and adding new record on the 25th

Source remove column COL_1 COL_2 COL4
1 abc1 def2 jkl4
2 abc2 def3 jkl5
3 abc3 def4 jkl6
4 abc4 def5 jkl7
Snapshot unchanged COL_1 COL_2 COL_3 COL_4 period_start period_end
1 abc1 def2 ghi3 null 23-Jul-2023 24-Jul-2023
2 abc2 def3 ghi4 null 23-Jul-2023 24-Jul-2023
3 abc3 def4 ghi5 null 23-Jul-2023 24-Jul-2023
1 abc1 def2 ghi3 jkl4 24-Jul-2023 null
2 abc2 def3 ghi4 jkl5 24-Jul-2023 null
3 abc3 def4 ghi5 jkl6 24-Jul-2023 null
4 abc4 def5 null jkl7 25-Jul-2023 null

what I would expect to see in a true SCD table

Snapshot COL_1 COL_2 COL_3 COL_4 period_start period_end
1 abc1 def2 ghi3 null 23-Jul-2023 24-Jul-2023
2 abc2 def3 ghi4 null 23-Jul-2023 24-Jul-2023
3 abc3 def4 ghi5 null 23-Jul-2023 24-Jul-2023
1 abc1 def2 ghi3 jkl4 24-Jul-2023 25-Jul-2023
2 abc2 def3 ghi4 jkl5 24-Jul-2023 25-Jul-2023
3 abc3 def4 ghi5 jkl6 24-Jul-2023 25-Jul-2023
1 abc1 def2 null jkl4 25-Jul-2023 null
2 abc2 def3 null jkl5 25-Jul-2023 null
3 abc3 def4 null jkl6 25-Jul-2023 null
4 abc4 def5 null jkl7 25-Jul-2023 null

Describe alternatives you've considered

No response

Who will this benefit?

users of snapshots who wish to see the true state of a source at the correct points in time

Are you interested in contributing this feature?

No response

Anything else?

FAQ section "What happens if I add new columns to my snapshot query?" https://docs.getdbt.com/docs/build/snapshots

+1, would love to see this handled as a config option similar to invalidate_hard_deletes.

+1