[CT-2869] Add new snapshot records when source removes a field
owen-mc-git opened this issue · 2 comments
Is this your first time submitting a feature request?
- I have read the expectations for open source contributors
- I have searched the existing issues, and I could not find an existing issue for this feature
- I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
Summary,
Using DBT snapshots, if a field is removed from source no new records are created in the snapshot table to show that the prior values no longer exists in source. This is to request that this functionality is added/supported.
This functionality exists when a new field is added but not the reverse.
Expand:
When using DBT snapshots, if a new field added in the source, it will also be added to the snapshot table and a new record is created for each existing record in the table to show the new field added/updated at this new point in time.
However if a field is removed from source no new records are created in the snapshot table to show that the prior value no longer exists in source. When a net new record is added we do see a null value appear in the removed field location.
Not having a new record created for data that is removed gives a false sense that the source still has these values.
Example:
Source orig | COL_1 | COL_2 | COL_3 |
---|---|---|---|
1 | abc1 | def2 | ghi3 |
2 | abc2 | def3 | ghi4 |
3 | abc3 | def4 | ghi5 |
Snapshot orig | COL_1 | COL_2 | COL_3 | period_start | period_end |
---|---|---|---|---|---|
1 | abc1 | def2 | ghi3 | 23-Jul-2023 | null |
2 | abc2 | def3 | ghi4 | 23-Jul-2023 | null |
3 | abc3 | def4 | ghi5 | 23-Jul-2023 | null |
source adding a new column on the 24th
Source Add column | COL_1 | COL_2 | COL_3 | COL4 |
---|---|---|---|---|
1 | abc1 | def2 | ghi3 | jkl4 |
2 | abc2 | def3 | ghi4 | jkl5 |
3 | abc3 | def4 | ghi5 | jkl6 |
Snapshot changes | COL_1 | COL_2 | COL_3 | COL_4 | period_start | period_end |
---|---|---|---|---|---|---|
1 | abc1 | def2 | ghi3 | null | 23-Jul-2023 | 24-Jul-2023 |
2 | abc2 | def3 | ghi4 | null | 23-Jul-2023 | 24-Jul-2023 |
3 | abc3 | def4 | ghi5 | null | 23-Jul-2023 | 24-Jul-2023 |
1 | abc1 | def2 | ghi3 | jkl4 | 24-Jul-2023 | null |
2 | abc2 | def3 | ghi4 | jkl5 | 24-Jul-2023 | null |
3 | abc3 | def4 | ghi5 | jkl6 | 24-Jul-2023 | null |
source removing a column and adding new record on the 25th
Source remove column | COL_1 | COL_2 | COL4 |
---|---|---|---|
1 | abc1 | def2 | jkl4 |
2 | abc2 | def3 | jkl5 |
3 | abc3 | def4 | jkl6 |
4 | abc4 | def5 | jkl7 |
Snapshot unchanged | COL_1 | COL_2 | COL_3 | COL_4 | period_start | period_end |
---|---|---|---|---|---|---|
1 | abc1 | def2 | ghi3 | null | 23-Jul-2023 | 24-Jul-2023 |
2 | abc2 | def3 | ghi4 | null | 23-Jul-2023 | 24-Jul-2023 |
3 | abc3 | def4 | ghi5 | null | 23-Jul-2023 | 24-Jul-2023 |
1 | abc1 | def2 | ghi3 | jkl4 | 24-Jul-2023 | null |
2 | abc2 | def3 | ghi4 | jkl5 | 24-Jul-2023 | null |
3 | abc3 | def4 | ghi5 | jkl6 | 24-Jul-2023 | null |
4 | abc4 | def5 | null | jkl7 | 25-Jul-2023 | null |
what I would expect to see in a true SCD table
Snapshot | COL_1 | COL_2 | COL_3 | COL_4 | period_start | period_end |
---|---|---|---|---|---|---|
1 | abc1 | def2 | ghi3 | null | 23-Jul-2023 | 24-Jul-2023 |
2 | abc2 | def3 | ghi4 | null | 23-Jul-2023 | 24-Jul-2023 |
3 | abc3 | def4 | ghi5 | null | 23-Jul-2023 | 24-Jul-2023 |
1 | abc1 | def2 | ghi3 | jkl4 | 24-Jul-2023 | 25-Jul-2023 |
2 | abc2 | def3 | ghi4 | jkl5 | 24-Jul-2023 | 25-Jul-2023 |
3 | abc3 | def4 | ghi5 | jkl6 | 24-Jul-2023 | 25-Jul-2023 |
1 | abc1 | def2 | null | jkl4 | 25-Jul-2023 | null |
2 | abc2 | def3 | null | jkl5 | 25-Jul-2023 | null |
3 | abc3 | def4 | null | jkl6 | 25-Jul-2023 | null |
4 | abc4 | def5 | null | jkl7 | 25-Jul-2023 | null |
Describe alternatives you've considered
No response
Who will this benefit?
users of snapshots who wish to see the true state of a source at the correct points in time
Are you interested in contributing this feature?
No response
Anything else?
FAQ section "What happens if I add new columns to my snapshot query?" https://docs.getdbt.com/docs/build/snapshots
+1, would love to see this handled as a config option similar to invalidate_hard_deletes
.
+1