[Bug] New columns not populated with data in a snapshot
elsander opened this issue · 9 comments
Is this a new bug in dbt-core?
- I believe this is a new bug in dbt-core
- I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
I experienced an edge case where new columns get added to a snapshot, but are not populated with data.
Steps to reproduce:
- Create a table with static data
- Snapshot the table with check_cols='all'
- Manually add a new column to the table (populate it with non-null values). Do not update any of the data from step 1.
- Snapshot the table again
After step 2, I had the expected snapshot with one row per unique id, populated with the data from the table in step 1. After step 4, I had the same snapshot, but with a new column added. The new column was populated entirely with nulls.
Expected Behavior
In step 4, I expect the snapshot to add a new row for each unique id, which contains all data in the table, including the values in the new column. Instead, no new rows were added to the snapshot, and the new column values were not populated in the snapshot.
One interesting note: if any previously snapshotted value changes (even in only one row), ALL unique ids get a new snapshot row with data from the new columns. So you would only encounter this problem if the previously snapshotted data stays entirely static.
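This dovetails with how a check-style comparison behaves: a new snapshot row is inserted only when a checked column's value differs between the source and the current snapshot records. The toy Python sketch below (a simplified illustration of that hypothesis, not dbt's actual implementation; `changed_rows` is a hypothetical helper) shows why restricting the comparison to columns the snapshot table already has would leave new-column-only changes undetected:

```python
# Simplified model of a "check"-style snapshot comparison.
# Hypothetical helper, NOT dbt's real code: it only illustrates that
# comparing a restricted set of columns can miss newly added ones.

def changed_rows(snapshot, source, check_cols):
    """Return the ids whose checked columns differ between the
    existing snapshot rows and the current source rows."""
    changed = []
    for row_id, snap_row in snapshot.items():
        src_row = source[row_id]
        if any(snap_row.get(col) != src_row.get(col) for col in check_cols):
            changed.append(row_id)
    return changed

# The existing snapshot only knows about val_1.
snapshot = {"123": {"val_1": 1}, "456": {"val_1": 1}}
# The rebuilt source gains val_2, but val_1 is unchanged.
source = {"123": {"val_1": 1, "val_2": "test"},
          "456": {"val_1": 1, "val_2": "test_2"}}

# Checking only the snapshot's known columns sees no change -> no new rows.
print(changed_rows(snapshot, source, ["val_1"]))            # []
# Checking all source columns would flag every id.
print(changed_rows(snapshot, source, ["val_1", "val_2"]))   # ['123', '456']
```

If the column list used for the comparison were built from the already-existing snapshot table rather than from the source query, that would explain why a value change in an old column (like the val_1 update above) is required before the new column's data appears.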
Steps To Reproduce
- Build a table called snapshot_test, filling in tbl with some valid table for your environment:
SELECT DISTINCT
'123' as id,
1 as val_1
FROM {{ tbl }}
UNION ALL
SELECT DISTINCT
'456' as id,
1 as val_1
FROM {{ tbl }}
- Create an associated snapshot for the table:
{% snapshot snapshot__snapshot_test %}
{{
config(
unique_key='id',
strategy='check',
invalidate_hard_deletes=True,
check_cols='all'
)
}}
SELECT *
FROM {{ ref('snapshot_test') }}
{% endsnapshot %}
- Update and rebuild snapshot_test:
SELECT DISTINCT
'123' as id,
1 as val_1,
'test' as val_2
FROM {{ tbl }}
UNION ALL
SELECT DISTINCT
'456' as id,
1 as val_1,
'test_2' as val_2
FROM {{ tbl }}
- Rerun snapshot
This results in a snapshot with only one row for each id, where the val_2 column is present but filled with nulls.
Note that if you run the following in step 3 instead (note the update to one of the values in val_1):
SELECT DISTINCT
'123' as id,
2 as val_1,
'test' as val_2
FROM {{ tbl }}
UNION ALL
SELECT DISTINCT
'456' as id,
1 as val_1,
'test_2' as val_2
FROM {{ tbl }}
A new row will be added to the snapshot as expected, with a non-null value for the new column.
Relevant log output
No response
Environment
Linux 0584341454eb 5.15.49-linuxkit #1 SMP Tue Sep 13 07:51:46 UTC 2022 x86_64 GNU/Linux
Python version 3.9.19
DBT version 1.6.13, bigquery plugin version 1.6.1
Which database adapter are you using with dbt?
bigquery
Additional Context
No response
Thanks for reporting this @elsander.
I wasn't able to get the same result as you with dbt=1.6.5 and bigquery=1.6.9. Instead, I got this after the 2nd snapshot:
| id  | val_1 | val_2  | dbt_valid_from | dbt_valid_to |
| --- | ----- | ------ | -------------- | ------------ |
| 123 | 1     | test   | 2024-05-04     |              |
| 123 | 1     |        | 2024-05-04     | 2024-05-04   |
| 456 | 1     | test_2 | 2024-05-04     |              |
| 456 | 1     |        | 2024-05-04     | 2024-05-04   |
Could you try using the files below and let me know if there's something that can be tweaked to see the scenario you are reporting?
Reprex
macros/drop_table.sql
{% macro drop_table(name) %}
{% set sql %}
drop table {{ ref(name) }};
{% endset %}
{% do log("query: " ~ sql, info=True) %}
{% do run_query(sql) %}
{% do log("Table dropped", info=True) %}
{% endmacro %}
models/dual.sql
select 'X' as dummy
models/snapshot_test.sql
-- will be overwritten with one of two versions of the file from below
snapshot_test__v1.sql
SELECT DISTINCT
'123' as id,
1 as val_1
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
'456' as id,
1 as val_1
FROM {{ ref("dual") }}
snapshot_test__v2.sql
SELECT DISTINCT
'123' as id,
1 as val_1,
'test' as val_2
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
'456' as id,
1 as val_1,
'test_2' as val_2
FROM {{ ref("dual") }}
analyses/snapsphot_1.sql
select
id,
val_1,
cast({{ date_trunc('day', 'dbt_valid_from') }} as date) as dbt_valid_from,
cast({{ date_trunc('day', 'dbt_valid_to') }} as date) as dbt_valid_to
from {{ ref("snapshot__snapshot_test") }}
order by dbt_valid_from, id
analyses/snapsphot_2.sql
select
id,
val_1,
val_2,
cast({{ date_trunc('day', 'dbt_valid_from') }} as date) as dbt_valid_from,
cast({{ date_trunc('day', 'dbt_valid_to') }} as date) as dbt_valid_to
from {{ ref("snapshot__snapshot_test") }}
order by dbt_valid_from, id
snapshots/snapshot__snapshot_test.sql
{% snapshot snapshot__snapshot_test %}
{{
config(
target_database=target.database,
target_schema=target.schema,
unique_key='id',
strategy='check',
invalidate_hard_deletes=True,
check_cols='all'
)
}}
SELECT *
FROM {{ ref('snapshot_test') }}
{% endsnapshot %}
Run these commands:
dbt run-operation drop_table --args '{name: snapshot__snapshot_test }'
dbt run -s dual
cp snapshot_test__v1.sql models/snapshot_test.sql
dbt run -s models/snapshot_test.sql
dbt snapshot -s snapshots/snapshot__snapshot_test.sql
dbt show -s analyses/snapsphot_1.sql
cp snapshot_test__v2.sql models/snapshot_test.sql
dbt run -s models/snapshot_test.sql
dbt snapshot -s snapshots/snapshot__snapshot_test.sql
dbt show -s analyses/snapsphot_2.sql
Hopefully I didn't forget any files -- just let me know if I did.
@dbeatty10 that is truly bizarre; I used your example and replicated the same issue I reported.
The only differences between your test and my replication of your test:
- My dbt_project.yml pre-specifies the target database and schema for snapshots, so I did not explicitly provide those arguments.
- I have my own table deletion macro, which I ran instead of yours (I confirmed that the tables were in fact deleted before running the test).
- Instead of running dbt show, I queried the snapshot output in the BigQuery console.
We are using slightly different microversions of the dbt/BigQuery plugin, and you're probably using a somewhat different OS environment. My best guess would be that the OS environment or a microversion update is the root cause here. Unfortunately, I can't easily tweak these, because I use a Docker image that I am not the maintainer for.
Thanks for trying this out @elsander!
The OS environment shouldn't make a difference here. I went ahead and adjusted my dbt_project.yml and snapshots/snapshot__snapshot_test.sql as follows:
modified files
dbt_project.yml
name: "my_project"
version: "1.0.0"
config-version: 2
profile: "bigquery"
snapshots:
my_project:
+target_database: "{{ target.database }}"
+target_schema: "{{ target.schema }}"
snapshots/snapshot__snapshot_test.sql
{% snapshot snapshot__snapshot_test %}
{{
config(
unique_key='id',
strategy='check',
invalidate_hard_deletes=True,
check_cols='all'
)
}}
SELECT *
FROM {{ ref('snapshot_test') }}
{% endsnapshot %}
But when I tried this out with the same microversions as you, it still worked as expected for me:
17:17:54 Running with dbt=1.6.13
17:17:55 Registered adapter: bigquery=1.6.1
17:17:55 Found 2 models, 1 snapshot, 2 analyses, 3 seeds, 0 sources, 0 exposures, 0 metrics, 394 macros, 0 groups, 0 semantic models
17:17:55
17:17:55 Concurrency: 10 threads (target='blue')
17:17:55
17:17:57 Previewing node 'snapsphot_2':
| id | val_1 | val_2 | dbt_valid_from | dbt_valid_to |
| --- | ----- | ------ | -------------- | ------------ |
| 123 | 1 | | 2024-05-06 | 2024-05-06 |
| 123 | 1 | test | 2024-05-06 | |
| 456 | 1 | | 2024-05-06 | 2024-05-06 |
| 456 | 1 | test_2 | 2024-05-06 | |
To reduce confounding variables as much as possible, could you try using all the same files and commands as me and share the terminal output?
- dbt_project.yml
- The drop_table table deletion macro from above
- snapshots/snapshot__snapshot_test.sql
- All the same commands
@dbeatty10 Really bizarre -- I tried your version, using all of the same files and commands, and I got the behavior you did, where it's correctly updating the table with new data. The main difference I can imagine being relevant is that I directly specified the target database and schema in the snapshot config.
Here is the snapshot section of my dbt_project.yml in case it's helpful for debugging:
snapshots:
cohort_mart:
+docs:
node_color: "#E54F6D"
ie_warehouse:
+target_schema: snapshots__ie_warehouse
cohort_utility_warehouse:
+target_schema: snapshots__cohort_utility_warehouse
golden_cohort_warehouse:
+target_schema: snapshots__golden_cohort_warehouse
cohort_self_storage:
+target_schema: snapshots__cohort_self_storage
That is bizarre, but also good news that we're narrowing in!
Could you try tweaking my version and see if you can get it to give the same unexpected results that you originally posted?
If you can do that, then we'll have a solid reproducible example ("reprex").
Ok @dbeatty10, I spent some time on this and was able to reproduce the issue with the following changes:
dual.sql
{{ config(schema="cohort_self_storage", materialized="table") }}
select 'X' as dummy
snapshot_test__v1.sql
SELECT DISTINCT
'123' as id,
1 as val_1
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
'456' as id,
1 as val_1
FROM {{ ref("dual") }}
snapshot_test__v2.sql
{{ config(schema="cohort_self_storage", materialized="table") }}
SELECT DISTINCT
'123' as id,
1 as val_1,
'test' as val_2
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
'456' as id,
1 as val_1,
'test_2' as val_2
FROM {{ ref("dual") }}
Without both of those config options, I am not able to reproduce the issue. With them, I do. These match settings I had as defaults in my dbt_project.yml, and if I don't specify them explicitly in the files but instead rely on the defaults in the yml, I still replicate the issue. Let me know if you're able to reproduce on your end.
@elsander I still haven't been able to reproduce this.
Do you get something different when you use the files and commands below?
Do you see anything I've configured differently than the example you used to reproduce the issue?
Project files and commands
macros/drop_table.sql
{% macro drop_table(name) %}
{% set sql %}
drop table {{ ref(name) }};
{% endset %}
{% do log("query: " ~ sql, info=True) %}
{% do run_query(sql) %}
{% do log("Table dropped", info=True) %}
{% endmacro %}
models/dual.sql
{{ config(schema="cohort_self_storage", materialized="table") }}
select 'X' as dummy
snapshot_test__v1.sql
SELECT DISTINCT
'123' as id,
1 as val_1
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
'456' as id,
1 as val_1
FROM {{ ref("dual") }}
snapshot_test__v2.sql
{{ config(schema="cohort_self_storage", materialized="table") }}
SELECT DISTINCT
'123' as id,
1 as val_1,
'test' as val_2
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
'456' as id,
1 as val_1,
'test_2' as val_2
FROM {{ ref("dual") }}
analyses/snapsphot_1.sql
select
id,
val_1,
cast({{ date_trunc('day', 'dbt_valid_from') }} as date) as dbt_valid_from,
cast({{ date_trunc('day', 'dbt_valid_to') }} as date) as dbt_valid_to
from {{ ref("snapshot__snapshot_test") }}
order by dbt_valid_from, id
analyses/snapsphot_2.sql
select
id,
val_1,
val_2,
cast({{ date_trunc('day', 'dbt_valid_from') }} as date) as dbt_valid_from,
cast({{ date_trunc('day', 'dbt_valid_to') }} as date) as dbt_valid_to
from {{ ref("snapshot__snapshot_test") }}
order by dbt_valid_from, id
Run these commands:
dbt run-operation drop_table --args '{name: snapshot__snapshot_test }'
dbt run -s dual
cp snapshot_test__v1.sql models/snapshot_test.sql
dbt run -s models/snapshot_test.sql
dbt snapshot -s snapshots/snapshot__snapshot_test.sql
dbt show -s analyses/snapsphot_1.sql
cp snapshot_test__v2.sql models/snapshot_test.sql
dbt run -s models/snapshot_test.sql
dbt snapshot -s snapshots/snapshot__snapshot_test.sql
dbt show -s analyses/snapsphot_2.sql
Get this output:
16:17:14 Previewing node 'snapsphot_2':
| id | val_1 | val_2 | dbt_valid_from | dbt_valid_to |
| --- | ----- | ------ | -------------- | ------------ |
| 123 | 1 | test | 2024-05-09 | |
| 123 | 1 | | 2024-05-09 | 2024-05-09 |
| 456 | 1 | test_2 | 2024-05-09 | |
| 456 | 1 | | 2024-05-09 | 2024-05-09 |
Yeah, I reran it today, with the same setup as before, and I couldn't reproduce it, either. I'm really at a loss. It was on the same commit where I'd replicated the issue before. It's really bizarre.
I think my next step will be to see if others on my team can reproduce the issue as well from my original setup, to figure out if it's something about the DBT project config, or something specific to my computer/Docker container/??.