dbt-Trino snapshot cannot create __dbt_tmp table after first run
mx-dwolff opened this issue · 12 comments
A snapshot model can be created and run once, but all subsequent attempts to run the snapshot will result in an error. This appears to be because dbt attempts to create a temporary table in the same location as the final snapshot table. By default, Trino does not allow multiple tables to be created in the same location, since that can lead to corrupt data.
Additionally, there does not appear to be any way to specify multiple location parameters in the config block (e.g. one for the final snapshot table and another for any temporary tables created during processing).
☝️ This is with Iceberg tables in AWS S3, using the AWS Glue catalog. It then tries to write __dbt_tmp to the same S3 location as the final table.
I am using Hive Metastore and have encountered the same issue with S3 storage on MinIO. How can this be solved?
Can you provide more details?
What I have gathered so far:
Catalog type = Iceberg
Metastore = Glue or Hive Metastore
On which platform (Trino, Galaxy, SEP?), which versions and what catalog properties are set.
Which dbt-trino version are you using?
platform = Trino
Trino version = 425 (upgrading to 426 shortly)
Catalog properties = nothing set (so default?)
Running with dbt=1.5.2
Registered adapter: trino=1.5.0
Please let me know if you have any other questions
@mx-dwolff Can you show exact error log? And can you also show snapshot model configuration?
error log:
20:27:41 Database Error in snapshot accounts_snapshot (snapshots\accounts_snapshot.sql)
20:27:41 TrinoExternalError(type=EXTERNAL, name=ICEBERG_FILESYSTEM_ERROR, message="Cannot create a table on a non-empty location: s3://bucket_location/iceberg/mgp/protected/accounts_
snapshot, set 'iceberg.unique-table-location=true' in your Iceberg catalog properties to use unique table locations for every table.", query_id=20230927_202740_01958_28gf4)
20:27:41
20:27:41 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
model config:
{{
    config(
        materialized = 'snapshot',
        on_table_exists = 'drop',
        unique_key = 'account_number',
        strategy = 'timestamp',
        updated_at = 'derv_updated_at',
        properties = {
            "format": "'PARQUET'",
            "format_version": "2",
            "location": "'s3://bucket_location/iceberg/mgp/protected/accounts_snapshot/'"
        }
    )
}}
FYI -- "bucket_location" was my edit in place of actual bucket name
@mx-dwolff Currently, snapshots do not work correctly when the location property is specified. The snapshot model is initially created in the specified location, and on subsequent runs of the dbt snapshot command, dbt attempts to create the temp table in that same location, resulting in the error.
When the location table property is omitted, the content of the table is stored in a subdirectory under the directory corresponding to the schema location (docs on that).
Therefore, omitting the location property would be an immediate solution.
So, is there a specific reason why you are explicitly specifying the table location? Wouldn't default location (subdirectory in schema location) work for your case?
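To illustrate the suggested workaround, here is a sketch of the same snapshot config with the location property simply removed (all other values taken from the config posted above); Trino/Iceberg then derives the table's storage path from the schema location:

```sql
{# Sketch: identical to the reported config, minus "location".
   With no explicit location, the snapshot table and the temporary
   __dbt_tmp table each get their own subdirectory under the schema
   location, so they no longer collide. #}
{{
    config(
        materialized = 'snapshot',
        on_table_exists = 'drop',
        unique_key = 'account_number',
        strategy = 'timestamp',
        updated_at = 'derv_updated_at',
        properties = {
            "format": "'PARQUET'",
            "format_version": "2"
        }
    )
}}
```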
@damian3031 Thanks for this info! I will give that a shot and follow up if errors continue.
I do still find it a bit odd that other dbt operations using a similar approach -- such as an incremental model with a merge strategy -- can create temporary views (instead of tables) that avoid this problem altogether. Is there a particular reason an incremental model can use a temporary view whereas snapshots require a temporary table? Specifying a location property isn't an absolute necessity, but it provides greater clarity and control over where the data is stored.
@mx-dwolff Using a view puts us at risk of losing track of changes, because in a view the columns are static while the data is dynamic. For example, if the table schema changes during the snapshotting, we could have changes merged into the snapshot table that don't contain values for columns added after the creation of the snapshot view.
If the snapshot uses a last-modified timestamp, any values for columns added since creating the view won't be inserted into the snapshot table. On the next run they will be ignored, since the max modified timestamp in the snapshot table implies those values have already been processed.
Because of the above, we can't use views in the snapshot materialization.
One potential solution could be to create a schema with a specific location first, by adding the config below to dbt_project.yml:
on-run-start: "create schema if not exists snapshots_schema with (location = 's3://datalake/iceberg/mgp/protected/accounts_snapshot')"
then removing the location property and adding target_schema='snapshots_schema' to the model configuration.
This way, the schema is created in the specified location, and tables are created in subdirectories within the schema location. The temporary table will also be created in a subdirectory, so it won't interfere with the snapshot table.
It may be a bit cumbersome to specify this in on-run-start config, as it will be executed at the beginning of every dbt command, but it will work.
There is some discussion about configuring and managing schemas in similar way to models, which would be the right way to do it: dbt-labs/dbt-core#5781
Currently there is no easy way to support the location property for snapshot models in dbt-trino.
As mentioned, the solution is to remove that property.
Since version 1.7.1, dbt-trino raises an explicit error about not supporting this combination.