[Feature] Support configurable management of Table Optimisers for Iceberg tables
antonysouthworth-halter opened this issue · 3 comments
Is this your first time submitting a feature request?
- I have searched the existing issues, and I could not find an existing issue for this feature
Describe the feature
Honestly, I have not fully thought through how it would work; I'm hoping to spark some discussion in the thread.
Perhaps just another config variable for this? E.g.

```sql
{{ config(
    materialized='incremental',  -- (or sp_insert_by_period; not relevant for table/view)
    table_type='iceberg',
    use_glue_automatic_compaction=true,
) }}
```
When `use_glue_automatic_compaction` is specified, we would use the Glue `CreateTableOptimizer`/`UpdateTableOptimizer` API operations to create the optimiser for compaction.
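For illustration, a minimal sketch of what that could do under the hood with `boto3` (the helper name, using the account id as the catalog id, and the create-then-update fallback are all assumptions on my part, not a settled design):

```python
"""Hedged sketch: enable Glue automatic compaction for one Iceberg table."""
import boto3


def enable_automatic_compaction(database: str, table: str, role_arn: str) -> None:
    glue = boto3.client("glue")
    # The catalog id is the AWS account id for the default Glue Data Catalog.
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    params = {
        "CatalogId": account_id,
        "DatabaseName": database,
        "TableName": table,
        "Type": "compaction",
        # role_arn must let Glue read/write the table's S3 data and metadata.
        "TableOptimizerConfiguration": {"roleArn": role_arn, "enabled": True},
    }
    try:
        glue.create_table_optimizer(**params)
    except glue.exceptions.AlreadyExistsException:
        # An optimizer already exists for this table; update it in place.
        glue.update_table_optimizer(**params)
```

The `roleArn` would presumably have to come from model config or the profile, since Glue needs a role it can assume to rewrite the table's data files.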
Describe alternatives you've considered
> You can just `OPTIMIZE {{ this.schema }}.{{ this.identifier }} ...` in your `post_hook`

Yes, but on full-refresh of a very large table (e.g. one requiring `insert_by_period`) this may fail due to a timeout or the Iceberg "not finished, please run compaction again" message. Regardless, I think it would be good to let AWS just handle it.
Caveat: I haven't actually tried the automatic compaction feature, so I have no idea how it performs in practice. Maybe it just scans your entire table once a day and you get charged for 100 DPUs 😂.
Who will this benefit?
Anybody with large datasets in Iceberg. I would think there's quite a lot of overlap with users of `insert_by_period`.
Are you interested in contributing this feature?
Maybe; it depends on how much work it would be.
Anything else?
#514 is somewhat related, in the realm of "table optimisation management".
@antonysouthworth-halter regarding this:

> You can just `OPTIMIZE {{ this.schema }}.{{ this.identifier }} ...` in your `post_hook`. Yes, but on full-refresh of a very large table (e.g. requiring `insert_by_period`) this may fail due to timeout or the Iceberg "not finished, please run compaction again" message. Regardless I think it would be good to let AWS just handle it.
If you partition your table, I can share a "post-hook" that is ugly but does the job: it pretty much optimises your table by partition values, using a batch size < 100 to avoid the partition limitation issue in Athena.
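Roughly, the idea is something like the following (a hedged Python sketch of the approach, not the actual hook; all names are illustrative, and it assumes a single string-typed partition column and an Athena workgroup with a query result location configured):

```python
"""Hedged sketch of the batched-OPTIMIZE idea; not the actual post-hook."""
import time

import boto3

athena = boto3.client("athena")


def run(sql: str, workgroup: str = "primary") -> str:
    """Start an Athena query, block until it finishes, and return its query id."""
    qid = athena.start_query_execution(
        QueryString=sql, WorkGroup=workgroup
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"
        ]["Status"]["State"]
        if state == "SUCCEEDED":
            return qid
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {qid} ended in state {state}")
        time.sleep(5)


def optimize_by_partition(schema: str, table: str, partition_col: str,
                          batch_size: int = 99) -> None:
    # Enumerate distinct partition values via Athena (Iceberg partitions are
    # not listed in the Glue catalog the way Hive-style partitions are).
    qid = run(f'SELECT DISTINCT {partition_col} FROM "{schema}"."{table}"')
    values = []
    pages = athena.get_paginator("get_query_results").paginate(QueryExecutionId=qid)
    for page in pages:
        values += [row["Data"][0]["VarCharValue"] for row in page["ResultSet"]["Rows"]]
    values = values[1:]  # the first row of the result set is the column header

    # One OPTIMIZE per batch, staying under Athena's per-query partition limit.
    for i in range(0, len(values), batch_size):
        in_list = ", ".join(f"'{v}'" for v in values[i:i + batch_size])
        run(
            f'OPTIMIZE "{schema}"."{table}" REWRITE DATA USING BIN_PACK '
            f"WHERE {partition_col} IN ({in_list})"
        )
```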
That said, what you describe can be relevant and should be relatively easy to implement:
- we need a new method in `impl.py` that makes a `boto3` call to add automatic compaction (sketched below)
- we expose the method via a macro that can then be called in the different table materializations
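Roughly, the `impl.py` side might look like the following; this is a sketch under assumptions (the method and macro names are hypothetical, and `enable_automatic_compaction` is the `boto3` helper sketched earlier in the thread):

```python
# Hedged sketch; the real AthenaAdapter in impl.py subclasses SQLAdapter, and
# these method/macro names are hypothetical.
from dbt.adapters.base import available  # marks adapter methods as callable from Jinja


class AthenaAdapter:
    @available
    def create_compaction_optimizer(self, relation, role_arn: str) -> None:
        # Delegate to the boto3 helper sketched earlier, so a materialization
        # or post-hook macro could call:
        #   {% do adapter.create_compaction_optimizer(this, role_arn) %}
        enable_automatic_compaction(relation.schema, relation.identifier, role_arn)
```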
> If you partition your table I can share a "post-hook", that is ugly, but does the job, and pretty much optimize your table by partition values, using a batch size < 100, to avoid partition limitation issue in athena.

😮 I would love to see it! I think it might also be helpful for others who stumble upon this ticket.
Also note that we now have a fix for the `optimize` `post_hook` in dbt-athena, which will retry when it hits the "optimize needs to be run again" issue; cf. https://github.com/dbt-athena/dbt-athena/blob/34633a7f5679344852d1004991fc814ab385dadb/dbt/adapters/athena/impl.py#L1336
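The retry pattern is roughly the following (a hedged illustration, not the linked implementation; the matched error text and the `run_query` callable are assumptions based on the message quoted earlier in this thread):

```python
# Hedged illustration of retry-on-"run compaction again"; not the linked code.
import time


def optimize_with_retries(run_query, sql: str, max_attempts: int = 4) -> None:
    """run_query: any callable that executes SQL against Athena, raising on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            run_query(sql)
            return
        except Exception as exc:
            # The exact error text is an assumption; retry only on the
            # "needs another compaction run" class of failures.
            if "run compaction again" not in str(exc) or attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # back off, then re-run OPTIMIZE
```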