dbt-labs/dbt-athena

[Feature] Support configurable management of Table Optimisers for Iceberg tables

antonysouthworth-halter opened this issue · 3 comments

Is this your first time submitting a feature request?

  • I have searched the existing issues, and I could not find an existing issue for this feature

Describe the feature

https://aws.amazon.com/blogs/aws/aws-glue-data-catalog-now-supports-automatic-compaction-of-apache-iceberg-tables/

Honestly, I have not fully thought through how it would work; I'm hoping to spark some discussion in the thread.

Perhaps just another config variable for this? E.g.

{{config(
    materialized='incremental', -- (or sp_insert_by_period; not relevant for table/view)
    table_type='iceberg',
    use_glue_automatic_compaction=true,
)}}

When use_glue_automatic_compaction is specified, then we would use the Glue {Create,Update}TableOptimizer API operations to create the optimiser for compaction.
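For illustration, a minimal boto3 sketch of the kind of call the adapter could make when that flag is set (account id, database, table, and role ARN are placeholders; an existing optimiser would need the UpdateTableOptimizer operation instead):

import boto3

glue = boto3.client("glue")

# Enable Glue's managed compaction optimizer for one Iceberg table.
glue.create_table_optimizer(
    CatalogId="123456789012",   # placeholder AWS account id
    DatabaseName="analytics",   # placeholder Glue database
    TableName="events",         # placeholder Iceberg table
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::123456789012:role/GlueTableOptimizerRole",  # placeholder
        "enabled": True,
    },
)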

Describe alternatives you've considered

You can just run OPTIMIZE {{ this.schema }}.{{ this.identifier }} ... in your post_hook, yes, but on a full-refresh of a very large table (e.g. one requiring insert_by_period) this may fail due to a timeout or the Iceberg "not finished, please run compaction again" message. Regardless, I think it would be good to just let AWS handle it.
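For reference, that post_hook alternative might look roughly like this in a model config (using Athena's OPTIMIZE ... REWRITE DATA USING BIN_PACK statement; the exact quoting is illustrative):

{{config(
    materialized='incremental',
    table_type='iceberg',
    post_hook="OPTIMIZE {{ this.schema }}.{{ this.identifier }} REWRITE DATA USING BIN_PACK"
)}}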

Caveat: I haven't actually tried the automatic compaction feature, so I have no idea how it performs in practice. Maybe it just scans your entire table once a day and you get charged for 100 DPUs 😂.

Who will this benefit?

Anybody with large Iceberg datasets. I would expect quite a lot of overlap with users of insert_by_period.

Are you interested in contributing this feature?

Maybe, depending on how much work it would be.

Anything else?

#514 is somewhat related, in the realm of "table optimisation management".

@antonysouthworth-halter regarding this:

You can just run OPTIMIZE {{ this.schema }}.{{ this.identifier }} ... in your post_hook, yes, but on a full-refresh of a very large table (e.g. one requiring insert_by_period) this may fail due to a timeout or the Iceberg "not finished, please run compaction again" message. Regardless, I think it would be good to just let AWS handle it.

If you partition your table, I can share a post-hook that is ugly but does the job: it pretty much optimizes your table by batches of partition values, using a batch size < 100 to avoid the partition limitation issue in Athena.
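Not the actual hook mentioned above, just a rough sketch of the idea (macro name, string-typed partition values, and the sub-100 batch size are all illustrative assumptions):

{% macro optimize_by_partition_batches(relation, partition_column, batch_size=90) %}
  {# Illustrative only: run OPTIMIZE per batch of partition values so that each
     statement touches fewer than 100 partitions. #}
  {% if execute %}
    {% set values = run_query("select distinct " ~ partition_column ~ " from " ~ relation).columns[0].values() %}
    {% for batch in values | batch(batch_size) %}
      {% set in_list = "'" ~ batch | join("', '") ~ "'" %}
      {% do run_query("OPTIMIZE " ~ relation ~ " REWRITE DATA USING BIN_PACK WHERE " ~ partition_column ~ " IN (" ~ in_list ~ ")") %}
    {% endfor %}
  {% endif %}
{% endmacro %}

Something like that could then be invoked from a model's post_hook, e.g. post_hook="{{ optimize_by_partition_batches(this, 'event_date') }}".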

That said, what you describe is relevant and should be relatively easy to implement (rough sketch after the list):

  • we need a new method in impl.py that makes a boto3 call to add automatic compaction
  • we expose the method via a macro that can then be called from the different table materializations.
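A rough sketch of the macro side, assuming a hypothetical adapter method add_glue_automatic_compaction that wraps the boto3 call shown earlier and is exposed to Jinja (the method name and config keys are made up for illustration):

{% macro enable_glue_automatic_compaction(relation) %}
  {# Hypothetical: call the (not yet existing) adapter method when a model opts in. #}
  {% if config.get('use_glue_automatic_compaction', false) %}
    {% do adapter.add_glue_automatic_compaction(
      relation.schema,
      relation.identifier,
      config.get('glue_compaction_role_arn')
    ) %}
  {% endif %}
{% endmacro %}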

If you partition your table, I can share a post-hook that is ugly but does the job: it pretty much optimizes your table by batches of partition values, using a batch size < 100 to avoid the partition limitation issue in Athena.

😮 I would love to see it! I think it might also be helpful for others who stumble upon this ticket.

Also note that we now have a fix for the OPTIMIZE post_hook in dbt-athena, which retries when it hits the "optimize needs to be run again" error; cf. https://github.com/dbt-athena/dbt-athena/blob/34633a7f5679344852d1004991fc814ab385dadb/dbt/adapters/athena/impl.py#L1336
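For readers who don't want to follow the link, the behaviour is roughly of this shape (a simplified sketch, not the adapter's actual code; error matching and backoff values are assumptions):

import re
import time

RETRY_PATTERN = re.compile(r"please run compaction", re.IGNORECASE)

def run_optimize_with_retry(cursor, sql, max_retries=3, backoff_seconds=10):
    """Re-run OPTIMIZE while Athena reports that another compaction pass is needed."""
    for attempt in range(1, max_retries + 1):
        try:
            cursor.execute(sql)
            return
        except Exception as exc:  # the real adapter matches a more specific error
            if attempt < max_retries and RETRY_PATTERN.search(str(exc)):
                time.sleep(backoff_seconds)
                continue
            raise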