dbt-labs/dbt-core

[spike+] option to generate dbt_scd_id as an integer column instead of a string for performance improvements

graciegoheen opened this issue · 1 comment

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

When dbt generates your snapshot, one of the meta-fields it creates is dbt_scd_id - a unique key generated for each snapshotted record, used internally by dbt.

Currently, dbt_scd_id is a string because the hashing function used to generate it returns a string. For the timestamp strategy, dbt_scd_id is a hash of unique_key combined with updated_at (essentially a surrogate key).
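To make the current behavior concrete, here is a rough sketch of a string-valued surrogate key built from the two fields. It is an illustration only, not dbt's actual macro: the function name `scd_id_string`, the "|" separator, and the choice of MD5 are assumptions made for this example.

```python
import hashlib


def scd_id_string(unique_key, updated_at):
    """Return a 32-character hex digest built from the two key fields.

    Hypothetical helper for illustration; not part of dbt.
    """
    # Treat missing values as empty strings and join with a separator so that
    # ("ab", "c") and ("a", "bc") do not produce the same input to the hash.
    raw = "|".join("" if v is None else str(v) for v in (unique_key, updated_at))
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


print(scd_id_string(42, "2024-01-01 00:00:00"))
```

The result is always a string of hex characters, which is why the column ends up as a string type.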

Some folks want dbt_scd_id to instead be an integer for better performance.

If we swapped to an integer, we’d have to use a hashing function that outputs an integer instead of a string.

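  • integer hashes collide a lot more easily than string hashes, because an integer column typically holds fewer bits (for example, 64) than a hash stored as a string (for example, a 128-bit MD5 digest)
  • if we get a collision, we get unintended behavior, and it will fail silently!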
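To get a feel for the collision risk, the standard birthday-bound approximation can be sketched as below. This assumes the hash output is uniformly distributed; the function name `collision_probability` and the row count are picked for illustration only.

```python
import math


def collision_probability(n_rows, hash_bits):
    """Approximate chance of at least one collision among n_rows hashes.

    Uses the birthday-bound approximation p ≈ 1 - exp(-n(n-1) / 2^(bits+1)),
    which assumes uniformly distributed hash values.
    """
    exponent = -n_rows * (n_rows - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# Compare a few hash widths for a snapshot table with one billion rows.
for bits in (32, 64, 128):
    print(bits, collision_probability(1_000_000_000, bits))
```

Under these assumptions, a 32-bit integer is all but guaranteed to collide at a billion rows, a 64-bit integer lands at a few percent, and a 128-bit hash stays negligible.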

Because of the risk of collision, we wouldn't want to make this the default for all users.

Instead, what if we had a config that allowed you to control the hashing function used when generating dbt_scd_id?
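As a rough sketch of what such a config could select between, the example below maps a hash-function name to an implementation. Everything here is hypothetical: the `hash_function` parameter, the names `md5_hex` and `md5_int64`, and the truncation of MD5 to 64 bits are assumptions for illustration, not existing dbt options.

```python
import hashlib

# Hypothetical registry of hashing choices; the names are illustrative only.
HASHERS = {
    # Default-style behavior: 32-character hex string.
    "md5_hex": lambda raw: hashlib.md5(raw).hexdigest(),
    # Opt-in behavior: signed 64-bit integer from the first 8 digest bytes.
    "md5_int64": lambda raw: int.from_bytes(
        hashlib.md5(raw).digest()[:8], "big", signed=True
    ),
}


def make_scd_id(unique_key, updated_at, hash_function="md5_hex"):
    """Build a surrogate key using the configured (hypothetical) hash function."""
    if hash_function not in HASHERS:
        raise ValueError(f"unknown hash_function: {hash_function!r}")
    raw = f"{unique_key}|{updated_at}".encode("utf-8")
    return HASHERS[hash_function](raw)


print(make_scd_id(42, "2024-01-01"))               # string key
print(make_scd_id(42, "2024-01-01", "md5_int64"))  # integer key
```

Keeping the string hash as the default and making the integer variant opt-in matches the concern above: only users who accept the higher collision risk would switch.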

Describe alternatives you've considered

Creating a custom materialization to override the outputs from generate_surrogate_key