[spike+] option to generate dbt_scd_id as an integer column instead of a string for performance improvements
graciegoheen opened this issue · 1 comments
Is this your first time submitting a feature request?
- I have read the expectations for open source contributors
- I have searched the existing issues, and I could not find an existing issue for this feature
- I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
When dbt generates your snapshot, one of the meta-fields it creates is dbt_scd_id, a unique key generated for each snapshotted record and used internally by dbt.
Currently, dbt_scd_id is a string because of the hashing function used. For the timestamp strategy, dbt_scd_id is a hash of unique_key + updated_at (essentially a surrogate key).
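To make the current behavior concrete, here is a minimal Python sketch of how a string surrogate key could be built from unique_key and updated_at. It assumes an MD5 hash over the pipe-joined values; the function name scd_id is hypothetical, and the SQL dbt actually emits depends on the adapter.

```python
import hashlib


def scd_id(unique_key, updated_at):
    """Hash the key and timestamp into a hex string (illustrative only)."""
    # Treat NULL-like values as empty strings so the hash is always defined.
    parts = [
        "" if value is None else str(value)
        for value in (unique_key, updated_at)
    ]
    # Assumption: MD5 over the pipe-joined parts, returned as hex text.
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


row_id = scd_id(42, "2024-01-01 00:00:00")
print(type(row_id).__name__, len(row_id))  # str 32
```

The result is a 32-character hex string, which is why the column ends up typed as a string rather than a number.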
Some folks want dbt_scd_id to instead be an integer for better performance.
If we swapped to an integer, we’d have to use a hashing function that outputs an integer instead of a string.
- an integer hash typically has fewer bits than a string hash, so it collides a lot more easily
- if we get a collision, we get unintended behavior (and it will fail silently!)
Because of the risk of collision, we wouldn't want to make this the default for all users.
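To put a rough number on the collision risk, the birthday approximation can be sketched as follows. The bit widths (64 for a bigint-style integer, 128 for an MD5-style digest) and the row count are assumptions for illustration, not measurements from any real project.

```python
import math


def collision_probability(num_rows: int, hash_bits: int) -> float:
    """Birthday-bound estimate of at least one collision among num_rows hashes."""
    space = 2 ** hash_bits
    exponent = num_rows * (num_rows - 1) / (2 * space)
    # 1 - exp(-x), computed with expm1 to stay accurate for tiny x.
    return -math.expm1(-exponent)


rows = 1_000_000_000  # hypothetical snapshot table size
p_int64 = collision_probability(rows, 64)
p_md5 = collision_probability(rows, 128)
print(f"64-bit: {p_int64:.4f}, 128-bit: {p_md5:.3e}")
```

Under these assumptions, a billion snapshotted rows give a collision chance of a few percent with a 64-bit hash, while the 128-bit case is negligibly small.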
Instead, what if we had a config that allowed you to control the hashing function used when generating dbt_scd_id?
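One way to picture such a config is a lookup from an option name to a hashing function. The sketch below is purely illustrative: the option names md5_hex and md5_int64, the HASH_FUNCTIONS table, and the scd_id signature are all hypothetical and do not correspond to an existing dbt setting.

```python
import hashlib


def md5_hex(text: str) -> str:
    """String output: the full 128-bit MD5 digest as hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def md5_int64(text: str) -> int:
    """Integer output: the first 8 bytes of the MD5 digest as a signed 64-bit int."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# Hypothetical config values mapped to hashing functions.
HASH_FUNCTIONS = {"md5_hex": md5_hex, "md5_int64": md5_int64}


def scd_id(unique_key, updated_at, hash_function="md5_hex"):
    """Build the surrogate key with the configured hash (default stays a string)."""
    text = f"{unique_key}|{updated_at}"
    return HASH_FUNCTIONS[hash_function](text)


default_id = scd_id(42, "2024-01-01 00:00:00")
integer_id = scd_id(42, "2024-01-01 00:00:00", hash_function="md5_int64")
print(type(default_id).__name__, type(integer_id).__name__)  # str int
```

Keeping the string hash as the default means existing snapshots are unaffected, and only users who opt in take on the higher collision risk of the integer variant.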
Describe alternatives you've considered
Creating a custom materialization to override the outputs from generate_surrogate_key