[CT-2253] [Feature] Add length variable to `type_string` for redshift
Is this your first time submitting a feature request?
- I have read the expectations for open source contributors
- I have searched the existing issues, and I could not find an existing issue for this feature
- I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
Currently the type_string macro generates a string type across all warehouses. For every warehouse except Redshift this is generated with the maximum length (or with a type that takes no length argument at all); Redshift instead gets a fixed length of 256, which is reasonable in many cases, but sometimes a larger column is required.
This feature would add an optional length argument to the type_string macro. At a minimum it should be applied for Redshift, but it could also be added for other warehouses that support fixed-length varchars.
Describe alternatives you've considered
Overriding the macro, or being explicit about the type for Redshift only.
Who will this benefit?
Any Redshift users who build models across multiple warehouses but require string columns over 256 characters.
Are you interested in contributing this feature?
A bit beyond me I think
Anything else?
If this is done, it may be better to also have the Redshift default be 65535, given that all other warehouses give a max-length string column by default.
Code location (which it calls from): dbt-core/core/dbt/adapters/base/column.py, line 11 in 8019498
Redshift string datatypes: https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html#r_Character_types-text-and-bpchar-types
Thanks for opening this issue @rlh1994! Cross-database datatypes are both tricky and fun 🤩
TL;DR
Have you tried something like this:
cast(null as {{ api.Column.string_type(4096) }}) model,
instead of this:
cast(null as {{ snowplow_utils.type_string(4096) }}) model,
?
More detail below including a review of the available methods + some other ideas for you.
What we have today
We have two different groups of macros related to column types that are available in dbt-core today:
- api.Column.*_type() macros
- dbt.type_*() macros
api.Column.*_type() macros
Accept size/precision/scale parameters and return a data type with the size/precision/scale included. These are static methods within the Column class (Python). See here for some usage examples, and there is a small sketch below the pros and cons.
Pros
- These are useful for enforcing a specific size/precision/scale when casting data
Cons
- These are not suitable for direct comparison with column data types from the metadata in the information_schema.columns view that many databases support.
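For concreteness, here is a minimal sketch of using this group in a model. The model and column names are made up, and the two methods shown, string_type(size) and numeric_type(dtype, precision, scale), are the ones on the base Column class, so the exact rendering can vary by adapter:
select
    -- renders as something like character varying(1024) with the base implementation
    cast(country_name as {{ api.Column.string_type(1024) }}) as country_name,
    -- renders as something like numeric(28,6) with the base implementation
    cast(revenue as {{ api.Column.numeric_type('numeric', 28, 6) }}) as revenue
from {{ ref('raw_sales') }}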
dbt.type_*() macros
Accept no parameters and return the name of a data type without size, precision, or scale (a usage sketch follows the pros and cons below):
- dbt.type_bigint()
- dbt.type_boolean()
- dbt.type_float()
- dbt.type_int()
- dbt.type_numeric()
- dbt.type_string()
- dbt.type_timestamp()
Pros
- These may be useful when casting data (see con below)
- These may be suitable for direct comparison with column data types from the metadata in the information_schema.columns view that many databases support. It depends on the adapter implementation.
Cons
- These are not useful for enforcing a specific size/precision/scale when casting data
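For example (a sketch only; the model and column names are invented), casting with these macros leaves the length/precision up to whatever each adapter's implementation returns:
select
    -- no length argument; each adapter returns its own default string type
    cast(order_ref as {{ dbt.type_string() }}) as order_ref,
    -- no precision/scale argument available
    cast(amount as {{ dbt.type_numeric() }}) as amount
from {{ ref('raw_orders') }}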
What we could consider for the future
More questions than answers right now.
Should there be more than seven dbt.type_* macros?
- There's only one type_timestamp macro, but the SQL standard includes two different types (and Snowflake has three!).
- There are currently no macros for type_time, type_date, type_array, type_json, etc.
Should there be more than two api.Column.*_type() macros?
- Maybe default to the value of the corresponding dbt.type_* macro, but accept parameters for relevant types like numeric, string, array, etc.
Should the api.Column.*_type() macros be aliased (or moved) to the dbt or adapter namespaces to enable usage like {{ dbt.string_type(4096) }} or {{ adapter.string_type(4096) }}?
- My guess is no to moving, but still worth considering the trade-offs and consequences.
Ideas for your use case
Your mileage may vary for all of the ideas presented below. In fact, I'd expect you to tweak nearly all of them to fit your specific needs.
If you want an explicit text length of 1024 that should work across adapters, you could try string_type(), like this:
{{ api.Column.string_type(1024) }}
Alternatively, you could replace your type_string macros within datatypes.sql with this instead (if you want max_characters to be a required parameter):
{%- macro string_type(max_characters) -%}
{{ return(adapter.dispatch('string_type', 'snowplow_utils')(max_characters)) }}
{%- endmacro -%}
{% macro default__string_type(max_characters) %}
{{ api.Column.string_type(max_characters) }}
{% endmacro %}
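With that required-parameter version in place, a model could then call it like this (just a sketch; the column alias is a placeholder):
cast(null as {{ snowplow_utils.string_type(4096) }}) as my_column,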
or something like this (if you want max_characters to be optional and default to the max size for Redshift specifically):
{%- macro string_type(max_characters=none) -%}
{{ return(adapter.dispatch('string_type', 'snowplow_utils')(max_characters)) }}
{%- endmacro -%}
{% macro default__string_type(max_characters=none) %}
{%- if max_characters is none -%}
{{ dbt.type_string() }}
{% else %}
{{ api.Column.string_type(max_characters) }}
{% endif %}
{% endmacro %}
{% macro redshift__string_type(max_characters=none) %}
{{ api.Column.string_type(65535 if max_characters is none else max_characters) }}
{% endmacro %}
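With the optional-parameter version, both of these calls should work; the renderings noted in the comments follow from the macros above plus the base Column class, so your adapter may differ:
-- no argument: falls back to dbt.type_string() on the default implementation,
-- and to character varying(65535) on Redshift
cast(null as {{ snowplow_utils.string_type() }}) as short_text,
-- explicit length, passed through api.Column.string_type(4096)
cast(null as {{ snowplow_utils.string_type(4096) }}) as long_text,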
Thanks for the very detailed response @dbeatty10! I wasn't aware of the api.Column functions, so I think it should be easy enough for us to swap to that in the places we need it.
So far as I can tell, the only reason our version of type_string even exists was to support that; we're working to clean up our utils a bit (especially as we're going to require people to dispatch to it over core dbt soon to support some incremental optimisations), so I can see us removing it in a few versions!
As for the questions around larger support for types, as you said they're both tricky and fun, so I'll leave you to discuss that!
@dbeatty10 I won't reopen the issue as it's really more suited to the specific adapter, but I've discovered we will still need a macro for where 256 characters isn't suitable: api.Column.string_type doesn't seem to be overridden by at least BigQuery, and therefore generates an invalid query, which means we can't use it in cross-db models unfortunately.