dbt-labs/dbt-core

[CT-2253] [Feature] Add length variable to `type_string` for redshift

Closed this issue · 5 comments

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently the type_string macro generates a string type across all warehouses. For every warehouse except Redshift this is generated with the maximum length (or as a type that takes no length argument at all); on Redshift it is instead generated with a fixed maximum length of 256, which is reasonable for many cases, but sometimes a larger column is required.

This feature would add an optional length argument to the type_string macro, which at a minimum should be applied for Redshift, but could also be supported on other warehouses that allow fixed-length varchars.
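For illustration only (this argument doesn't exist today, so the signature is hypothetical), the requested usage might look something like:

-- hypothetical: would render varchar(1024) on Redshift instead of the default varchar(256)
cast(user_agent as {{ dbt.type_string(1024) }}) as user_agent,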

Describe alternatives you've considered

Overriding the macro, or being explicit about the type for Redshift only.

Who will this benefit?

Any Redshift users who also build models across multiple warehouses but require string columns over 256 characters.

Are you interested in contributing this feature?

A bit beyond me I think

Anything else?

If this is done, it may be better to also make the Redshift default 65535, given that all other warehouses give a max-length string column by default.

Code location:

{# string ------------------------------------------------- #}
{%- macro type_string() -%}
{{ return(adapter.dispatch('type_string', 'dbt')()) }}
{%- endmacro -%}
{% macro default__type_string() %}
{{ return(api.Column.translate_type("string")) }}
{% endmacro %}
-- This will return 'text' by default
-- On Postgres + Snowflake, that's equivalent to varchar (no size)
-- Redshift will treat that as varchar(256)

This calls translate_type, which pulls from this mapping in the Column class:

"STRING": "TEXT",

Redshift string datatypes: https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html#r_Character_types-text-and-bpchar-types

Thanks for opening this issue @rlh1994 ! Cross-database datatypes are both tricky and fun 🤩

TL;DR

Have you tried something like this:

cast(null as {{ api.Column.string_type(4096) }}) model,

instead of this:

cast(null as {{ snowplow_utils.type_string(4096) }}) model,

?

More detail below including a review of the available methods + some other ideas for you.


What we have today

We have two different groups of macros related to column types that are available in dbt-core today:

  1. api.Column.*_type() macros
  2. dbt.type_*() macros

api.Column.*_type() macros

Accept size/precision/scale parameters and return a data type with the size/precision/scale included.

These are static methods within the Column class (Python). See here for some usage examples.

Pros

  • These are useful for enforcing a specific size/precision/scale when casting data

Cons

  • These are not suitable for direct comparison with column data types from metadata in the information_schema.columns view that many databases support.
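For example, a rough sketch of usage in a model (the column and relation names are made up, the rendered types vary by adapter, and this assumes numeric_type takes dtype, precision, and scale):

select
    cast(order_note as {{ api.Column.string_type(4096) }}) as order_note,
    cast(amount as {{ api.Column.numeric_type("numeric", 28, 6) }}) as amount
from {{ ref('raw_orders') }}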

dbt.type_*() macros

Accept no parameters and return the name of a data type without size, precision, or scale.

Pros

  • These may be useful when casting data (see con below)
  • These may be suitable for direct comparison with column data types from metadata in the information_schema.columns view that many databases support. It depends on the adapter implementation.

Cons

  • These are not useful for enforcing a specific size/precision/scale when casting data
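For example, a rough sketch of the metadata-comparison use case (the table name is made up, and whether the rendered name actually matches depends on the adapter, as noted above):

select column_name
from information_schema.columns
where table_name = 'orders'
  and lower(data_type) = lower('{{ dbt.type_string() }}')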

What we could consider for the future

More questions than answers right now.

Should there be more than seven dbt.type_* macros?

  • There's only one type_timestamp macro, but the SQL standard includes two different types (and Snowflake has three!).
  • There are currently no macros for type_time, type_date, type_array, type_json, etc. (one possible shape is sketched below)
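As a sketch only (type_json doesn't exist today, and the default body would likely need per-adapter overrides since JSON type names differ widely), a new macro could follow the same dispatch pattern as type_string above:

{%- macro type_json() -%}
  {{ return(adapter.dispatch('type_json', 'dbt')()) }}
{%- endmacro -%}

{% macro default__type_json() %}
    {{ return(api.Column.translate_type("json")) }}
{% endmacro %}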

Should there be more than two api.Column.*_type() macros?

  • Maybe default to the value of the corresponding dbt.type_* macro, but accept parameters for relevant types like numeric, string, array, etc.

Should the api.Column.*_type() macros be aliased (or moved) to the dbt or adapter namespaces to enable usage like {{ dbt.string_type(4096) }} or {{ adapter.string_type(4096) }}?

  • My guess is no to moving, but still worth considering the trade-offs and consequences.

Ideas for your use case

Your mileage may vary for all of the ideas presented below. In fact, I'd expect you to tweak nearly all of them to fit your specific needs.

If you want an explicit text length of 1024 that should work across adapters, you could try string_type(), like this:

{{ api.Column.string_type(1024) }}
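The rendered type varies slightly by adapter; as a rough sketch of usage in a model:

-- on Redshift/Postgres this typically expands to something like character varying(1024)
cast(null as {{ api.Column.string_type(1024) }}) page_title,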

Alternatively, you could replace your type_string macros within datatypes.sql with this instead (if you want max_characters to be a required parameter):

{%- macro string_type(max_characters) -%}
  {{ return(adapter.dispatch('string_type', 'snowplow_utils')(max_characters)) }}
{%- endmacro -%}

{% macro default__string_type(max_characters) %}
    {{ api.Column.string_type(max_characters) }}
{% endmacro %}

or something like this (if you want max_characters to be optional and default to the max size for Redshift specifically):

{%- macro string_type(max_characters=none) -%}
  {{ return(adapter.dispatch('string_type', 'snowplow_utils')(max_characters)) }}
{%- endmacro -%}

{% macro default__string_type(max_characters=none) %}
    {%- if max_characters is none -%}
      {{ dbt.type_string() }}
    {% else %}
      {{ api.Column.string_type(max_characters) }}
    {% endif %}
{% endmacro %}

{% macro redshift__string_type(max_characters=none) %}
    {# the dispatching macro always passes a value, so fall back to Redshift's max here #}
    {{ api.Column.string_type(max_characters if max_characters is not none else 65535) }}
{% endmacro %}
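A rough usage sketch (the column aliases are made up): the first call falls back to 65535 on Redshift and to dbt.type_string() elsewhere, while the second forces an explicit length everywhere:

cast(null as {{ snowplow_utils.string_type() }}) short_label,
cast(null as {{ snowplow_utils.string_type(4096) }}) page_title,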

Thanks for the very detailed response @dbeatty10! I wasn't aware of the api.Column functions so I think it should be easy enough for us to swap out to that in places we need it.

So far as I can tell, the only reason our version of type_string even exists was to support that; we're working to clean up our utils a bit (especially as we're going to require people to dispatch to it over core dbt soon to support some incremental optimisations), so I can see us removing it in a few versions!

As for the questions around larger support for types, as you said they're both tricky and fun, so I'll leave you to discuss that!

Sounds great @rlh1994 !

@dbeatty10 I won't reopen the issue as it's really more suited to the specific adapter, but I've discovered we will still need a macro for cases where 256 characters isn't suitable, as api.Column.string_type doesn't seem to be overridden by at least BigQuery and therefore generates an invalid query, which means we can't use it in cross-db models unfortunately.

Will also be interested to see the macro(s) you land on!
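For anyone hitting the same limitation, one possible workaround sketch (the safe_string_type name and the bigquery__ override are hypothetical; the BigQuery branch simply drops the length and falls back to the plain type):

{%- macro safe_string_type(max_characters) -%}
  {{ return(adapter.dispatch('safe_string_type', 'snowplow_utils')(max_characters)) }}
{%- endmacro -%}

{% macro default__safe_string_type(max_characters) %}
    {{ api.Column.string_type(max_characters) }}
{% endmacro %}

{% macro bigquery__safe_string_type(max_characters) %}
    {{ dbt.type_string() }}
{% endmacro %}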

Thanks for letting me know about dbt-bigquery -- would you be willing to open a bug report here, by any chance?