[CT-2253] [Feature] Add length variable to `type_string` for redshift
Is this your first time submitting a feature request?
- I have read the expectations for open source contributors
- I have searched the existing issues, and I could not find an existing issue for this feature
- I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
Currently the type_string macro generates a string type across all warehouses. For every warehouse except Redshift this is generated with the maximum length (or with a type that takes no length argument at all); Redshift instead gets a fixed length of 256, which is reasonable in many cases, but sometimes a larger column is required.
This feature would add an optional length argument to the type_string macro. At a minimum it should be applied for Redshift, but it could also be added for other warehouses that support fixed-length varchars.
Describe alternatives you've considered
Overriding the macro, or being explicit about the type for Redshift only.
Who will this benefit?
Any Redshift users who build models across multiple warehouses but require string columns over 256 characters.
Are you interested in contributing this feature?
A bit beyond me I think
Anything else?
If this is done, it may be better to also have the Redshift default be 65535, given that all other warehouses give a max-length string column by default.
Code location (which it calls from): dbt-core/core/dbt/adapters/base/column.py, line 11 in 8019498
Redshift string datatypes: https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html#r_Character_types-text-and-bpchar-types
Thanks for opening this issue @rlh1994! Cross-database datatypes are both tricky and fun 🤩
TL;DR
Have you tried something like this:
cast(null as {{ api.Column.string_type(4096) }}) model,
instead of this:
cast(null as {{ snowplow_utils.type_string(4096) }}) model,
?
More detail below including a review of the available methods + some other ideas for you.
What we have today
We have two different groups of macros related to column types that are available in dbt-core today:
- api.Column.*_type() macros
- dbt.type_*() macros
api.Column.*_type() macros
Accept size/precision/scale parameters and return a data type with the size/precision/scale included. These are static methods within the Column class (Python). See here for some usage examples, and there is a small sketch below the pros and cons.
Pros
- These are useful for enforcing a specific size/precision/scale when casting data
Cons
- These are not suitable for direct comparison with column data types from the metadata in the information_schema.columns view that many databases support.
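For concreteness, here is a minimal sketch of using this group in a model. The model and column names are made up, and the two methods shown, string_type(size) and numeric_type(dtype, precision, scale), are the ones on the base Column class, so the exact rendering can vary by adapter:
select
    -- renders as something like character varying(1024) with the base implementation
    cast(country_name as {{ api.Column.string_type(1024) }}) as country_name,
    -- renders as something like numeric(28,6) with the base implementation
    cast(revenue as {{ api.Column.numeric_type('numeric', 28, 6) }}) as revenue
from {{ ref('raw_sales') }}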
dbt.type_*() macros
Accept no parameters and return the name of a data type without size, precision, or scale (a usage sketch follows the pros and cons below):
- dbt.type_bigint()
- dbt.type_boolean()
- dbt.type_float()
- dbt.type_int()
- dbt.type_numeric()
- dbt.type_string()
- dbt.type_timestamp()
Pros
- These may be useful when casting data (see con below)
- These may be suitable for direct comparison with column data types from the metadata in the information_schema.columns view that many databases support. It depends on the adapter implementation.
Cons
- These are not useful for enforcing a specific size/precision/scale when casting data
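For example (a sketch only; the model and column names are invented), casting with these macros leaves the length/precision up to whatever each adapter's implementation returns:
select
    -- no length argument; each adapter returns its own default string type
    cast(order_ref as {{ dbt.type_string() }}) as order_ref,
    -- no precision/scale argument available
    cast(amount as {{ dbt.type_numeric() }}) as amount
from {{ ref('raw_orders') }}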
What we could consider for the future
More questions than answers right now.
Should there be more than seven dbt.type_* macros?
- There's only one type_timestamp macro, but the SQL standard includes two different types (and Snowflake has three!).
- There are currently no macros for type_time, type_date, type_array, type_json, etc.
Should there be more than two api.Column.*_type() macros?
- Maybe default to the value of the corresponding dbt.type_* macro, but accept parameters for relevant types like numeric, string, array, etc.
Should the api.Column.*_type() macros be aliased (or moved) to the dbt or adapter namespaces to enable usage like {{ dbt.string_type(4096) }} or {{ adapter.string_type(4096) }}?
- My guess is no to moving, but still worth considering the trade-offs and consequences.
Ideas for your use case
Your mileage may vary for all of the ideas presented below. In fact, I'd expect you to tweak nearly all of them to fit your specific needs.
If you want an explicit text length of 1024 that should work across adapters, you could try string_type(), like this:
{{ api.Column.string_type(1024) }}
Alternatively, you could replace your type_string macros within datatypes.sql with this instead (if you want max_characters to be a required parameter):
{%- macro string_type(max_characters) -%}
{{ return(adapter.dispatch('string_type', 'snowplow_utils')(max_characters)) }}
{%- endmacro -%}
{% macro default__string_type(max_characters) %}
{{ api.Column.string_type(max_characters) }}
{% endmacro %}
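With that required-parameter version in place, a model could then call it like this (just a sketch; the column alias is a placeholder):
cast(null as {{ snowplow_utils.string_type(4096) }}) as my_column,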
or something like this (if you want max_characters to be optional and default to the max size for Redshift specifically):
{%- macro string_type(max_characters=none) -%}
{{ return(adapter.dispatch('string_type', 'snowplow_utils')(max_characters)) }}
{%- endmacro -%}
{% macro default__string_type(max_characters=none) %}
{%- if max_characters is none -%}
{{ dbt.type_string() }}
{% else %}
{{ api.Column.string_type(max_characters) }}
{% endif %}
{% endmacro %}
{% macro redshift__string_type(max_characters=none) %}
{{ api.Column.string_type(65535 if max_characters is none else max_characters) }}
{% endmacro %}
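With the optional-parameter version, both of these calls should work; the renderings noted in the comments follow from the macros above plus the base Column class, so your adapter may differ:
-- no argument: falls back to dbt.type_string() on the default implementation,
-- and to character varying(65535) on Redshift
cast(null as {{ snowplow_utils.string_type() }}) as short_text,
-- explicit length, passed through api.Column.string_type(4096)
cast(null as {{ snowplow_utils.string_type(4096) }}) as long_text,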
Thanks for the very detailed response @dbeatty10! I wasn't aware of the api.Column functions, so I think it should be easy enough for us to swap to that in the places we need it.
So far as I can tell, the only reason our version of type_string even exists was to support that; we're working to clean up our utils a bit (especially as we're going to require people to dispatch to it over core dbt soon to support some incremental optimisations), so I can see us removing it in a few versions!
As for the questions around larger support for types, as you said they're both tricky and fun, so I'll leave you to discuss that!
@dbeatty10 I won't reopen the issue as it's really more suited to the specific adapter, but I've discovered we will still need a macro for where 256 characters isn't suitable: api.Column.string_type doesn't seem to be overridden by at least BigQuery, and therefore generates an invalid query, which means we can't use it in cross-db models unfortunately.