Teradata/dbt-teradata

Volatile Tables

DEVi1wing opened this issue · 4 comments

Describe the feature

Volatile tables are a great asset to Teradata since they can be an efficient way to break queries into intermediate steps while still benefiting from things like indices and statistics that subqueries do not have. We would love to be able to define a table as volatile with indices and stat collections in dbt SQL models and then allow the system to drop them once the session ends.

Describe alternatives you've considered

Creating separate tables with indices on them. We cannot drop them through DBT, however, when the work is complete and they clutter up the database.

Additional context

Volatile tables will afford significant performance enhancements when working with large numbers of rows.

Who will this benefit?

What kind of use case will this feature be useful for? Please be specific and provide examples, this will help us prioritize properly.
This will be useful for anyone leveraging dbt with Teradata. One of our current barriers to adoption is we can create procedures that allow for volatile tables and are very performant. This dbt implementation, however, allows us to achieve the same result but it takes much longer to process or it leaves intermediate tables behind that clutter up the database and need to be deleted outside of dbt.

Are you interested in contributing this feature?

Let us know if you want to write some code, and how we can help.

@DEVi1wing , thank you for the feature request. I've discussed adding support for volatile tables with several users of dbt-teradata. Technically speaking, volatile tables are supported. The following config should produce a volatile table:

{{
  config(
      materialized="table",
      table_kind="VOLATILE"
  )
}}

The problem might be with how dbt manages sessions. As far as I can tell, dbt creates/closes sessions liberally. Even if you run with a single thread, there will be many sessions involved.

Does your data indicate that a single-threaded dbt run with volatile tables will be more performant than many threads but no volatile tables?

@adamtwo

The config you provided is helpful, but the behavior is exactly what you alluded to. The table is created, the session ends, and the table is gone before we are able to reference it in subsequent SQL files. Is there a way to force the session to stay open so it does not evaporate? Is there a way to force DBT to potentially create the volatile table in the same SQL file as the file table will be generated?

As for your question, yes, we have taken to using volatile tables as a standard practice in our development as they have been much more performant in many scenarios. We can always work around it by creating standard tables and dropping indices on them, but we still have the cleanup challenge afterwards.

We may attempt to cache the connection in connections.py and reuse it for subsequent queries instead of creating a new one. This has a chance to work only if you use a single thread. We would need to talk to the dbt team as I'm not sure what it would do to error handling.

I understand that volatile tables are better than materialized ones. dbt gives you other options like ephemeral materializations and multi-threaded dag processing. So the question is the performance of volatile tables is better than parallel dag processing + appropriate table/view/ephemeral/incremental materializations.

Before we start exploring a solution that stretches the dbt interface, could you please share you dag (could be the one generated by dbt docs generate) and describe the flow and why view/ephemeral/incremental materializations are not sufficient options?

One example would be if you are using intermediate models which are scheduled at different times but are then added to target model (incremental model). Here it would be beneficial for performance to create multiple volatile tables with indexes but without having to write the data to tables and then cleaning them up.

An example

intermediate_source1 --> intermediate_source1_transform --> my_target_model (incremental) [run with tag:source1]
intermediate_source2 --> intermediate_source2_transform --> my_target_model (incremental) [run with tag:source2]

where the two sources are run 5 hours apart but need to be available as soon as possible for analysis. At the same time it would be nice to not have cleanup post-hooks on the downstream models. This is why volatile materialization would be nice.