microsoft/dbt-synapse

Incremental table creation has unneccessary indexes, write temp table as a HEAP

alittlesliceoftom opened this issue · 1 comments

When you create an incremental table, this is generally a 2 step process:

  1. Create the initial table as a table.
  2. Create the incremental updates as new row inserts (possibly with a delete afterwards).

For both of these the adapter breaks this up into a few steps:

  1. Build a view for what's going on
  2. Create a temp table (not in tempdb)
  3. Move that table to be the final table/ insert from that table into the final table
  4. Clean up.

For the incremental steps, the temp table is created with the index and distribution settings of the final target table.

The best practice advice for transient data is to use a HEAP index. We should test if using this for the intermediate table would make the process faster. As currently there's a slow store to CCI (possibly ordered), but then that's ready and slowly stored to the second CCI.

I think this needs testing, but will potentially be a large speed up for users:

Example step that needs change:

  CREATE TABLE "database"."schema"."users_table_name__dbt_tmp"
    WITH(
      DISTRIBUTION = HASH(HashColumn),
      CLUSTERED COLUMNSTORE INDEX ORDER(OrderColumn)
      )
    AS (SELECT * FROM [schema].[users_table_name__dbt_tmp_temp_view])

Should be:

  CREATE TABLE "database"."schema"."users_table_name__dbt_tmp"
    WITH(
      DISTRIBUTION = /*Unsure here - could take the dist or do RR, RR can load the DMS*/
      HEAP
      )
    AS (SELECT * FROM [schema].[users_table_name__dbt_tmp_temp_view])

I'd love to understand the necessity for a temp table at all a bit better, why is this needed vs pulling straight from the view.