microsoft/dbt-synapse

Configure index and dist for seed files

BerndSchrooten opened this issue · 2 comments

I'm looking to configure tables created from seed files with a HEAP index and REPLICATE distribution, adding this to the seed config does not seem to work like it does for models.

version: 2

seeds:
  - name: country_codes
    config:
      index: HEAP
      dist: REPLICATE

The index and distribution of the resulting table can be retrieved using following SQL queries:

SELECT i.name AS index_name  
    ,i.type_desc  
FROM sys.indexes AS i  
INNER JOIN sys.data_spaces AS ds ON i.data_space_id = ds.data_space_id  
WHERE is_hypothetical = 0 AND i.index_id <> 0   
AND i.object_id = OBJECT_ID('main.country_codes');  
GO

SELECT 
       OBJECT_SCHEMA_NAME( object_id ) schemaName, 
       OBJECT_NAME( object_id ) tableName,
       *
FROM sys.pdw_table_distribution_properties
WHERE OBJECT_NAME( object_id ) = 'country_codes'

Outcome
The table is created with a CLUSTERED COLUMNSTORE index and ROUND_ROBIN distribution which are the default

Expected outcome
The created table has a HEAP index and REPLICATE distribution

A workaround is to treat the seeded table as a staging table and recreate a new table using a model with the desired configuration.

solutioin: override the dbt-core's global_project/macros/materializations/seeds/helpers.sql's default__create_csv_table, synapse__create_csv_table. this new version should fetch specified configuration (and use sensible defaults) similar to how synapse__create_table_as works.
helpers.sql
default https://github.com/dbt-labs/dbt-core/blob/0721f2c1b70eca233326cdd7d3de755924415b20/core/dbt/include/global_project/macros/materializations/seeds/helpers.sql#L6

PR #138

You can now create seed tables with different distribution and index strategy by providing required confiuration in dbt_project.yml file. The default choice is REPLICATE disttribution and HEAP (no indexing). If you want to override this configuration, the following sample should help.

seeds:
  jaffle_shop:
    index: HEAP
    dist: ROUND_ROBIN
    raw_customers:
      index: HEAP
      dist: REPLICATE
    raw_payments:
      dist: HASH(payment_method)
      index: CLUSTERED INDEX(id,order_id)  

Create a new context "seeds:" at the root followed by project name and seed name. In this case the project name is jaffle_shop and seeds are raw_customers and raw_payments. Provide index and distribution values using index and dist keys. Use replicate, round_robin, hash({column name}) as a value. Example: dist: replicate. The raw_customers seed table will be replicated a table. For hash distribution, the user need to provide the vaule HASH(payment_method). Example: dist: hash(payment_method)

To specific index, index as a key and CLUSTERED INDEX({Column1, Column2}), HEAP, CLUSTERED COLUMNSTORE INDEX as a value. Example: index: HEAP. The raw_customers seed table will use heap index strategy. For clustered index, the user need to provide one or more columns to create clustered index on. Example: index: CLUSTERED INDEX(id,order_id). The default value of index and distribution can also be set for all seeds under project name.