Configure index and dist for seed files
BerndSchrooten opened this issue · 2 comments
I'm looking to configure tables created from seed files with a HEAP index and REPLICATE distribution, adding this to the seed config does not seem to work like it does for models.
version: 2
seeds:
- name: country_codes
config:
index: HEAP
dist: REPLICATE
The index and distribution of the resulting table can be retrieved using following SQL queries:
SELECT i.name AS index_name
,i.type_desc
FROM sys.indexes AS i
INNER JOIN sys.data_spaces AS ds ON i.data_space_id = ds.data_space_id
WHERE is_hypothetical = 0 AND i.index_id <> 0
AND i.object_id = OBJECT_ID('main.country_codes');
GO
SELECT
OBJECT_SCHEMA_NAME( object_id ) schemaName,
OBJECT_NAME( object_id ) tableName,
*
FROM sys.pdw_table_distribution_properties
WHERE OBJECT_NAME( object_id ) = 'country_codes'
Outcome
The table is created with a CLUSTERED COLUMNSTORE index and ROUND_ROBIN distribution which are the default
Expected outcome
The created table has a HEAP index and REPLICATE distribution
A workaround is to treat the seeded table as a staging table and recreate a new table using a model with the desired configuration.
solutioin: override the dbt-core's global_project/macros/materializations/seeds/helpers.sql
's default__create_csv_table
, synapse__create_csv_table
. this new version should fetch specified configuration (and use sensible defaults) similar to how synapse__create_table_as
works.
helpers.sql
default https://github.com/dbt-labs/dbt-core/blob/0721f2c1b70eca233326cdd7d3de755924415b20/core/dbt/include/global_project/macros/materializations/seeds/helpers.sql#L6
PR #138
You can now create seed tables with different distribution and index strategy by providing required confiuration in dbt_project.yml file. The default choice is REPLICATE disttribution and HEAP (no indexing). If you want to override this configuration, the following sample should help.
seeds:
jaffle_shop:
index: HEAP
dist: ROUND_ROBIN
raw_customers:
index: HEAP
dist: REPLICATE
raw_payments:
dist: HASH(payment_method)
index: CLUSTERED INDEX(id,order_id)
Create a new context "seeds:" at the root followed by project name and seed name. In this case the project name is jaffle_shop and seeds are raw_customers and raw_payments. Provide index and distribution values using index and dist keys. Use replicate, round_robin, hash({column name}) as a value. Example: dist: replicate. The raw_customers seed table will be replicated a table. For hash distribution, the user need to provide the vaule HASH(payment_method). Example: dist: hash(payment_method)
To specific index, index as a key and CLUSTERED INDEX({Column1, Column2}), HEAP, CLUSTERED COLUMNSTORE INDEX as a value. Example: index: HEAP. The raw_customers seed table will use heap index strategy. For clustered index, the user need to provide one or more columns to create clustered index on. Example: index: CLUSTERED INDEX(id,order_id). The default value of index and distribution can also be set for all seeds under project name.