dbt-labs/dbt_metrics

[SEMANTIC-106] [Feature] Swap from subquery to CTE for base query

Closed this issue · 2 comments

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

When looking through the compiled code of metrics.calculate(), the base query is a subquery nested under the aggregate cte. Since this uses a combination of subqueries and CTEs it's hard to look through the code and see exactly what's going on.

I'd like to propose that we move the subquery into it's own CTE, then call that in the aggregate CTE.

Current state:

with calendar as (

    select 
        * 
    from analytics.dbt_bregenold.dbt_metrics_default_calendar
     
)



, total_order_item_cost__aggregate as (
    select
        date_month,
        ship_mode,
        sum(property_to_aggregate) as total_order_item_cost,
        boolor_agg(metric_date_day is not null) as has_data
        

    from (

    
        select 
        
            cast(base_model.ship_date as date) as metric_date_day, -- timestamp field
            calendar_table.date_month as date_month,
            
            calendar_table.date_day as window_filter_date,
            

            
                base_model.ship_mode,
                (supplier_cost) as property_to_aggregate

        from analytics.dbt_bregenold.fct_order_items base_model 
        left join analytics.dbt_bregenold.dbt_metrics_default_calendar calendar_table

        
            on cast(base_model.ship_date as date) = calendar_table.date_day
        

        where 1=1
    ) as base_query

    where 1=1

    

    
        group by
        1, 2



)

Proposed state:

with calendar as (

    select 
        * 
    from analytics.dbt_bregenold.dbt_metrics_default_calendar
     
)


, base_query as (
    

    
        select 
        
            cast(base_model.ship_date as date) as metric_date_day, -- timestamp field
            calendar_table.date_month as date_month,
            
            calendar_table.date_day as window_filter_date,
            

            
                base_model.ship_mode,
                (supplier_cost) as property_to_aggregate

        from analytics.dbt_bregenold.fct_order_items base_model 
        left join analytics.dbt_bregenold.dbt_metrics_default_calendar calendar_table

        
            on cast(base_model.ship_date as date) = calendar_table.date_day
        

        where 1=1
)

, total_order_item_cost__aggregate as (
    select
        date_month,
        ship_mode,
        sum(property_to_aggregate) as total_order_item_cost,
        boolor_agg(metric_date_day is not null) as has_data
        

    from base_query

    where 1=1

    

    
        group by
        1, 2



)

Describe alternatives you've considered

It works as it, but this change will make it easier to read.

Who will this benefit?

Anyone who looks through the compiled code for metrics.

Are you interested in contributing this feature?

Yes, I'm happy to submit this change.

Anything else?

No response

I initially thought this was straightforward, but it looks like if you call multiple metrics in metrics.calculate the base query is different for each (the property_to_aggregate field changes). This means that this can't be a single cte named base_query, but could be named after the metric similar to the aggregate table.

It would look like this instead:

{{metric_dictionary.name}}__base_query as (
    select ....
)

👋 @bennieregenold7 !

We actually experimented around with this during our initial re-write of the metrics package for v0.3.0 and found that this query structure had significant performance savings over CTE's! Now granted at that time we didn't experiment with exactly the format that you suggested but it provided enough of a performance increase that we decided to go with that format.

That being said we've also got an open ticket for #114 that is somewhat related to this and multiple metrics. Additionally we don't have an open issue for it yet but we're planning some foundational changes to the semantic layer that will require a rewrite of our default structure.

All that being said, I think we'll be punting on this and potentially wrap it into a broader base of work around making the sql queries simpler/faster.