dbt-labs/dbt_metrics

[SEMANTIC-88] [Bug] window with time_grain=day extends beyond current date

Closed this issue · 15 comments

Is this a new bug in dbt_metrics?

  • I believe this is a new bug in dbt_metrics
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Using a rolling window produces output that extends past the current date into the future.

The following:

  - name: weekly_active_per_day
    label: rolling weekly active users
    description: "The number of active users in the last 7 days"
    model: ref("fct_sign_ins")
    timestamp: dim_date_day
    time_grains: [day]
    calculation_method: count_distinct
    expression: dim_user_id
    window:
      count: 7
      period: day

with the current date being 2022-11-18, this query will result in the following output:
Screenshot 2022-11-18 at 16 58 52

As can be seen, the query extends 7 days into the future.

Expected Behavior

The max date in the output should be the current date.

Steps To Reproduce

{% set my_metric_yml -%}
{% raw %}

metrics:
  - name: weekly_active_per_day
    label: rolling weekly active users
    description: "The number of active users in the last 7 days"
    model: ref("fct_sign_ins")
    timestamp: dim_date_day
    time_grains: [day]
    calculation_method: count_distinct
    expression: dim_user_id
    window:
      count: 7
      period: day

{% endraw %}
{%- endset %}


select * 
from {{ metrics.develop(
        develop_yml=my_metric_yml,
        metric_list=['weekly_active_per_day'],
        grain='day'
        )
    }}

Relevant log output

No response

Environment

- dbt-adapter & version: dbt-snowflake==1.3.0
- dbt_metrics version: v1.3.1

Which database adapter are you using with dbt?

snowflake

Additional Context

The reason for this bug is probably in genbase_query.sql? The date in the base_model is defined as being relative to calendar_table.date_day instead of the reverse.

Screenshot 2022-11-18 at 17 03 06

Hi @JoeryV ! You have powers of prophecy because this was actually identified in a previous issue #181 and fixed in #183. This will go out by EOD in the new 1.3.2 release.

The fix for this was changing the >= to just > in the first join, so as not to include the extra days looking back. I haven't tested this out on negative numbers though - would you be able to run on main and see if that resolves your problem?

Hi @callum-mcdata,

Thanks for the quick reply!

I think #181 was a separate issue from the one above. The issue here is that the window range is being extended into the future, whereas #181 is about an extra day of history being considered because of the inclusive >=.
(The negative-number bit was a mistake in the code above, which I have since edited.)

Glancing over the code from #183, I don't think the above issue has been solved yet.

Unfortunately, I cannot test it at the moment, as we are using the dbt-snowflake Docker image, which is currently stuck at dbt-core version 1.3.0, and that in turn seems to be incompatible with dbt-metrics v1.3.2.

Screenshot 2022-11-21 at 12 22 16

@JoeryV do you mind testing this again? The issue with 1.3.2 was around the Hub not picking up the newest version but we fixed that this morning!

I am a bit curious to know how future dates could be picked up if the number wasn't defined as a negative value. I'm abstracting a bit but the code should compile down to something like the following, which would give you 7 previous days of data up to the "current_date" of whatever the row was.

left join calendar_table
      on 2022-11-18 > 2022-11-11
      and 2022-11-18 <= 2022-11-18

So I'm not entirely sure how it would show future values if the value was set to 7 and not -7

@callum-mcdata First off, thank you for fixing the hub issue! 😄

I tried it with the new version and the issue persists. The reason for this issue is the following: calendar_table has dates into the future. That results in the following query also being considered:

left join calendar_table
      on 2022-11-18 > 2022-11-17
      and 2022-11-18 <= 2022-11-24

and this evaluates to TRUE
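The effect of that join condition can be sketched outside SQL. Below is a minimal Python model of the compiled ON clause (the function name and test dates are illustrative, not from the package); it shows that when the calendar spine extends into the future, a sign-in on 2022-11-18 matches every calendar row up through 2022-11-24:

```python
from datetime import date, timedelta

def matching_calendar_days(sign_in, calendar_days, window=7):
    """Return the calendar days whose lookback window contains the sign-in.

    Mirrors the join condition from the compiled SQL:
        sign_in > calendar_day - window AND sign_in <= calendar_day
    """
    return [d for d in calendar_days
            if sign_in > d - timedelta(days=window) and sign_in <= d]

# A calendar spine that, like the default calendar, extends past "today".
calendar = [date(2022, 11, 10) + timedelta(days=i) for i in range(20)]
matches = matching_calendar_days(date(2022, 11, 18), calendar)
print(matches[0], matches[-1])  # 2022-11-18 2022-11-24
```

Because the calendar (not the fact table) drives `date_day`, every future calendar row within 7 days of the sign-in satisfies the condition, which is exactly why future dates appear in the output.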

I'm not sure what is causing it. Initially I thought switching the <= comparison to >= would fix it, but that doesn't resolve the issue, as it pulls the data to an earlier date. Maybe the combination of the has_data column and the gen_filters macro is involved?

Can you provide the SQL of the first two CTEs (aggregate & base)? We can take a look at the date joins and see what might be going on there that would be throwing future dates into your dataset.

Side note: I'm not sure, but from this comment ("pulls the data to an earlier date") it sort of sounds like you aren't looking for a cumulative window metric. Just to clarify 100%: you are looking for a metric that is aggregated over a range of 7 days, correct?

Taking the last point first: I see how my wording was confusing, but the window metric is intended. What I meant is the following.

Running the query today (using <= ) will result in the following output:
Screenshot 2022-11-23 at 09 37 26

Running the query with >= in gen_calendar_table instead will result in the following output:
Screenshot 2022-11-23 at 09 39 22

i.e. "pulls the data to an earlier date"

To the first point, this is the entire compiled sql for my develop metric:

select * 
from -- depends on: analytics.dbt_jdevos.dbt_metrics_default_calendar
    (
with calendar as (
    select 
        * 
    from analytics.dbt_jdevos.dbt_metrics_default_calendar     
), 

weekly_active_users_per_day__aggregate as (
    select
        date_day,
        count(distinct property_to_aggregate)       as weekly_active_users_per_day,
        boolor_agg(metric_date_day is not null)     as has_data
    from (  
        select 
            cast(base_model.dim_date_day as date)   as metric_date_day, -- timestamp field
            calendar_table.date_day                 as date_day,
            calendar_table.date_day                 as window_filter_date,
            (dim_user_id)                           as property_to_aggregate
        from analytics.dbt_jdevos.fct_sign_ins base_model 
        left join analytics.dbt_jdevos.dbt_metrics_default_calendar calendar_table
            on cast(base_model.dim_date_day as date) > dateadd(day, -7, calendar_table.date_day)
            and cast(base_model.dim_date_day as date) <= calendar_table.date_day
        where 1=1
    ) as base_query
    where 1=1    
        and date_day = window_filter_date
    group by
        1
), 

weekly_active_users_per_day__spine_time as (
    select
        calendar.date_day
    from calendar  
    group by
        1
), 

weekly_active_users_per_day__final as (  
    select
        parent_metric_cte.date_day,
        coalesce(weekly_active_users_per_day, 0) as weekly_active_users_per_day
    from weekly_active_users_per_day__spine_time as parent_metric_cte
    left outer join weekly_active_users_per_day__aggregate
        using (
            date_day
        )
    where (
        parent_metric_cte.date_day >= (
            select 
                min(case when has_data then date_day end) 
                from weekly_active_users_per_day__aggregate
        )
        and parent_metric_cte.date_day <= (
            select 
                max(case when has_data then date_day end) 
                from weekly_active_users_per_day__aggregate
        )
    )      
), 

secondary_calculations as (
    select 
        *
    from weekly_active_users_per_day__final
)

select * 
from weekly_active_users_per_day__final
order by
    1 desc
) metric_subq

This one is tricky - I am unable to recreate it on my local. Can you return the results of weekly_active_users_per_day__aggregate and see if they look correct? I cannot think of a way that the result set could get returned like the screenshot you provided. That looks like a running cumulative calculation even beyond the range of 7 days, which seems impossible given the logic you provided.

Hi @callum-mcdata,

I am not entirely sure what you mean by weekly_active_users_per_day__aggregate, but I'm guessing you mean to rerun and verify the results? I have done so below.

Calling the metric:
Screenshot 2022-11-29 at 21 14 31

Inspecting the data:
Screenshot 2022-11-29 at 21 18 03

The date range is still being extended into the future.

@JoeryV sorry I wasn't clear - can you return the results of just the CTE weekly_active_users_per_day__aggregate in the compiled SQL?

Additionally did you update your package to 1.3.2 or are you still on 1.3.1?

Ah, sorry I didn't catch that. See the compiled query below:

weekly_active_per_day__aggregate as (
    select
        date_day,
        count(distinct property_to_aggregate) as weekly_active_per_day,
        boolor_agg(metric_date_day is not null) as has_data
    from (
        select 
            cast(base_model.dim_date_day as date) as metric_date_day, -- timestamp field
            calendar_table.date_day as date_day,
            calendar_table.date_day as window_filter_date,
            (dim_user_id) as property_to_aggregate
        from analytics.dbt_jdevos.fct_sign_ins base_model 
        left join analytics.dbt_jdevos.dbt_metrics_default_calendar calendar_table
            on cast(base_model.dim_date_day as date) > dateadd(day, -7, calendar_table.date_day)
                and cast(base_model.dim_date_day as date) <= calendar_table.date_day
        where 1=1    
    ) as base_query
    where 1=1
    and date_day = window_filter_date    
    group by 1
)

We are on the 1.3.2 version of dbt_metrics.

AHHH - I think I finally understand what you mean and where this is coming from! It's not that the data is wrong, but rather that it represents dates in the future. Got it 🤦. Not sure why it took me this long to understand what exactly the issue was.

All righty, so the reason we're representing dates in the future is that, with a 7-day lookback, the transaction on 2022-01-01 is included in all lookbacks up until 2022-01-08. That's how we structure the data in the join.

This is because metrics have no concept of current_date - there are datasets where representing data in the future is perfectly valid (think forecasts).

What I suspect you want is to use start_date in the calculate macro to limit the date range so that it doesn't display dates in the future. Right now we don't support date functions for this (#27), but go add comments there if that is the correct issue!
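As a stopgap, a hard-coded literal works today. A sketch of that workaround, assuming the calculate macro's start_date/end_date parameters accept literal date strings (extending them to date functions is what #27 asks for):

```sql
select *
from {{ metrics.calculate(
        metric('weekly_active_per_day'),
        grain='day',
        end_date='2022-11-18'  -- literal only; no date functions yet, see #27
        )
    }}
```

The downside, of course, is that a literal has to be updated by hand, which is why #27 exists.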

Does that make sense?

@JoeryV following up on this one before I close out this issue!

Hahah, thank you Callum! This was a good exercise in confusion. end_date, combined with the functionality from #27, should indeed solve the problem. I'll add my support there.

Thanks for clearing it up! All good to close the issue.