[SEMANTIC-88] [Bug] window with time_grain=day extends beyond current date
Closed this issue · 15 comments
Is this a new bug in dbt_metrics?
- I believe this is a new bug in dbt_metrics
- I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
Using a rolling window produces output that extends past the current date into the future.
The following metric definition:

```yaml
- name: weekly_active_per_day
  label: rolling weekly active users
  description: "The number of active users in the last 7 days"
  model: ref("fct_sign_ins")
  timestamp: dim_date_day
  time_grains: [day]
  calculation_method: count_distinct
  expression: dim_user_id
  window:
    count: 7
    period: day
```
with the current date being 2022-11-18, this query results in the following output:
As can be seen, the query extends 7 days into the future.
Expected Behavior
The max date in the output would be the current date.
Steps To Reproduce
```sql
{% set my_metric_yml -%}
{% raw %}
metrics:
  - name: weekly_active_per_day
    label: rolling weekly active users
    description: "The number of active users in the last 7 days"
    model: ref("fct_sign_ins")
    timestamp: dim_date_day
    time_grains: [day]
    calculation_method: count_distinct
    expression: dim_user_id
    window:
      count: 7
      period: day
{% endraw %}
{%- endset %}

select *
from {{ metrics.develop(
    develop_yml=my_metric_yml,
    metric_list=['weekly_active_per_day'],
    grain='day'
    )
}}
```
Relevant log output
No response
Environment
- dbt-adapter & version: dbt-snowflake==1.3.0
- dbt_metrics version: v1.3.1
Which database adapter are you using with dbt?
snowflake
Additional Context
The reason for this bug probably lies in gen_base_query.sql: the date in the base_model is defined as being relative to calendar_table.date_day instead of the reverse.
Hi @JoeryV ! You have powers of prophecy because this was actually identified in a previous issue #181 and fixed in #183. This will go out by EOD in the new 1.3.2 release.
The fix for this was changing the `>=` to just `>` for the first join, so as not to include the extra days looking back. I haven't tested this out on negative numbers though - would you be able to run on `main` and see if that resolves your problem?
Hi @callum-mcdata,
Thanks for the quick reply!
I think that #181 was a separate issue from the one above. The issue here is that the window range is being extended into the future, whereas #181 is about an extra day of history being considered because of the inclusive `>=`.
(The negative number bit was a mistake in the code above, which I have now edited.)
Glancing over the code from #183, I don't think the above issue has been solved yet. Unfortunately, I cannot test it at the moment, as we are using the dbt-snowflake Docker image, which is currently stuck at dbt-core version 1.3.0, which in turn seems to be incompatible with dbt-metrics v1.3.2.
@JoeryV do you mind testing this again? The issue with 1.3.2 was around the Hub not picking up the newest version, but we fixed that this morning!
I am a bit curious to know how future dates could be picked up if the number wasn't defined as a negative value. I'm abstracting a bit, but the code should compile down to something like the following, which would give you 7 previous days of data up to the "current_date" of whatever the row was.

```sql
left join calendar_table
    on 2022-11-18 > 2022-11-11
    and 2022-11-18 <= 2022-11-18
```

So I'm not entirely sure how it would show future values if the value was set to `7` and not `-7`.
@callum-mcdata First off, thank you for fixing the hub issue! 😄
I tried it with the new version and the issue persists. The reason for this issue is the following: calendar_table has dates into the future. That results in the following join condition also being considered:

```sql
left join calendar_table
    on 2022-11-18 > 2022-11-17
    and 2022-11-18 <= 2022-11-24
```

and this evaluates to TRUE.
I'm not sure what is causing it. Initially I thought the fix would be switching the `<=` comparison to `>=`, but this doesn't resolve the issue, as it pulls the data to an earlier date instead. Maybe the combination of the `has_data` column and the `gen_filters` macro are involved?
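To make the mechanism above concrete, here is a minimal Python sketch (not dbt_metrics code; the dates and window size simply mirror the example in this thread) of the join predicate `event_date > date_day - 7 AND event_date <= date_day` applied against a calendar spine that extends past today:

```python
from datetime import date, timedelta

def matching_calendar_days(event_date, calendar, window_days=7):
    """Calendar dates whose trailing window contains event_date,
    mirroring the join: event_date > d - window AND event_date <= d."""
    return [
        d for d in calendar
        if d - timedelta(days=window_days) < event_date <= d
    ]

# A calendar spine that, like dbt_metrics_default_calendar, extends into the future.
calendar = [date(2022, 11, 1) + timedelta(days=i) for i in range(60)]

# A sign-in on the "current" date, 2022-11-18, satisfies the join for every
# calendar day from 2022-11-18 through 2022-11-24 - six of them in the future.
days = matching_calendar_days(date(2022, 11, 18), calendar)
print(days[0], days[-1])  # 2022-11-18 2022-11-24
```

Because the calendar rows themselves extend into the future, any event on the current date matches calendar rows up to `window - 1` days ahead, which is exactly the behavior reported.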
Can you provide the SQL of the first two CTEs (aggregate & base)? We can take a look at the date joins and see what might be going on there that would be throwing future dates into your dataset.
Side note: I'm not sure, but from the comment "pulls the data to an earlier date" it sort of sounds like you aren't looking for a cumulative window metric. Just to clarify 100%: you are looking for a metric that is aggregated over the range of 7 days, correct?
Taking the last point first: I see how my wording is confusing, but the window metric is intended. What I meant is the following.
Running the query today (using `<=`) will result in the following output:
Running the query with `>=` in `gen_calendar_table` instead will result in the following output:
i.e. it "pulls the data to an earlier date".
To the first point, this is the entire compiled SQL for my `develop` metric:
```sql
select *
from -- depends on: analytics.dbt_jdevos.dbt_metrics_default_calendar
(
    with calendar as (
        select *
        from analytics.dbt_jdevos.dbt_metrics_default_calendar
    ),

    weekly_active_users_per_day__aggregate as (
        select
            date_day,
            count(distinct property_to_aggregate) as weekly_active_users_per_day,
            boolor_agg(metric_date_day is not null) as has_data
        from (
            select
                cast(base_model.dim_date_day as date) as metric_date_day, -- timestamp field
                calendar_table.date_day as date_day,
                calendar_table.date_day as window_filter_date,
                (dim_user_id) as property_to_aggregate
            from analytics.dbt_jdevos.fct_sign_ins base_model
            left join analytics.dbt_jdevos.dbt_metrics_default_calendar calendar_table
                on cast(base_model.dim_date_day as date) > dateadd(day, -7, calendar_table.date_day)
                and cast(base_model.dim_date_day as date) <= calendar_table.date_day
            where 1 = 1
        ) as base_query
        where 1 = 1
            and date_day = window_filter_date
        group by 1
    ),

    weekly_active_users_per_day__spine_time as (
        select calendar.date_day
        from calendar
        group by 1
    ),

    weekly_active_users_per_day__final as (
        select
            parent_metric_cte.date_day,
            coalesce(weekly_active_users_per_day, 0) as weekly_active_users_per_day
        from weekly_active_users_per_day__spine_time as parent_metric_cte
        left outer join weekly_active_users_per_day__aggregate
            using (date_day)
        where (
            parent_metric_cte.date_day >= (
                select min(case when has_data then date_day end)
                from weekly_active_users_per_day__aggregate
            )
            and parent_metric_cte.date_day <= (
                select max(case when has_data then date_day end)
                from weekly_active_users_per_day__aggregate
            )
        )
    ),

    secondary_calculations as (
        select *
        from weekly_active_users_per_day__final
    )

    select *
    from weekly_active_users_per_day__final
    order by 1 desc
) metric_subq
```
This one is tricky - I am unable to recreate it on my local. Can you return the results of `weekly_active_users_per_day__aggregate` and see if they look correct? I cannot think of a way that the result set could get returned like the screenshot you provided. That looks like a running cumulative calculation even beyond the range of 7 days, which seems impossible given the logic you provided.
Hi @callum-mcdata,
I am not entirely sure what you mean by `weekly_active_users_per_day__aggregate`, but I'm guessing you want me to rerun and verify the results? I have done so below.
The date range is still being extended into the future.
@JoeryV sorry I wasn't clear - can you return the results of just the CTE `weekly_active_users_per_day__aggregate` in the compiled SQL?
Additionally, did you update your package to 1.3.2, or are you still on 1.3.1?
Ah, sorry I didn't catch that. See the compiled query below:
```sql
weekly_active_per_day__aggregate as (
    select
        date_day,
        count(distinct property_to_aggregate) as weekly_active_per_day,
        boolor_agg(metric_date_day is not null) as has_data
    from (
        select
            cast(base_model.dim_date_day as date) as metric_date_day, -- timestamp field
            calendar_table.date_day as date_day,
            calendar_table.date_day as window_filter_date,
            (dim_user_id) as property_to_aggregate
        from analytics.dbt_jdevos.fct_sign_ins base_model
        left join analytics.dbt_jdevos.dbt_metrics_default_calendar calendar_table
            on cast(base_model.dim_date_day as date) > dateadd(day, -7, calendar_table.date_day)
            and cast(base_model.dim_date_day as date) <= calendar_table.date_day
        where 1 = 1
    ) as base_query
    where 1 = 1
        and date_day = window_filter_date
    group by 1
)
```
We are on the 1.3.2 version of dbt_metrics.
AHHH - I think I finally understand what you mean and where this is coming from! It's not that the data is wrong, but rather that it represents dates in the future. Got it 🤦. Not sure why it took me this long to understand what exactly the issue was.
All righty, so the reason we're representing dates in the future is that, in a 7-day lookback, a transaction on 2022-01-01 is included in every lookback up until 2022-01-08. That's how we structure the data in the join.
This is because metrics have no concept of current_date - there are datasets where representations of data in the future are perfectly valid (think forecasts).
What I suspect you want is to use `start_date` in the calculate macro to limit the date range so that it doesn't display dates in the future. Right now we don't support date functions for this (#27), but go add comments there if that is the correct issue!
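Until date functions are supported there, one workaround is to filter the metric output downstream. A minimal Python sketch of that idea (the row values here are made up for illustration; in SQL the equivalent would be wrapping the macro call in a query with `where date_day <= current_date`):

```python
from datetime import date

# Hypothetical metric output rows: (date_day, weekly_active_per_day),
# including the future dates produced by the calendar join.
rows = [
    (date(2022, 11, 17), 40),
    (date(2022, 11, 18), 42),
    (date(2022, 11, 19), 42),  # future relative to 2022-11-18
    (date(2022, 11, 24), 12),  # future
]

today = date(2022, 11, 18)  # stand-in for current_date
filtered = [r for r in rows if r[0] <= today]
print(filtered[-1][0])  # 2022-11-18
```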
Does that make sense?
@JoeryV following up on this one before I close out this issue!