Velir/dbt-ga4

Enabling the Fresh Daily export in GA4 breaks base_ga4__events model

arifsenolwp opened this issue · 3 comments

There are 3 options in GA4 to export data to Bigquery -> Daily, Fresh Daily and Streaming.

Enabling Daily creates sharded tables called events_*
Enabling Fresh Daily creates sharded tables called events_fresh_*
Enabling Streaming creates sharded tables called events_intraday_*

base_ga4__events model tries to parse the table names under the same dataset provided in config files and it does not cover events_fresh_* ones.

This condition fails with Bad int64 value: fresh_xx ;

where cast(left(replace(_table_suffix, 'intraday_', ''), 8) as int64) >= {{var('start_date')}}
    {% if is_incremental() %}
        and parse_date('%Y%m%d', left(replace(_table_suffix, 'intraday_', ''), 8)) in ({{ partitions_to_replace | join(',') }})
    {% endif %}

Since we can not change the destination dataset for the Fresh Daily export, enabling it breaks the model.

Either we can include the fresh_ keyword in parsing logic or use variables to decide what to replace , include or exclude.

Including fresh_ keyword in parsing logic may not be a good idea though, since the event_params seems to be different for some rows with intraday_ and fresh_ tables, this may cause duplication issues.

Thanks for reporting this. I don't think any of the regular contributors to this package have access to the fresh export so we're going to have to rely on you for both testing and logic.

Looking at the documentation, we have this:

The Fresh Daily export is updated approximately every 60 minutes throughout the day.

So, based on the documentation, you are correct about duplicate events.

I think the ideal behavior here is to detect the presence of the Fresh Daily export, get the max timestamp value and then use streaming rows for values above that timestamp.

I think the place to achieve this is in the combine_property_data macro. I've got some fairly major changes to this macro in a draft PR that I actually can properly test now so I recommend waiting for me to get that across the finish line before working on implementing new logic in the macro.

I think the short-term solution is to exclude the Fresh Daily data so that the package works right away and let people choose to customize the package if they want to use Fresh Daily instead of Streaming.

@arifsenolwp, how long do Fresh Daily partitions persist? Do they delete when moved to Daily partitions?

The Fresh Daily export is not similar like the streaming one but the Daily one. I mean it is kind of Daily one which is updated every 60 minutes and always persists. It never gets delete just like the Daily one.

In this case I think having an option that choose whether Daily or Fresh one can be an option. The streaming one's been deleted every day ( once the Daily one becomes completed ) so there is no duplication issue after 2-3 days of that day.

So probably enabling just the Fresh one is viable for most of the users if they use this dbt package because Daily one and Streaming one are linked ( intraday/streaming one will be deleted only if you enable the Daily one ) and that is ok with our models here. What I mean is, there may be a yml config on which data table the package uses and which data table it should exclude. Daily and Steaming combination works but there is no Fresh and Streaming combination really.

Ignoring combine_property_data for a moment, is supporting "Fresh Daily" export as simple as updating base_ga4__events.sql to replace _fresh and _intraday to handle either case? In this line:

where cast(left(replace(_table_suffix, 'intraday_', ''), 8) as int64) >= {{var('start_date')}}

@arifsenolwp you mentioned some differences in the event params? What differences are there?