Combining events and events_intraday data where GA4 batch export exceeds quota
Closed this issue · 7 comments
A GA4 property with both batch and streaming exports populates the intraday tables in near real time; the batch tables are then populated, usually overnight, and the equivalent streaming partition is deleted.
The batch export, however, is limited to 1,000,000 events per day. When a batch export exceeds that 1 million event threshold, the daily events partition is not created and the intraday partition that holds the streaming data remains in place.
Our current approach to combining the tables simply bolts on everything in the intraday table. It would be better to combine these tables in a more robust way, as unioning the intraday partitions onto the events data results in all intraday partitions being reprocessed.
One possibility is to not distinguish between `events_*` and `events_intraday_*`. The issue with this approach is that I have heard, but not witnessed, that there can be a period of time where equivalent partitions exist on both the `events_*` and `events_intraday_*` tables.
I'd love to hear other ideas for how to combine older intraday tables in such a way that they do not need to be reprocessed.
@willbryant and @adamribaudo-velir, I'd like your opinions on this issue.
I think it might be best for us to remove the `frequency` variable and the `base_ga4__events_intraday` model entirely and treat all `events_` and `events_intraday_` tables the same.
Removing the distinction between those tables would resolve this issue. It might simplify the code. And it would make the package easier to use.
It's not the sort of thing that I'd want to change without discussion first.
Keen, please see #80 - specifically my proposed resolution would get easier with the change you propose, I think?
I agree that handling the intraday table as a special condition in so many places is brittle and I'd be fine dropping the 'frequency' variable if we can find an elegant solution. I just wish I had a better handle on the variations that we need to support. From what I understand, there are 4:
- Daily only
- Daily + Streaming. The intraday table is dropped and converted to a daily shard
- Daily + Streaming when some days have >1M events and no daily shard is written
- Streaming only
I don't have any examples of 3 & 4 to work with which makes me uneasy in producing a solution.
Based on what I know now, a solution may look like this:
- Select from daily and intraday tables into our `base` model and depend on our dedupe clause to remove duplicate rows
- Require users to use the 'static incremental days' setting to ensure any day with partial data is reprocessed
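The two bullets above can be sketched in miniature. This is a hedged Python illustration, not the package's actual SQL: the field names and the choice of event key are my assumptions, but it shows how unioning the daily and intraday shards and deduping on an event key would collapse days that exist in both tables.

```python
# Sketch of the dedupe idea: rows selected from both the daily and the
# intraday shards are collapsed on a surrogate event key, so a day that
# exists in both tables does not produce duplicate rows.
# The key fields below are illustrative, not the package's actual clause.

def dedupe_events(daily_rows, intraday_rows):
    """Union daily and intraday rows, keeping one row per event key."""
    seen = set()
    result = []
    for row in daily_rows + intraday_rows:  # daily rows listed first win ties
        key = (row["event_date"], row["event_timestamp"],
               row["event_name"], row["user_pseudo_id"])
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

daily = [
    {"event_date": "20221025", "event_timestamp": 1,
     "event_name": "page_view", "user_pseudo_id": "u1"},
]
intraday = [
    # same event also present in the intraday shard -> duplicate
    {"event_date": "20221025", "event_timestamp": 1,
     "event_name": "page_view", "user_pseudo_id": "u1"},
    {"event_date": "20221026", "event_timestamp": 2,
     "event_name": "page_view", "user_pseudo_id": "u1"},
]

combined = dedupe_events(daily, intraday)
print(len(combined))  # the duplicate intraday row is dropped
```

The 'static incremental days' setting would then widen the rebuild window so any day with partial intraday data gets reprocessed once the batch shard lands.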
@willbryant regarding your solution in #80, I think @dgitis was saying that when events are >1M then an intraday shard is created and not removed. So your solution to find `events_*` > some integer could pick up very old `events_intraday_XXXXX` shards.
Anyway, I created a PR that runs here, but I'm not sure if I'm oversimplifying things: #188
PR #188 is pretty much what I'd envisioned, but I think maybe we can do one query for both `events_` and `events_intraday_`, as suggested by @willbryant in a comment on #80.
I don't have any sites that fulfill example 3 conditions, but I've seen screenshots posted on Measure Slack.
I've got plenty of experience with example 4. I will test it.
I believe variation 5 would be a multi-site setup that mixes sites using variations 1-4.
I'm thinking that the `_dbt_max_partition` macro could cause an issue if GA4 is two days late and you have both streaming and batch exports set up. Yesterday and two days prior get built from intraday data, so when today's run goes, `_dbt_max_partition` returns yesterday's date, leaving two days prior built from intraday data (admittedly from a full day of intraday data, so it's not terrible, but not ideal either).
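To make the timing concern concrete, here is a hypothetical Python simulation. It assumes the incremental filter reprocesses partitions `>= _dbt_max_partition`; the package's actual predicate may differ, and the dates are invented for illustration.

```python
# Hypothetical simulation: _dbt_max_partition resolves to the max partition
# already built in the destination table, so an incremental run only looks
# at source partitions from that point on. If the two most recent days were
# built from intraday data and the batch export then lands late, only the
# most recent of them is rebuilt -- the older one keeps its intraday data.
from datetime import date, timedelta

today = date(2022, 10, 27)
yesterday = today - timedelta(days=1)
two_days_ago = today - timedelta(days=2)

# Both recent days were built from intraday data on earlier runs.
built_partitions = {two_days_ago, yesterday}
dbt_max_partition = max(built_partitions)  # resolves to yesterday

# Batch shards for both days are now available, but the incremental
# filter only reprocesses partitions >= _dbt_max_partition (assumption).
batch_now_available = {two_days_ago, yesterday}
rebuilt = {d for d in batch_now_available if d >= dbt_max_partition}
stale = batch_now_available - rebuilt

print(sorted(stale))  # two days prior is left built from intraday data
```

Under these assumptions only yesterday gets replaced by batch data, which matches the scenario described above.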
Yea my concern with the approach in #80 was that it would pick up old intraday tables. But if someone can get me a screenshot of what scenario 3 looks like, I could recreate it by hand?
And yea, variations 5-8 would be the same as 1-4 but for multi-site. So 8 variations to test 😕
There was this comment in #80.
> e.g. let's say I ran the pipeline last night, 2022-10-26. The last non-intraday table at that point was the previous day's `events_20221025`. I now want a query that will pick up 2022-10-26 (and 2022-10-27 when it comes), whether the 26th is in `events_intraday_20221026` or `events_20221026`. So I do this by querying `events_*` with a `WHERE _TABLE_SUFFIX > '20221025'` condition.
>
> (I don't need to use `WHERE _TABLE_SUFFIX > '20221025' OR _TABLE_SUFFIX > 'intraday_20221025'` because the `i` of intraday is `>` the `2` of `20221025`, so the one condition works to find both.)
@willbryant is matching both events and events intraday tables with just the `_table_suffix` date condition, which should prevent old intraday tables from being picked up.
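The lexicographic behaviour is easy to check. A small Python sketch of the string ordering BigQuery applies to `_TABLE_SUFFIX`; the caveat in the last comment echoes the earlier concern in this thread about old intraday shards.

```python
# BigQuery compares _TABLE_SUFFIX values as strings, so one lexicographic
# condition can match both the daily and the intraday table suffixes.
checkpoint = "20221025"

assert "20221026" > checkpoint           # the next daily shard matches
assert "intraday_20221026" > checkpoint  # 'i' sorts after '2', so intraday matches too

# Caveat: any intraday suffix sorts after any pure date string, so a very
# old shard such as intraday_20200101 would also satisfy the condition if
# it still existed -- stale-shard cleanup or deduping still matters.
assert "intraday_20200101" > checkpoint
print("one condition finds both table forms")
```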
I think if we test variation 3 against multi-site, then we'll have covered everything adequately. The issue with multi-site is that you can have multiple properties configured differently (we don't support this, but it could happen). Supporting variation 3 for multi-site is likely the equivalent of having a mix of sites on variations 1, 2, and 4.
I tried searching Measure Slack and couldn't find a screenshot. Google's documentation does confirm that the situation for example 3 is mostly as I described.
It's not clear from the documentation whether the intraday partitions are deleted in that case. I'm pretty sure I've seen it stated somewhere that intraday partitions persist when batch partitions are not created. Perhaps we should ask on Measure Slack for confirmation.