Velir/dbt-ga4

Combining events and events_intraday data where GA4 batch export exceeds quota

Closed this issue · 7 comments

A GA4 property with both batch and streaming exports populates the intraday tables in near real time; then, hopefully overnight, the batch tables are populated and the equivalent streaming partition gets deleted.

The batch export, however, is limited to 1,000,000 events per day. When a batch export exceeds that 1 million event threshold, the basic events partition is not created and the intraday partition that holds the streaming data remains in place.

Our current approach to combining the tables bolts on everything in the intraday table. It would be better to combine these tables in a more robust way, as unioning the intraday partitions onto the events data results in all intraday partitions getting reprocessed.

One possibility is to not distinguish between events_* and events_intraday_*. The issue with this approach is that I have heard, but not witnessed, that there can be a period of time where equivalent partitions exist on both the events_* and events_intraday_* tables.

I'd love to hear other ideas for how to combine older intraday tables in such a way as they do not need to be reprocessed.

@willbryant and @adamribaudo-velir, I'd like your opinions on this issue.

I think it might be best for us to remove the frequency variable and the `base_ga4__events_intraday` model entirely and treat all `events_*` and `events_intraday_*` tables the same.

Removing the distinction between those tables would resolve this issue. It might simplify the code. And it would make the package easier to use.

It's not the sort of thing that I'd want to change without discussion first.

Keen, please see #80 - specifically my proposed resolution would get easier with the change you propose, I think?

I agree that handling the intraday table as a special condition in so many places is brittle and I'd be fine dropping the 'frequency' variable if we can find an elegant solution. I just wish I had a better handle on the variations that we need to support. From what I understand, there are 4:

  1. Daily only
  2. Daily + Streaming. The intraday table is dropped and converted to a daily shard
  3. Daily + Streaming when some days have >1M events and no daily shard is written
  4. Streaming only

I don't have any examples of 3 & 4 to work with which makes me uneasy in producing a solution.

Based on what I know now, what a solution may look like is this:

  • Select from daily and intraday tables into our base model and depend on our dedupe clause to remove duplicate rows
  • Require users to use 'static incremental days' setting to ensure any day with partial data is reprocessed
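A sketch of what that combined base-model select could look like — table names here are illustrative, not the package's actual code — relying on the fact that a single `events_*` wildcard matches both daily and intraday shards, with deduplication handled downstream:

```sql
-- Illustrative sketch only: scan daily and intraday shards in one
-- wildcard query. In BigQuery, `events_*` matches both
-- events_YYYYMMDD and events_intraday_YYYYMMDD tables.
select
  *,
  -- Normalize the suffix so 'intraday_20221026' and '20221026'
  -- resolve to the same date.
  parse_date('%Y%m%d', regexp_extract(_table_suffix, r'\d{8}$')) as event_date_dt
from `my-project.analytics_123456.events_*`
-- Duplicate rows from days present in both daily and intraday
-- shards would then be removed by the dedupe clause downstream.
```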

@willbryant regarding your solution in #80, I think @dgitis was saying that when a day exceeds 1M events, an intraday shard is created and never removed. So your solution of finding `events_*` > some integer could pick up very old event_intraday_XXXXX shards.

Anyway, I created a PR here, but I'm not sure if I'm oversimplifying things: #188

PR #188 is pretty much what I'd envisioned, but I think maybe we can do one query for both events and events_intraday, as suggested by @willbryant in a comment on #80.

I don't have any sites that fulfill example 3 conditions, but I've seen screenshots posted on Measure Slack.

I've got plenty of experience with example 4. I will test it.

I believe variation 5 would be multi-site: a mix of sites using variations 1-4.

I'm thinking that the `_dbt_max_partition` macro could cause an issue if GA4's batch export is two days late and you have both streaming and batch exports set up. In that case, yesterday and the day before both get built from intraday data, so when today's run goes, `_dbt_max_partition` returns yesterday's date and leaves the day before built from intraday data (admittedly from a full day of intraday data, so it's not terrible, but not ideal either).
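One way to sidestep that is the 'static incremental days' setting mentioned earlier: always reprocess a fixed trailing window instead of trusting `_dbt_max_partition`. A hedged sketch of what that filter could look like in a dbt incremental model (the variable name and default here are assumptions, not necessarily the package's implementation):

```sql
{% if is_incremental() %}
  -- Rebuild a fixed trailing window of partitions every run, so a
  -- day first built from intraday data gets rebuilt once the batch
  -- shard lands, even if _dbt_max_partition has already advanced.
  where event_date_dt >= date_sub(
    current_date(),
    interval {{ var('static_incremental_days', 3) }} day
  )
{% endif %}
```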

Yea my concern with the approach in #80 was that it would pick up old intraday tables. But if someone can get me a screenshot of what scenario 3 looks like, I could recreate it by hand?

And yea, variations 5-8 would be the same as 1-4 but for multi-site. So 8 variations to test 😕

There was this comment in #80.

eg. let's say I ran the pipeline last night, 2022-10-26. The last non-intraday table at that point was the previous day's events_20221025. I now want a query that will pick up 2022-10-26 (and 2022-10-27 when it comes), whether the 26th is in events_intraday_20221026 or events_20221026.

So I do this by querying events_* with a WHERE _TABLE_SUFFIX > '20221025' condition.

(I don't need to use WHERE _TABLE_SUFFIX > '20221025' OR _TABLE_SUFFIX > 'intraday_20221025' because the i of intraday is > the 2 of 20221025, so the one condition works to find both.)

@willbryant is matching both events and events_intraday tables with just the `_TABLE_SUFFIX` date condition, which should prevent old intraday tables from being picked up.
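The lexicographic trick above can be sanity-checked directly; this illustrative query shows why the single suffix condition catches both shard types while excluding older intraday shards:

```sql
-- BigQuery compares _TABLE_SUFFIX values as strings.
-- '20221026' > '20221025' holds both numerically and lexicographically;
-- 'intraday_20221026' > '20221025' holds because 'i' sorts after '2',
-- so every intraday suffix clears the cutoff string.
-- But note 'intraday_20200101' > '20221025' is also true: the date
-- inside an intraday suffix is never compared against the cutoff,
-- which is why very old leftover intraday shards can slip through.
select
  '20221026'          > '20221025' as daily_matches,     -- true
  'intraday_20221026' > '20221025' as intraday_matches,  -- true
  'intraday_20200101' > '20221025' as old_intraday       -- also true
```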

I think if we test variation 3 against multi-site, then we'll have covered everything adequately. The issue with multi-site is that you can have multiple properties configured differently (we don't support this, but it could happen). Supporting variation 3 for multi-site is likely equivalent to having a mix of sites on variations 1, 2, and 4.

I tried searching Measure Slack and couldn't find a screenshot. Google's documentation does confirm that the situation for example 3 is mostly as I described.

The documentation doesn't clearly state whether the intraday partitions are deleted in that case. I'm pretty sure I've seen it somewhere that intraday partitions persist when batch partitions are not created. Perhaps we should ask on Measure Slack for confirmation.

dgitis commented

This was resolved by #188