Deduplicating transactions
Closed this issue · 4 comments
GA4 by default deduplicates purchases with the same transaction_id
. We should probably do the same.
I'm currently chasing down a bunch of duplicate transactions and I'm noticing a mix of implementation issues, users reloading the confirmation page, and bots that seem to want to crawl these pages several months after the fact.
I think, for performance reasons, we should just commit to deduplicating transactions within the period being processed and not look back beyond that.
We could do this in stg_ga4__event_purchase
, but we would need to unnest the transaction_id
from the items array to do it there. Another option is to create a sample fct_ga4__event_purchase
table that people can enable and fix it there.
We may also want to consider removing the duplicates from the items table.
Additionally, purchase events where transaction_id
is null should not be tracked as these are invalid according to Google, but they come through to the BigQuery export and currently get processed by our package.
Regardless of what we choose, I think we want to keep the purchase with the oldest timestamp.
I think I do the most ecommerce work, but I'd like @adamribaudo and @willbryant to weigh in here. Adam, in particular, has preferred to keep our event staging models consistent and one of the options here involves quite a few changes to one of those models.
@dgitis quick note: the transaction_id
is part of the normal event params, not the items array. We already unnest it in stg_ga4__event_purchase.sql
.
Seems like it would be straight-forward enough to create a stg_ga4__event_purchase_deduped
model on top that uses a window function over the incremental load to dedupe. That sound good?
I presume that we would put this in the recommended-events folder and set it to disabled by default and then adjust the recommended events documentation accordingly.
I think we should do this for the stg_ga4__event_items
model as well possibly using a check for whether the deduped table exists and use the deduped table for purchase items if it does while leaving all other ecommerce event items unchanged.
I'm going to edit this issue and add one removing transaction_id
is null. Those are invalid purchase events according to Google, but they seem to come through the raw data export just fine.
I'll be working on this, finally, this week.