Raccoon needs to add ingestion time to every event
chakravarthyvp opened this issue · 2 comments
Problem
We (GoJek) currently use Raccoon to source clickstream events from the gojek app. The concrete product proto contains an event_timestamp field, which downstream systems such as the DWH can use to partition the data. However, we see some data land in partitions for future dates, while other data with the same event timestamp date arrives on different days. Two scenarios cause this issue:
- The time/clock in the mobile app is reset by the user to a future date
- The app was inactive and those events were sent at a later point of time by the mobile sdk
Is there any workaround?
The DWH can partition on a field that records the ingestion time into the warehouse. However, this needs backfills and repartitions on existing data, and the upstream applications may need to change the way they query.
What is the impact?
Queries from upstream applications and services return erroneous results.
Which version was this found?
NA
Solution
Raccoon needs to provide an ingestion time for each event. The ingestion time should be considered the time at which the event was ingested into Raccoon. This enables the DWH to partition data on the ingestion time as an alternative to event_timestamp.
I'm assuming you mean that the ingestion time sits between the event time and the processing time. If so, how would you want the ingestion time to be integrated into Raccoon? Are we supposed to add the field to the product proto itself, or do something else?
Apologies for the newbie question, I don't have any professional experience working with real-time data, I'm just a student.😓
@burntcarrot - You have a very valid question. The ingestion time should be part of the product proto, which is serialised by the client as bytes in the Event.proto. Since Raccoon is event agnostic, injecting a timestamp into the product proto would mean that the product proto needs to be deserialized first in Raccoon, and this breaks the architectural principle. What Raccoon needs is a generic metadata proto that the product protos compose; that way Raccoon can deserialise using the generic proto and inject this timestamp.
Do you have other suggestions?