meltano/sdk

feat: incremental sync is stuck replicating everything due to infrequent bulk updates

pnadolny13 opened this issue · 0 comments

Similar to #1200

I've run into several use cases over the past few years where a replication job syncs incrementally on something like an updated_at timestamp and, because of a bulk update, the entire dataset is updated at once, so the timestamp matches across all records. In this case the first incremental sync replicates all records and saves that timestamp in the state file. On the next incremental sync nothing has been updated since, yet we replicate the entire dataset again. This is due to the >= logic that's recommended in the Singer community: every record matches the = part of the comparison.
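
To make the failure mode concrete, here is a minimal sketch in plain Python. It is not the SDK's actual implementation: `sync_gte` and the record layout are hypothetical, and the bookmark is assumed to be the maximum updated_at seen during the sync.

```python
from datetime import datetime, timezone

# Assumed timestamp of the bulk update; every record ends up with this value.
BULK_UPDATE = datetime(2024, 1, 18, 1, 0, tzinfo=timezone.utc)
records = [{"id": i, "updated_at": BULK_UPDATE} for i in range(5)]


def sync_gte(records, bookmark):
    """Hypothetical incremental sync using the inclusive (>=) comparison."""
    selected = [r for r in records if r["updated_at"] >= bookmark]
    # The new bookmark is the largest replication key seen, or the old
    # bookmark if nothing matched.
    new_bookmark = max((r["updated_at"] for r in selected), default=bookmark)
    return selected, new_bookmark


bookmark = datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)
first, bookmark = sync_gte(records, bookmark)   # replicates everything
second, bookmark = sync_gte(records, bookmark)  # replicates everything again
```

Because the saved bookmark equals every record's updated_at, each later run with >= selects the full dataset and leaves the bookmark unchanged.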

Is there a way that we could solve this? Maybe we keep track of the previous successful sync's bookmark so we can toggle between >= and > if we already ran a sync using that bookmark. I think this still risks the tie situation where new records arrived exactly on that bookmark value since the last sync. Although if it's a timestamp, maybe we can combine it with the time of the last sync to know that we've already synced that bookmark, and that the bookmark is far enough in the past that it's safe to assume all ties were captured.

Situation:

  • On 2024-01-17T01:00:00+00:00 we ran a sync and the bookmark was 2024-01-01T01:00:00+00:00
  • All records get bulk updated in the system
  • The next day on 2024-01-18T01:00:00+00:00 we run a sync with >= 2024-01-01T01:00:00+00:00 logic. Everything gets replicated, and the new bookmark is exactly 2024-01-18T01:00:00+00:00, the same as our runtime, which opens up the possibility of ties.
  • Nothing in the system is updated
  • The next day on 2024-01-19T01:00:00+00:00 we see that our bookmark 2024-01-18T01:00:00+00:00 and the previous runtime 2024-01-18T01:00:00+00:00 are too close to be sure there aren't ties. So we continue using >= logic, everything is synced, and the bookmark 2024-01-18T01:00:00+00:00 is once again saved unchanged.
  • Nothing in the system is updated
  • The next day on 2024-01-20T01:00:00+00:00 we see that our bookmark 2024-01-18T01:00:00+00:00 is unchanged from the previous successful sync, and that the previous successful sync (2024-01-19T01:00:00+00:00) ran sufficiently after the bookmark value, so we know all ties are handled. We attempt a sync with only > logic and no records are synced... success!
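
The last two runs of this situation could be sketched roughly as follows. This is only an illustration of the idea, not existing SDK behavior: `use_strict`, `run_sync`, the state keys, and the one-hour `TIE_MARGIN` are all hypothetical, and it assumes the source's timestamps and the sync runtime come from comparable clocks.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical safety margin: how long after the bookmark the last successful
# sync must have run before we trust that no more ties can arrive.
TIE_MARGIN = timedelta(hours=1)


def utc(day, hour=1):
    """Helper for January 2024 timestamps used in the scenario."""
    return datetime(2024, 1, day, hour, tzinfo=timezone.utc)


def use_strict(bookmark, last_sync_time, margin=TIE_MARGIN):
    """True when the last successful sync ran far enough after the bookmark
    that records tied on the bookmark should already be replicated."""
    return last_sync_time is not None and last_sync_time - bookmark >= margin


def run_sync(records, state, now):
    bookmark = state["bookmark"]
    if use_strict(bookmark, state.get("last_sync_time")):
        selected = [r for r in records if r["updated_at"] > bookmark]
    else:
        selected = [r for r in records if r["updated_at"] >= bookmark]
    new_state = {
        # Keep the old bookmark when nothing was selected.
        "bookmark": max((r["updated_at"] for r in selected), default=bookmark),
        "last_sync_time": now,
    }
    return selected, new_state


# State as saved by the 2024-01-18 run: bookmark equals the runtime.
records = [{"id": i, "updated_at": utc(18)} for i in range(3)]
state = {"bookmark": utc(18), "last_sync_time": utc(18)}

day19, state = run_sync(records, state, now=utc(19))  # too close: uses >=
day20, state = run_sync(records, state, now=utc(20))  # safe: uses >
```

In this sketch the 2024-01-19 run still replicates all three records, while the 2024-01-20 run selects none and leaves the bookmark at 2024-01-18T01:00:00+00:00. The right margin would depend on how late the source can write records carrying an older timestamp.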

It's a bit hand-wavy, but I wonder if we could make something like this work.

cc @edgarrmondragon