MeltanoLabs/Singer-Working-Group

Standardization of `ACTIVATE_VERSION` message

dmosorast opened this issue · 6 comments

This is something that's been frequently buzzing about the community. The "activate version" message has been used by taps in the past to track the "version" of some data. The semantics of this message haven't been formalized, so I'd like to track that conversation here, and provide my historical knowledge on the subject.

I don't have a clear picture pulled together just yet, but just getting this topic pinned for now.

@dmosorast Are there any signals earlier in the stream that the target can use to predict if the stream is later going to receive an ACTIVATE_VERSION message? This could have influence on how the rather chooses to store records during the sync operation.

@aaronsteers That's a good point. I would say that part of the way we have implemented this in existing taps is that a tap that uses it should always send an ACTIVATE_VERSION message if it doesn't have a version saved in the stream's state. The effect of this as a signal to the target would be what you're suggesting, as well as a notification when a stream's state has been reset, and thus a new historical extraction is coming down the pipe.

I haven't had time to sit down, but the gist of my thoughts for outlining purposes here (while I'm here 😄 ) are:

  • The ACTIVATE_VERSION message is a low-level control signal
  • It requires a historical tracker in the target of which version is the "active" version
  • It does not dictate what the target will do when a new version becomes active
    • The general behavior is a truncate, but if the target is tracking historical datasets or something like that, a non-destructive method is acceptable
  • The version number must be strictly monotonically increasing (millisecond epoch timestamp is used for simplicity)
  • Going backwards in version is not allowed (time only moves forward)
  • Getting the same version activated over and over is a noop

Because it's a low-level signal, it doesn't support any specific feature, but more of a specific behavior which the target can take in the context of its own destination location. That said, the most common is to bring hard deletes to an EL flow when a full dataset refresh is required.

Super helpful to better understand the details here, @dmosorast. Thanks for this.

To answer my question from above in this context:

Are there any signals earlier in the stream that the target can use to predict if the stream is later going to receive an ACTIVATE_VERSION message?

  • Yes, if a target chooses to after the first ACTIVATE_VERSION message is received for a stream, it can permanently mark the table "versioned" for purposes of how to process future rows.
    • An easy and common heuristic for this is by noting the presence of an _sdc_table_version column in the target table, which the target would create for versioned tables.

Side note: If the tap changes upstream to no longer send ACTIVATE_VERSION messages, then target table (or other tracking mechanism) may need to be reconfigured or recreated. But this is a niche case for target developers, which I don't think we need to solve for here.

I agree. The details of what the target developer is required to do to react to a version message or how to track this state is something that I'd like to keep from specifying. This is just a sort of state-like stream-level metadata.

The question of what to do if a table becomes non-versioned. That's probably what I would call a breaking change (like a PK change, or a column datatype change), which requires some case-specific means of handling. In this scenario, if a stream is known to no longer be "versioned", I would expect something like deleting the target table and backfilling with the tap to be the approach to get it "non-versioned", but that scenario is rather awkward.

Honestly, there's likely no harm in just tracking a version for every stream, it's just not something that has been common, since many SaaS sources don't allow hard deletes, so we've chosen to not add the complexity in those cases.

The question of what to do if a table becomes non-versioned. That's probably what I would call a breaking change (like a PK change, or a column datatype change), which requires some case-specific means of handling.

Agreed. 👍 Very well put.

Linking to related item #8, which would allow taps and targets to advertise that they support ACTIVATE_VERSION messages.