meltano/sdk

[Feature]: Support toggling `>=` incremental logic for SQL taps

Opened this issue ยท 6 comments

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

The Singer norm is to use >= when doing an incremental sync with a bookmark. Its for good reason but if someone knows they arent at risk of missing records using only > then we should allow it.

I couldnt find where its documented but I know its to avoid ties that might have been missed on the last sync. For example if the bookmark is a date, and an incremental sync happens early on in the day, then more records arrive within that day, using > would skip those that arrive later in the day and the incremental sync wouldnt have worked as expected. So >= uses "at least once" semantics and requires deduping on the target end.

The use case is for reverse ETL where the target cant handle dupes. Specifically Squared's target-apprise implementation posting to singer-ecosystem-activity in slack. There will always be < 5 records so ties are almost impossible and I think I'd honestly prefer the edge case to be not sending a record than sending it twice. In this case I'd like the option to override the default >= filter to use >.

Another case is if someone uses an auto incrementing PK integer in a postgres database then theres no reason to be inclusive (although it has pretty minimal harm also), the user should have the preference if they know what theyre doing.

I suggest we add something like incremental_filter_inclusive or inclusive_filter (names are tough ๐Ÿ˜„ ) that defaults to the status quo but allows us to override.

Am I misunderstanding anything?

@pnadolny13 - Thanks for this suggestion, and thanks for raising. You make really good points about the fact that some developers can confidently assert that their use case does not include cases where the > comparison is insufficient. I think we have two paths forward:

On the developer side:

I am coming around to agreeing that developers should be able to declare if their replication key should be compared using > or >= operator.

If a developer does implement a faulty or 'lossy' comparison logic, that would rightfully be reported as a bug in the tap. Even if lost records only occur in 1 out of 10 million or 100 million instances, those are guaranteed to occur for a significant number of total users of the tap. It is absolutely critical that Singer maintain the promise of 'at least once' delivery because other systems that can only promise a certain amount of 'nines' are just completely out of the question when it comes to the needs of dependable ELT data pipelines. We need users to be able to assert not just 'it's very very unlikely Meltano missed records' but rather: 'it is systematically impossible for any records to be missed'.

On the user side:

I don't think the user should not have an option to take the gamble of maybe missing out on records, but we can give them access to "opt in" to a more advanced and costly dedupe algorithm. At the cost of a larger STATE artifact, and some additional compute time for comparisons, we can make this a universal config option for all SDK-based taps. I've added a proposed spec that would deliver this in:

cc @edgarrmondragon to check my logic here ๐Ÿ˜„

@aaronsteers thanks for your thoughts! I agree that we should default to current state thats the safest most robust way of running the tap but yeah having 2 power user mechanisms available would be awesome:

  1. the option to use only >
  2. dedupe ties using key hashes in the state file i.e. #161

Option 2 would solve my problem also but it might also be nice to have Option 1 to skip the hashing of IDs in the state file for the case that I'm super confident > will work and accept the risks. If we allow option 1 we should document it well and potentially log a warning or something so its abundantly clear that this is a risky move and should only be used if you know what youre doing.

stale commented

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

Still relevant

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

Still relevant