singer-io/tap-postgres

During incremental copy, the last row is always updated

Opened this issue · 6 comments

I cannot think of a reason why we have ">=" instead of just ">" for INCREMENTAL replication logic?

It seems that we always end up updating the last row. Why is this a desired behavior?

if replication_key_value:
select_sql = """SELECT {}
FROM {}
WHERE {} >= '{}'::{}
ORDER BY {} ASC""".format(','.join(escaped_columns),
post_db.fully_qualified_table_name(schema_name, stream['table_name']),
post_db.prepare_columns_sql(replication_key), replication_key_value, replication_key_sql_datatype,
post_db.prepare_columns_sql(replication_key))

I have the same doubt.

I agree, it should be ">". I've tested this change and am no longer getting the duplicates. Hoping we can get this change made to master.

In fact, it's normal to have ">=" to avoid loss of data.
If 2 row are written on disk sequentially but have the same "updated_at" value, and the extraction is made between the 2 writes, you will lose data.

What's the best way to avoid duplicated rows then? I haven't run into this issue with any of the other taps I've worked with.

The target must manage that and upsert or merge the New data.

@NicolasRisi Any recommends when original table doesn't have the PK but target has PK?

so WHERE {} >= '{}'::{}

this makes duplicated key for the last row as @timmysuh mentioned when rerun the query in this case.