singer-io/tap-postgres

Add secondary watermark to robustly handle the case where a single xmin spans many rows

Opened this issue · 0 comments

drdee commented

Currently, the xmin system pseudo-column is used as the watermark column for the initial load of a table. When the table is very large and ingesting it takes more than 6 hours, one of the following problems occurs:

  1. The job gets stuck on a single xmin because it never processes all rows with that xmin within one 6-hour window. Depending on the number of columns in the table, this happens at around 50M rows sharing the same xmin.
  2. The job does get past an xmin with many rows, but it typically takes two attempts. The first attempt starts near the end of the runtime window and fails. The second attempt passes because more runtime is available: that xmin is now the first one processed. However, the first attempt will already have loaded rows into the destination table, so the destination table ends up with duplicate data that has to be cleaned up manually.

Adding support for a secondary watermark, either a timestamp column or an ID field, would prevent both problems.
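
To make the proposal concrete, here is a minimal sketch of how a composite (xmin, id) bookmark could let a run resume in the middle of a large xmin instead of restarting it. It is not the tap's actual code: the function names, the bookmark shape, and the `id` key column are all hypothetical, and the `xmin::text::bigint` cast is an assumption about how xmin would be made comparable. Transaction ID wraparound is ignored. The in-memory `sync` loop only simulates the database so the resume behaviour can be checked without a connection.

```python
from typing import Any, Dict, List, Optional, Tuple

Bookmark = Tuple[int, int]  # hypothetical shape: (xmin, secondary key)


def build_resume_query(table: str, key_column: str,
                       bookmark: Optional[Bookmark], limit: int):
    """Build (sql, params) for the next batch, ordered by (xmin, key).

    Assumption: identifiers are trusted and already quoted by the caller.
    """
    xmin_expr = "xmin::text::bigint"  # assumed cast so xmin can be ordered
    sql = f"SELECT {xmin_expr} AS _xmin, * FROM {table}"
    params: List[Any] = []
    if bookmark is not None:
        # Row-wise comparison: resume strictly after the last emitted row.
        sql += f" WHERE ({xmin_expr}, {key_column}) > (%s, %s)"
        params.extend(bookmark)
    sql += f" ORDER BY {xmin_expr}, {key_column} LIMIT %s"
    params.append(limit)
    return sql, params


def fetch_batch(rows: List[Dict[str, int]],
                bookmark: Optional[Bookmark], limit: int):
    """In-memory stand-in for running the query above."""
    ordered = sorted(rows, key=lambda r: (r["xmin"], r["id"]))
    if bookmark is not None:
        ordered = [r for r in ordered if (r["xmin"], r["id"]) > bookmark]
    return ordered[:limit]


def sync(rows: List[Dict[str, int]], limit: int,
         bookmark: Optional[Bookmark] = None,
         max_batches: Optional[int] = None):
    """Emit ids batch by batch; max_batches simulates the runtime window."""
    emitted: List[int] = []
    batches = 0
    while max_batches is None or batches < max_batches:
        batch = fetch_batch(rows, bookmark, limit)
        if not batch:
            break
        emitted.extend(r["id"] for r in batch)
        # Persisting this after every batch is what makes a restart safe.
        bookmark = (batch[-1]["xmin"], batch[-1]["id"])
        batches += 1
    return emitted, bookmark


# Ten rows that all share one xmin, as in the problem described above.
rows = [{"xmin": 100, "id": i} for i in range(1, 11)]
first, saved = sync(rows, limit=3, max_batches=2)   # run cut off early
second, _ = sync(rows, limit=3, bookmark=saved)     # next run resumes
```

In this sketch the second run picks up after the last row the first run emitted, so no row is loaded twice even though every row has the same xmin. A timestamp column could take the place of `id` only if it is unique enough to give a strict ordering within one xmin.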