singer-io/tap-postgres

Logical replication is not advancing if high amount of changes generated in different database

Opened this issue · 0 comments

Background:

  • Current code flushing changes and updates bookmark only when it encounters any messages for the replication slot.
  • Postgres reading log files sequentially but responds only with messages for replication slot database
  • If another database has a large amount of changes - Postgres does not have enough time to read the large log file and send messages for current replication slot that will advance LSN

Infrastructure:
Postgress instance:

  • database1 -> Generates xGB of log file in short burst -> has replication_slot1 -> replicates every hour
  • database2 -> Generates 2000 records every 6 hours -> has replication_slot2 -> replicates every 6 hours

Scenario:

  1. all of the changes are replicated from both instances and last LSN points to 10000
  2. database1 has a burst and generates 100Gb wal log files and advancing LSN to 20000
  3. database 2 adds 2000 records 1 hour after the burst and advancing LSN to 20100
  4. replication for database2 starts reading changes starting from 10000 LSN
  5. in 30 minutes DB is able to read logs only up to 15000 LSN -> No replication_slot2 messages are produced
  6. Since none of the messages where replicated - no flush has been issued and bookmark was not updated by this part of code
  7. Next time replication starts - it starts again from 10000 LSN

Possible Solution:

  • Advance bookmark using wal_end data from cursor if there are no messages has been produced
  • Trigger flush_lsn with the same wal_end value.