Logical replication is not advancing if high amount of changes generated in different database

Question

Opened this issue 4 years ago · 0 comments

Background:

Current code flushing changes and updates bookmark only when it encounters any messages for the replication slot.
Postgres reading log files sequentially but responds only with messages for replication slot database
If another database has a large amount of changes - Postgres does not have enough time to read the large log file and send messages for current replication slot that will advance LSN

Infrastructure:
Postgress instance:

database1 -> Generates xGB of log file in short burst -> has replication_slot1 -> replicates every hour
database2 -> Generates 2000 records every 6 hours -> has replication_slot2 -> replicates every 6 hours

Scenario:

all of the changes are replicated from both instances and last LSN points to 10000
database1 has a burst and generates 100Gb wal log files and advancing LSN to 20000
database 2 adds 2000 records 1 hour after the burst and advancing LSN to 20100
replication for database2 starts reading changes starting from 10000 LSN
in 30 minutes DB is able to read logs only up to 15000 LSN -> No replication_slot2 messages are produced
Since none of the messages where replicated - no flush has been issued and bookmark was not updated by this part of code
Next time replication starts - it starts again from 10000 LSN

Possible Solution:

Advance bookmark using wal_end data from cursor if there are no messages has been produced
Trigger flush_lsn with the same wal_end value.