EnterpriseDB/repmgr

2 standbys promoting

Closed this issue · 2 comments

When I simulated a network outage in a 4-node cluster (2 synchronous standbys and 1 asynchronous standby, 2 data centers, 1 witness), the remaining synchronous standby failed over as expected. However, once the network was available again and my script was rejoining the old master and making the other standbys follow the new master, the other synchronous standby also promoted itself, because the database on the old master was stopped. Is it possible to prevent this by configuration, or do I have to rejoin the old master only after the other standbys?

The problem seems to be that a standby cannot know whether another standby has already failed over if it was unreachable during that promotion. Because the order in which servers reappear after a network failure is unpredictable (as is the position of the script's loop at that moment), guarding against a second failover by checking for a flag file written during the first promotion could probably help.
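A minimal sketch of such a flag-file guard, assuming the flag lives on a path that every node can reach (for example a shared mount) and that the filesystem honors exclusive creation. The function names `claim_failover` and `winner` and the flag path are hypothetical and not part of repmgr.

```python
import os
import tempfile


def claim_failover(flag_path: str, node_name: str) -> bool:
    """Return True only for the first caller that manages to create flag_path."""
    try:
        # O_CREAT | O_EXCL raises FileExistsError if the file already exists,
        # so at most one caller can succeed in creating the flag.
        fd = os.open(flag_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as flag:
        flag.write(node_name + "\n")  # record which node claimed the failover
    return True


def winner(flag_path: str) -> str:
    """Return the name of the node that created the flag."""
    with open(flag_path) as flag:
        return flag.read().strip()


if __name__ == "__main__":
    # Demo in a temporary directory; a real setup would use a shared path.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "failover.flag")
        print(claim_failover(path, "standby1"))
        print(claim_failover(path, "standby2"))
        print(winner(path))
```

A promotion wrapper script could call `claim_failover` first and run `repmgr standby promote` only when it returns `True`, falling back to `repmgr standby follow` otherwise. The flag would have to be removed again once the cluster is healthy, otherwise it would block later legitimate failovers.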

I can no longer reproduce the described behavior; probably something went wrong earlier, or newer versions behave differently. Two failovers can still occur at almost the same time after a failure of the master combined with a network split. As I understand it, there are two reasons for this: 1. the third standby is not only asynchronous but also cascading, which I unfortunately didn't mention at first, and 2. the distinction between total nodes and remaining nodes during a failover. I now handle this with a kind of "first-failover-wins" mechanism, so the problem seems to be solved.
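To illustrate the second reason, here is a small sketch of how the choice of reference count changes the outcome of a majority check. It is not repmgr's actual election logic. The node counts are hypothetical, and "remaining nodes" is assumed here to mean the nodes a partition can still reach.

```python
def has_majority(visible: int, reference: int) -> bool:
    """True if the visible nodes form a strict majority of the reference count."""
    return 2 * visible > reference


def decisions(partitions: dict, total: int) -> dict:
    """For each partition, evaluate the majority check against both reference counts."""
    result = {}
    for name, visible in partitions.items():
        result[name] = {
            # Reference is the full cluster size, including unreachable nodes.
            "vs_total": has_majority(visible, total),
            # Reference is only what this partition can still see (assumption).
            "vs_remaining": has_majority(visible, visible),
        }
    return result


if __name__ == "__main__":
    TOTAL_NODES = 5  # hypothetical: master, 3 standbys, witness
    # Hypothetical split after the master fails: each side sees 2 nodes.
    for name, outcome in decisions({"dc1": 2, "dc2": 2}, TOTAL_NODES).items():
        print(name, outcome)
```

Measured against the total, neither side has a majority and nobody promotes. Measured only against the nodes each side can still see, both sides consider themselves a majority, which is how two promotions at almost the same time can arise.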