EnterpriseDB/repmgr

Feature request: Quorum


This is going to be a pretty big feature request, so something like this would probably go into a major release such as version 4.2 or even 5 (if you decide to implement it at all).
Right, so, what is a Quorum?
Before we dive into that question, let's look at what we have today.

First, let's assume a 4-node setup with 1 primary and 3 standbys, 2 nodes in each of two sites that are geographically far apart.
As of today, using the "location" parameter to avoid a split brain works, but it has one flaw: if one of your sites really is down, say because a disaster has occurred, and that site happens to be the one containing the primary, there will be no automatic failover, since repmgrd will assume there is a network disconnection (which is the safe and correct assumption to make).
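
To make the problem concrete, here is a toy model of what the surviving site can actually observe in that 2+2 layout. This is purely my illustration, not repmgr code; the node names are made up:

```python
# Toy model (not repmgr code): what site B can observe in the 2+2 layout above.
# From site B's point of view, "site A was destroyed" and "the WAN link is cut"
# look exactly the same: both site-A nodes simply stop responding.

def site_b_observation(site_a_state: str) -> set:
    """Nodes site B can still reach, given site A's real state."""
    reachable = {"standby2", "standby3"}          # site B always sees itself
    if site_a_state == "healthy":
        reachable |= {"primary", "standby1"}      # WAN up, site A up
    return reachable                              # "down" and "partitioned" look identical

assert site_b_observation("down") == site_b_observation("partitioned")
# Only 2 of 4 nodes are visible either way, so there is no majority to act on,
# and the safe choice (which the location check effectively enforces) is not to promote.
```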

Now let's assume a 5-node setup with 1 primary, 3 standbys and 1 witness: 2 nodes in one site and the other 2 nodes, along with the witness, in the other site (again geographically far apart).
As for using a witness (instead of the "location" parameter), that only works when the primary is in fact in the majority site. That is to say, if a network disconnection happens while the primary is in the minority site, the other site will have a majority, and therefore one of the nodes in that site will be promoted, creating a split brain.
Implementing a quorum could solve this: it would mean having the minority site go down on purpose and the majority site take over with a new primary.
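
Again just as an illustration (a toy vote count I wrote for this issue, not anything repmgr actually does), this is why the witness placement matters:

```python
# Toy vote count (not repmgr code) for the 2+2+witness layout described above.
NODES = {
    "primary":  "site_a", "standby1": "site_a",
    "standby2": "site_b", "standby3": "site_b",
    "witness":  "site_b",   # the witness lives in the standby-only site
}

def votes_visible_from(site: str) -> int:
    """Votes a partition can still count once the inter-site link is cut."""
    return sum(1 for location in NODES.values() if location == site)

print(votes_visible_from("site_a"))  # 2 of 5: minority, but the primary keeps running
print(votes_visible_from("site_b"))  # 3 of 5: majority, so a standby gets promoted
# Result: two writable primaries (split brain), unless the minority side also
# brings its primary down, which is exactly what a quorum rule would add.
```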

When using both a witness and the location parameter, location takes priority and a split brain is avoided, but location's flaw comes back: if a site actually goes down, the other site will not promote anything.

Now, one of the issues here is that 2 sites simply aren't enough. As far as I know, there is nothing two sites can do on their own to tell, in 100% of cases, whether the other site is really down or merely unreachable because of some network issue. In my opinion, the "location" parameter is probably the best you can do with two sites. But then, if the majority site really is down (due to a disaster), the minority site will not come up, which makes it seem pointless to have it in the first place, since its main purpose is to come up when the other site is down.

A third site has the potential to solve this problem. So now let's assume a 5-node setup with 1 primary and 3 standbys, and let's call our special 5th node the arbitrator.
The feature I'm requesting is similar (or perhaps the exact same) as described here: http://galeracluster.com/documentation-webpages/arbitrator.html
and here: http://galeracluster.com/documentation-webpages/weightedquorum.html
Alright, so with the arbitrator in our third site, we can now tell which site should stay up and which site should go down when they can't see each other. How come?

Well, let's say site 1 has the primary and a standby, site 2 has 2 standbys, and site 3 is the arbitrator.
If site 1 gets disconnected from all of the other sites, it goes down because it is in the minority (this is important), and site 2 promotes, with the arbitrator acting as its "witness" that the other site really is disconnected and has brought itself down because it became a minority.
If site 1 is disconnected only from site 2, site 2 can tell through the arbitrator that site 1 is actually still up and will not promote itself; perhaps it could even replicate through the arbitrator, the way a standby would with cascading replication.
This idea does deviate from the Galera links I included, where the arbitrator takes no part in replication. That isn't the main concern, though.
If site 1 goes down, it's simple enough: sites 2 and 3 together form a majority.
If site 2 gets disconnected from both other sites, it is in the minority, so it doesn't self-promote.
What happens when site 3 is the one that is disconnected or down? The rest are a majority, so everything is fine.
Maybe all sites get disconnected at once? Then they are all minorities, and there should be no primary at that point. When connectivity comes back, the node that was primary last should come back up. (I hope that makes sense.)
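
To sum up the behaviour I'm asking for, here is a rough sketch of the decision rule. This is purely illustrative; the vote counting, names and thresholds are my own assumptions about how a quorum could work, not repmgr's implementation:

```python
# Illustrative quorum rule only; not repmgr code.

TOTAL_VOTES = 5  # primary + 3 standbys + arbitrator, one vote each

def has_quorum(reachable_votes: int, total_votes: int = TOTAL_VOTES) -> bool:
    """A partition has quorum only with a strict majority of all configured votes."""
    return reachable_votes > total_votes // 2

def decide(role: str, reachable_votes: int) -> str:
    """What a node should do inside its current network partition."""
    if has_quorum(reachable_votes):
        if role == "primary":
            return "keep running as primary"
        return "promote the best standby if no primary is visible here"
    # Minority side: a primary must stop accepting writes, standbys must sit still.
    return "demote / shut down" if role == "primary" else "do nothing"

# Site 1 (primary + standby) cut off from everything: it can only count 2 of 5 votes.
print(decide("primary", 2))   # -> demote / shut down
# Site 2 (2 standbys) still sees the arbitrator: 3 of 5 votes.
print(decide("standby", 3))   # -> promote the best standby if no primary is visible here
```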

One very important rule in a quorum is to bring down the primary whenever it finds itself caught in the minority. This is the main component missing from what we have today.
So, to answer the initial question: a quorum is this concept of keeping the database as available as possible. It includes this rule, and all the logic that follows from it.

I'm unsure whether all the situations are clear, but I do think a quorum would be ideal. I'll even go as far as to say that if a change like this is implemented in the future, it will solidify repmgr's position in the organization I work for as the main PostgreSQL replication tool for cluster management. We are looking at other tools because of this problem, but that kind of research takes time.

You may already have plans to implement this in the future; however, I couldn't find any discussion on the topic (maybe I wasn't looking hard enough), so I decided I should bring it up.
I would also like to mention that I'm no expert in this topic, but this is one of the main concerns my organization has at the moment.
Hopefully this mountain of text is OK and I was clear; I would very much appreciate your response to this.
If you'd like me to clarify anything, please let me know and I will do my best, thank you :)

Thanks for the detailed explanation, it's very helpful to have feedback based on actual use cases. We can't guarantee anything right now, but we will take a closer look with a view to a (hopefully not too distant) future release.

gplv2 commented

I have the same big issue with how arbitrary the current location handling is. It would already help to verify the following in the decision tree:

Use ping and ssh to see whether there is a network split. Since you already have ssh access, open a remote shell and check for the PID of the postgresql instance; if it's missing, you can assume a few things (see the sketch after this list):

  1. it's not a network split, since you can ping and ssh; the master is down for some other reason, and a slave should be promoted.
  2. if ping works but ssh doesn't, there might be a problem with the machine (unresponsive). It would be a good idea to promote a slave here as well.
  3. if neither ping nor ssh works, it's time to consider a network split. Assuming that nobody just reboots a machine that is running as master, chances are high that there is a network issue.
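
Roughly, I imagine the check working like this. A quick Python sketch with made-up hostnames and timeouts, just to show the decision tree, not how repmgrd actually decides:

```python
#!/usr/bin/env python3
# Sketch of the ping/ssh/PID decision tree above (placeholder hostname and timeouts).

import subprocess

PRIMARY = "primary.example.com"   # placeholder hostname

def ok(cmd):
    """Run a command quietly and report whether it succeeded."""
    return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

def classify():
    ping_ok = ok(["ping", "-c", "3", "-W", "2", PRIMARY])
    ssh_ok = ok(["ssh", "-o", "ConnectTimeout=5", PRIMARY, "true"])
    if ping_ok and ssh_ok:
        postgres_up = ok(["ssh", PRIMARY, "pgrep", "-x", "postgres"])
        # 1. reachable but postgres is gone: not a network split, promote a slave
        return "all healthy" if postgres_up else "postgres down, promote"
    if ping_ok:
        # 2. answers ping but not ssh: machine is unresponsive, promote as well
        return "machine unresponsive, promote"
    # 3. neither ping nor ssh: possibly a network split, do not promote blindly
    return "possible network split, hold off"

if __name__ == "__main__":
    print(classify())
```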

I've been going through the repmgr source code superficially, and I have the impression this is how it is supposed to work (it talks about pings and so on), but in practice I don't see it working. I could use some information on the intended failover strategies.

But if someone does reboot my server: with different location strings, the master will come back and the slaves will follow it, but the service takes an outage. You are protected against network-triggered failovers, but at the same time there is no true HA anymore; in fact, I cannot make it fail over at all on a database crash or a stop of the process. With the same location string, it's quite trigger-happy: a simple micro-cut in the network will trigger a failover. And the problem is that when you have a network cut, the remote commands don't always work, so implementing this with 2 network interfaces might be advisable.

Right now, I don't see it functioning like that in 4.2. There is, imho, a lack of documentation on the flow of checks. I've tested with different location strings: if I shut down the master server, no failover happens; if I shut down the postgres process, no failover happens. It's not very intelligent.

I'm having a huge issue with the way this works right now. The options on the table are:

  • don't autostart postgresql at machine boot, to prevent an 'old' master server from acting as a detached primary again after a slave has already been promoted; having 2 writable nodes causes a ton of issues.

I wrote a haproxy check that cuts off clients in order to prevent data from being written to multiple servers; see https://github.com/gplv2/haproxy-postgresql
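
The core of the idea boils down to something like this (a minimal Python sketch assuming psycopg2 and a working local connection; the actual script in the repo may look different): only the node that is a writable primary reports healthy, so haproxy never routes writes to more than one server.

```python
#!/usr/bin/env python3
# Sketch of a "writable primary" health check; placeholder DSN and credentials.

import sys
import psycopg2

def is_writable_primary(dsn="dbname=postgres user=postgres connect_timeout=3"):
    try:
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT pg_is_in_recovery()")
                in_recovery = cur.fetchone()[0]
                return not in_recovery
        finally:
            conn.close()
    except psycopg2.Error:
        return False

if __name__ == "__main__":
    # Exit 0 (healthy) only on the current primary; every standby is marked down.
    sys.exit(0 if is_writable_primary() else 1)
```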

But still, this is somewhat problematic in the case of a network glitch: I don't want to fail over immediately, I need some grace time. Failing over because of a network problem between nodes in a single datacenter is already bad, but having spontaneous failovers because of it is pretty messy.
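
What I mean by grace time is something like the following debounce, sketched in Python (the numbers and the check callable are placeholders, just to show the idea, not how repmgrd works internally):

```python
# Only declare the primary failed after a sustained run of failed checks,
# so a micro-cut in the network never triggers a failover on its own.

import time

def primary_failed(check, attempts=6, interval=10.0):
    """check() returns True while the primary still looks healthy."""
    for _ in range(attempts):
        if check():
            return False          # a single success calls off the failover
        time.sleep(interval)      # wait out a possible short network glitch
    return True                   # roughly attempts * interval seconds of sustained failure

# Failover logic would only run when primary_failed(<some health check>) returns True.
```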