igniterealtime/openfire-hazelcast-plugin

Enhanced split-brain protection

guusdk opened this issue · 4 comments

As @GregDThomas suggested:
A new enhancement, that would require you to have an odd number of cluster nodes. Basically, assuming three nodes, you have to have two communicating to get a cluster. If you only have one node, you're not clustered.

@GregDThomas: I'm assuming here that the aim of this is to have a resolution where a majority of servers dictates the resulting state?

From first principles (apologies for teaching readers to suck eggs):

In Openfire terms, a split-brain occurs when two (or more) nodes in a cluster both think they are the senior node. E.g. in a two-node cluster, the network between the two nodes is lost, neither node can see that the other node is available, so each assumes it is the senior. ref https://en.wikipedia.org/wiki/Split-brain_(computing).

A typical solution to this problem is to introduce the concept of a quorum value. A quorum value would be (nodecount/2)+1, rounding down - e.g. 2 nodes in a 3-node cluster, 3 nodes in a 4-node cluster, 3 nodes in a 5-node cluster. ref https://en.wikipedia.org/wiki/Quorum_(distributed_computing)
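To make the arithmetic concrete, a minimal sketch in plain Java (names are illustrative, not existing plugin code):

```java
// Quorum as described above: (nodeCount / 2) + 1, using integer division.
public final class QuorumMath {

    static int quorumFor(int clusterSize) {
        return (clusterSize / 2) + 1;
    }

    public static void main(String[] args) {
        for (int size = 2; size <= 5; size++) {
            System.out.println(size + " nodes -> quorum of " + quorumFor(size));
        }
        // Prints: 2 -> 2, 3 -> 2, 4 -> 3, 5 -> 3
    }
}
```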

So a proposal to implement this would be:

(Note the distinction between a Hazelcast cluster and an Openfire cluster - they may be in different states)

If a quorum value is configured, when a node starts, Openfire clustering remains "starting" until the node can see the quorum number of nodes in the Hazelcast cluster. These nodes would then agree on a senior member (currently, it's the oldest member of the cluster; I don't see a need to change that).

When a node leaves the Openfire cluster and the number of remaining nodes is less than the quorum value, the remaining node(s) would disable clustering and then immediately re-enable it. Clustering would then, as above, remain "starting" until the node can see the quorum number of nodes in the cluster.
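Not part of the plugin today, but to sketch how the previous two paragraphs could hang together (assuming the Hazelcast 3.x MembershipListener API; the Openfire-side reactions are left as placeholder comments rather than real plugin calls):

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.MemberAttributeEvent;
import com.hazelcast.core.MembershipEvent;
import com.hazelcast.core.MembershipListener;

/**
 * Illustrative listener that re-evaluates quorum whenever Hazelcast cluster
 * membership changes. Register it via
 * hazelcast.getCluster().addMembershipListener(listener).
 */
public class QuorumMembershipListener implements MembershipListener {

    private final HazelcastInstance hazelcast;
    private final int quorum; // e.g. (anticipatedClusterSize / 2) + 1

    public QuorumMembershipListener(HazelcastInstance hazelcast, int quorum) {
        this.hazelcast = hazelcast;
        this.quorum = quorum;
    }

    @Override
    public void memberAdded(MembershipEvent event) {
        checkQuorum();
    }

    @Override
    public void memberRemoved(MembershipEvent event) {
        checkQuorum();
    }

    @Override
    public void memberAttributeChanged(MemberAttributeEvent event) {
        // Attribute changes do not affect quorum.
    }

    private void checkQuorum() {
        final int visibleMembers = hazelcast.getCluster().getMembers().size();
        if (visibleMembers >= quorum) {
            // Quorum (re)established: allow Openfire clustering to leave the
            // "starting" state and join/resume the Openfire cluster.
        } else {
            // Quorum lost: disable and immediately re-enable Openfire
            // clustering, leaving it "starting" until quorum is seen again.
        }
    }
}
```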

Possible further enhancements:
- While clustering is "starting", waiting for quorum, reject any new connections (XMPP, BOSH, server-to-server, etc.) to that node.
- If clustering is disabled due to a lack of quorum, drop all existing connections.
This would further ensure that the isolated node does not carry out any actions when it is not part of the cluster.

This seems to trade consistency for availability. I can imagine scenarios in which either of the two is preferred. We'd need to make sure that this behavior is highly configurable.

Unless I'm misunderstanding, the suggested approach would basically reduce or remove service for the entire domain when one cluster node fails. My gut feeling says that most deployments would favor not locking out or logging off the entire domain in such a scenario, choosing availability over consistency.

Yes, it is a trade-off. Typically you'd need an odd number of nodes, and just under half of them can fail before you lose the whole cluster.

But to make it explicit, I was only expecting the above behaviour if a quorum number was set. If no quorum was set, behaviour is as it is today.

Ah, right, I misunderstood that. My interpretation was that the entire cluster should grind to a halt when just one node disappears. That's not what you suggested: it's basically when the cluster falls below half-plus-one of the anticipated cluster size.