Split-brain recovery does not follow documented process

Question

GregDThomas opened this issue 5 years ago · 3 comments

GregDThomas commented 5 years ago

Steps to reproduce:

Set up a two-node Openfire cluster. Log in to the admin console of each node and check that both nodes show both cluster members at http://localhost:9090/system-clustering.jsp
On the junior node, disable networking (or remove the network cable).
Confirm that, after a brief period, each node shows itself as the senior member of a single-node cluster.
Re-enable/re-connect the network on the junior node.
Wait for Hazelcast to re-establish the network.

Expected results:

The cluster re-forms, with one senior, one junior member.
Any ClusterEventListener on the junior member receives leftCluster() followed by joinedCluster() events - http://download.igniterealtime.org/openfire/docs/latest/documentation/javadoc/org/jivesoftware/openfire/cluster/ClusterEventListener.html#markedAsSeniorClusterMember--
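
To make the expected ordering concrete, here is a minimal sketch. LocalClusterListener, RecordingListener and sawDemotionCycle are hypothetical names that mirror only the two no-argument callbacks mentioned above; this is not the real Openfire ClusterEventListener interface, just a way to state the expectation as a check.

```java
import java.util.ArrayList;
import java.util.List;

class ExpectedEventOrderDemo {

    /** Hypothetical stand-in for the two callbacks discussed here; not the real Openfire interface. */
    interface LocalClusterListener {
        void joinedCluster();
        void leftCluster();
    }

    /** Records callbacks in arrival order so the sequence can be checked afterwards. */
    static class RecordingListener implements LocalClusterListener {
        final List<String> events = new ArrayList<>();

        @Override
        public void joinedCluster() {
            events.add("joinedCluster");
        }

        @Override
        public void leftCluster() {
            events.add("leftCluster");
        }
    }

    /** True if the most recent two events are leftCluster immediately followed by joinedCluster. */
    static boolean sawDemotionCycle(List<String> events) {
        int n = events.size();
        return n >= 2
            && events.get(n - 2).equals("leftCluster")
            && events.get(n - 1).equals("joinedCluster");
    }

    public static void main(String[] args) {
        RecordingListener listener = new RecordingListener();
        listener.joinedCluster(); // cluster initially forms

        // Network split and recovery with no further callbacks (the behaviour reported here).
        System.out.println("demotion cycle seen (reported):  " + sawDemotionCycle(listener.events));

        // Expected behaviour on recovery: leave the one-node cluster, then rejoin as junior.
        listener.leftCluster();
        listener.joinedCluster();
        System.out.println("demotion cycle seen (expected):  " + sawDemotionCycle(listener.events));
    }
}
```

A listener that resets node-local state in leftCluster() and rebuilds it in joinedCluster() depends on seeing exactly this pair after the cluster re-forms.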

Actual results:

The cluster re-forms, but the junior node does not receive an indication that it has been demoted.

Cluster initially forms:

2019.04.23 15:02:38 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (9c528cce-d3f0-4d6a-9e0d-3fd775b542f2/openfire2.example.com) has joined the cluster

Network is disabled:

2019.04.23 15:05:08 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Another node (62a9c948-9991-4704-a323-4ec937a741cd/<unknown>) has left the cluster
2019.04.23 15:05:14 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (9c528cce-d3f0-4d6a-9e0d-3fd775b542f2/openfire2.example.com) is now the senior member

Network is re-enabled:

2019.04.23 15:07:35 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Another node (62a9c948-9991-4704-a323-4ec937a741cd/openfire1.example.com (10.215.75.172)) has joined the cluster

Answer 1 · 2019-04-24T09:27:30.000Z

Sequence of events now logged as follows:

Cluster initially forms:

2019.04.24 10:17:18 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (8e97db5d-8fb7-422b-bee6-f3a61a9d38b0/openfire2.example.com) has joined the cluster [seniorMember=openfire1.example.com (10.215.75.172)]

Network is disabled:

2019.04.24 10:18:11 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Another node (d63cc58b-44a5-4b29-83f8-cf1e55540965/openfire1.example.com (10.215.75.172)) has left the cluster [seniorMember=openfire2.example.com (10.215.75.174)]
2019.04.24 10:18:11 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Sending message to admins: openfire1.example.com (10.215.75.172) has left the cluster - there is now only 1 node in the cluster (enabled=true)
2019.04.24 10:18:11 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (8e97db5d-8fb7-422b-bee6-f3a61a9d38b0/openfire2.example.com) is now the senior member

Network is re-enabled:

2019.04.24 10:22:14 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (8e97db5d-8fb7-422b-bee6-f3a61a9d38b0/openfire2.example.com) has left the cluster [seniorMember=<unknown>]
2019.04.24 10:22:14 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Sending message to admins: The local node ('openfire2.example.com') has left the cluster - this node no longer has any resilience (enabled=true)
2019.04.24 10:22:14 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (8e97db5d-8fb7-422b-bee6-f3a61a9d38b0/openfire2.example.com) has joined the cluster [seniorMember=openfire1.example.com (10.215.75.172)]

Answer 2 · 2021-08-31T12:48:37.000Z

I believe this fix might have introduced an issue. When recovering from a split-brain scenario:

1. The senior member sees the to-be junior member join the cluster. The org.jivesoftware.openfire.cluster.ClusterEventListener#joinedCluster(byte[]) methods on the senior member get invoked, allowing listeners to process the fact that another member has joined the cluster.
2. The to-be junior member detects that it is no longer senior, and as a result it triggers this new code block:

logger.warn("Recovering from split-brain; firing leftCluster()/joinedCluster() events");
ClusteredCacheFactory.fireLeftClusterAndWaitToComplete(Duration.ofSeconds(30));
logger.debug("Firing joinedCluster() event");
ClusterManager.fireJoinedCluster(true);

The event in step 2 causes the ClusterEventListener#leftCluster() and ClusterEventListener#joinedCluster() event handlers to be triggered on the to-be junior member only (not on the other nodes in the cluster).

A problem arises when the senior member, acting on the event in step 1, sends the to-be junior member data that arrives before step 2 has been executed. The leftCluster() and joinedCluster() invocations in step 2 are likely to 'reset' data on the to-be junior node (which is the reason for step 2 to be executed in the first place, I think). After this has occurred, the data that was already sent by the senior member is lost.

We have been trying to verify the above by introducing a 30+ second delay (which is how long step 2 can take) to the implementation that causes the senior node to send the to-be junior node its data in step 1. This is an attempt to force the data transfer from step 1 to happen after step 2 has finished. This (obviously very sub-optimal) fix did resolve our issues.
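
The ordering problem can be illustrated with a small, deterministic sketch. Everything in it is hypothetical: JuniorNode, receiveFromSenior and leaveAndRejoin are made-up names, and the assumption that the leave/join cycle clears local state is the suspicion described above, not something confirmed from the Openfire source.

```java
import java.util.HashMap;
import java.util.Map;

class SplitBrainRaceDemo {

    /** Hypothetical model of the to-be junior node: local state that the leave/join cycle wipes. */
    static class JuniorNode {
        final Map<String, String> cache = new HashMap<>();

        /** Data pushed by the senior node after it has seen the other member join (step 1). */
        void receiveFromSenior(String key, String value) {
            cache.put(key, value);
        }

        /** Stand-in for the leftCluster()/joinedCluster() cycle (step 2), assumed to reset local state. */
        void leaveAndRejoin() {
            cache.clear();
        }
    }

    /** Runs both steps in the given order and reports whether the senior's data is still present. */
    static boolean dataSurvives(boolean seniorSendsFirst) {
        JuniorNode junior = new JuniorNode();
        if (seniorSendsFirst) {
            junior.receiveFromSenior("route:alice", "openfire1"); // arrives before the reset
            junior.leaveAndRejoin();                               // reset discards it
        } else {
            junior.leaveAndRejoin();                               // reset happens first
            junior.receiveFromSenior("route:alice", "openfire1"); // data lands on clean state
        }
        return junior.cache.containsKey("route:alice");
    }

    public static void main(String[] args) {
        System.out.println("senior sends before reset, data survives: " + dataSurvives(true));
        System.out.println("senior sends after reset, data survives:  " + dataSurvives(false));
    }
}
```

The delay workaround amounts to forcing the second ordering by waiting long enough, which is why it helps but is not a real guarantee.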

Should the split-brain recovery solution be modified so that this resolution (the leave/join cycle) is guaranteed to have happened before the other nodes are made aware that a new node has joined? Is this even possible?

Answer 3 · 2021-08-31T14:39:07.000Z

Should the split-brain recovery solution be modified so that this resolution (the leave/join cycle) is guaranteed to have happened before the other nodes are made aware that a new node has joined?

Yes, seems sensible.

Is this even possible?

Currently, the "remote node has joined cluster" event is triggered by a Hazelcast event (the memberAdded method), over which Openfire and the Hazelcast plugin have little control. I wonder if changing that to an Openfire-specific message would help: the remote node, after joining the cluster, would send a message to all the other nodes to say "I'm here". In the case of split-brain, this wouldn't happen until after the tidy-up has happened.
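
A rough sketch of that idea, to show the ordering it would give. None of the names below (Node, onMemberAdded, onAnnounce, recoverFromSplitBrain) are Openfire or Hazelcast APIs; they are invented for illustration, and the shared timeline exists only so the relative order of events across nodes is visible.

```java
import java.util.ArrayList;
import java.util.List;

class AnnouncedJoinDemo {

    /** Hypothetical node model; not an Openfire or Hazelcast class. */
    static class Node {
        final String name;
        final List<String> timeline; // shared across nodes so cross-node ordering is visible
        final List<Node> peers = new ArrayList<>();

        Node(String name, List<String> timeline) {
            this.name = name;
            this.timeline = timeline;
        }

        /** Membership-level notification (the role memberAdded plays today): listeners are not told yet. */
        void onMemberAdded(Node other) {
            timeline.add(name + ": memberAdded(" + other.name + ") noted, listeners not notified");
        }

        /** Application-level "I'm here" message: only now are local listeners told about the new node. */
        void onAnnounce(Node other) {
            timeline.add(name + ": joinedCluster(" + other.name + ")");
        }

        /** Split-brain recovery on the to-be junior node: tidy up first, announce last. */
        void recoverFromSplitBrain() {
            timeline.add(name + ": leftCluster()");
            timeline.add(name + ": joinedCluster()");
            for (Node peer : peers) {
                peer.onAnnounce(this);
            }
        }
    }

    /** Replays the recovery sequence and returns the combined timeline of both nodes. */
    static List<String> simulateRecovery() {
        List<String> timeline = new ArrayList<>();
        Node senior = new Node("openfire1", timeline);
        Node junior = new Node("openfire2", timeline);
        junior.peers.add(senior);

        senior.onMemberAdded(junior);   // network heals; the membership layer notices the node
        junior.recoverFromSplitBrain(); // leave/join cycle, then the announcement
        return timeline;
    }

    public static void main(String[] args) {
        for (String entry : simulateRecovery()) {
            System.out.println(entry);
        }
    }
}
```

With this shape, the senior's joinedCluster callback for the remote node can only fire after the junior has finished its own leave/join cycle, so any data the senior sends in response lands on already-reset state.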