Split-brain recovery does not follow documented process
GregDThomas opened this issue · 3 comments
Steps to reproduce:
- Set up a two-node Openfire cluster. Log in to the admin console of each node and check that both nodes show both cluster members at http://localhost:9090/system-clustering.jsp
- On the junior node, disable networking (or remove the network cable)
- Confirm that, after a brief period of time, both nodes show that they are the senior member of a single-node cluster
- Re-enable/re-connect the network on the junior node.
- Wait for Hazelcast to re-establish the network.
Expected results:
- The cluster re-forms, with one senior, one junior member.
- Any ClusterEventListener on the junior member receives leftCluster() followed by joinedCluster() events - http://download.igniterealtime.org/openfire/docs/latest/documentation/javadoc/org/jivesoftware/openfire/cluster/ClusterEventListener.html#markedAsSeniorClusterMember--
Actual results:
- The cluster re-forms, but the junior node does not receive an indication that it has been demoted.
Cluster initially forms:
2019.04.23 15:02:38 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (9c528cce-d3f0-4d6a-9e0d-3fd775b542f2/openfire2.example.com) has joined the cluster
Network is disabled:
2019.04.23 15:05:08 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Another node (62a9c948-9991-4704-a323-4ec937a741cd/<unknown>) has left the cluster
2019.04.23 15:05:14 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (9c528cce-d3f0-4d6a-9e0d-3fd775b542f2/openfire2.example.com) is now the senior member
Network is re-enabled:
2019.04.23 15:07:35 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Another node (62a9c948-9991-4704-a323-4ec937a741cd/openfire1.example.com (10.215.75.172)) has joined the cluster
Sequence of events now logged as follows:
Cluster initially forms:
2019.04.24 10:17:18 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (8e97db5d-8fb7-422b-bee6-f3a61a9d38b0/openfire2.example.com) has joined the cluster [seniorMember=openfire1.example.com (10.215.75.172)]
Network is disabled:
2019.04.24 10:18:11 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Another node (d63cc58b-44a5-4b29-83f8-cf1e55540965/openfire1.example.com (10.215.75.172)) has left the cluster [seniorMember=openfire2.example.com (10.215.75.174)]
2019.04.24 10:18:11 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Sending message to admins: openfire1.example.com (10.215.75.172) has left the cluster - there is now only 1 node in the cluster (enabled=true)
2019.04.24 10:18:11 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (8e97db5d-8fb7-422b-bee6-f3a61a9d38b0/openfire2.example.com) is now the senior member
Network is re-enabled:
2019.04.24 10:22:14 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (8e97db5d-8fb7-422b-bee6-f3a61a9d38b0/openfire2.example.com) has left the cluster [seniorMember=<unknown>]
2019.04.24 10:22:14 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Sending message to admins: The local node ('openfire2.example.com') has left the cluster - this node no longer has any resilience (enabled=true)
2019.04.24 10:22:14 INFO [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (8e97db5d-8fb7-422b-bee6-f3a61a9d38b0/openfire2.example.com) has joined the cluster [seniorMember=openfire1.example.com (10.215.75.172)]
I believe this fix might have introduced an issue. When recovering from a split-brain scenario:
1. The senior member sees the to-be junior member join the cluster. The org.jivesoftware.openfire.cluster.ClusterEventListener#joinedCluster(byte[]) methods on the senior member get invoked, allowing listeners to process the fact that another member has joined the cluster.
2. The to-be junior member detects that it no longer is senior, and as a result it triggers this new code block:
logger.warn("Recovering from split-brain; firing leftCluster()/joinedCluster() events");
ClusteredCacheFactory.fireLeftClusterAndWaitToComplete(Duration.ofSeconds(30));
logger.debug("Firing joinedCluster() event");
ClusterManager.fireJoinedCluster(true);
The event in step 2 causes the ClusterEventListener#leftCluster() and ClusterEventListener#joinedCluster() event handlers to be triggered on the to-be junior member only (not on the other nodes in the cluster).
A problem arises when the senior member, based on the event in step 1, sends the to-be junior member data that arrives before step 2 has been executed. The leftCluster() and joinedCluster() invocations in step 2 are likely to 'reset' data in the to-be junior node (which is the reason for step 2 to be executed in the first place, I think). After this has occurred, the data that was already sent by the senior member is lost.
We have been trying to verify the above by introducing a 30+ second delay (which is how long step 2 can take) into the implementation that causes the senior node to send the to-be junior node its data in step 1. This is an attempt to force step 1 to happen after step 2 has finished. This (obviously very sub-optimal) fix did resolve our issues.
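To make the ordering problem concrete, here is a minimal, self-contained sketch of the two orderings of step 1 and step 2. It is not Openfire code: JuniorNode, receiveFromSenior() and leaveAndRejoin() are hypothetical stand-ins, and the sketch assumes that the leave/join cycle simply clears whatever cluster-derived data the node holds.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the race described above; none of these types exist in Openfire. */
class SplitBrainRaceSketch {

    /** Stand-in for the to-be junior node and the cluster-derived data it holds. */
    static final class JuniorNode {
        private final List<String> clusterData = new ArrayList<>();

        /** Step 1: data pushed by the senior member after its joinedCluster(byte[]) event. */
        void receiveFromSenior(String item) {
            clusterData.add(item);
        }

        /** Step 2: the leftCluster()/joinedCluster() cycle fired during split-brain recovery. */
        void leaveAndRejoin() {
            // Assumption: listeners discard cluster-derived state when leaving the cluster.
            clusterData.clear();
        }

        List<String> data() {
            return new ArrayList<>(clusterData);
        }
    }

    /** Data from step 1 arrives before step 2 runs: the reset wipes it. */
    static List<String> dataArrivesBeforeReset() {
        JuniorNode junior = new JuniorNode();
        junior.receiveFromSenior("route:alice@example.com");
        junior.leaveAndRejoin();
        return junior.data();
    }

    /** Step 2 completes first (what the 30+ second delay forces): the data survives. */
    static List<String> dataArrivesAfterReset() {
        JuniorNode junior = new JuniorNode();
        junior.leaveAndRejoin();
        junior.receiveFromSenior("route:alice@example.com");
        return junior.data();
    }

    public static void main(String[] args) {
        System.out.println("data sent before reset: " + dataArrivesBeforeReset());
        System.out.println("data sent after reset:  " + dataArrivesAfterReset());
    }
}
```

The first ordering leaves the junior node with no data, the second leaves it with the item the senior sent. The delay workaround only makes the second ordering likely; it does not guarantee it.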
Should the split-brain recovery solution be modified so that this resolution (the leave/join cycle) is guaranteed to have happened before the other nodes are made aware that a new node joined? Is this even possible?
Should the split-brain recovery solution be modified so that this resolution (the leave/join cycle) is guaranteed to have happened before the other nodes are made aware that a new node joined?
Yes, seems sensible.
Is this even possible?
Currently, the "remote node has joined cluster" event is triggered by a Hazelcast event (the memberAdded method) - over which Openfire/HZ plugin has little control. I wonder if changing that to an Openfire-specific message would help; the remote node, after joining the cluster, would send a message to all the other nodes to say "I'm here". In the case of split-brain, this wouldn't happen until after the tidy-up has happened.
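A rough sketch of how that "I'm here" idea might be ordered, purely as an illustration. Node, onMemberAdded(), onAnnounce() and recoverFromSplitBrain() are hypothetical names, not existing Openfire or Hazelcast plugin APIs, and the calls are plain in-process method calls rather than real cluster messages.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of announcing a node only after its tidy-up; not Openfire code. */
class AnnounceAfterTidyUpSketch {

    static final class Node {
        final String name;
        final List<String> received = new ArrayList<>();
        final List<String> log; // shared log, so the overall event order is visible

        Node(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        /** Existing node: low-level membership change. Deliberately sends no data yet. */
        void onMemberAdded(Node other) {
            log.add(name + ": memberAdded(" + other.name + ") seen, no data sent");
        }

        /** Existing node: application-level "I'm here" message. Data is sent only now. */
        void onAnnounce(Node joiner) {
            log.add(name + ": announce from " + joiner.name + ", sending data");
            joiner.received.add("state-from-" + name);
        }

        /** Rejoining node: finish the leave/join tidy-up first, then announce to peers. */
        void recoverFromSplitBrain(List<Node> peers) {
            received.clear(); // stands in for the leftCluster()/joinedCluster() cycle
            log.add(name + ": tidy-up complete");
            for (Node peer : peers) {
                peer.onAnnounce(this);
            }
        }
    }

    /** Runs the recovery sequence and returns the rejoining (junior) node. */
    static Node simulate() {
        List<String> log = new ArrayList<>();
        Node senior = new Node("openfire1", log);
        Node junior = new Node("openfire2", log);
        senior.onMemberAdded(junior);
        junior.recoverFromSplitBrain(Collections.singletonList(senior));
        return junior;
    }

    public static void main(String[] args) {
        Node junior = simulate();
        junior.log.forEach(System.out::println);
        System.out.println("junior holds: " + junior.received);
    }
}
```

Because the senior only pushes data in response to the announcement, and the announcement is sent after the tidy-up, the data can no longer be wiped by the leave/join cycle in this model. A real implementation would still have to cope with a lost or delayed announcement.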