igniterealtime/openfire-hazelcast-plugin

Split brain recovery 'too fast'

evdherberg opened this issue · 4 comments

After

Recovering from split-brain; firing leftCluster()/joinedCluster() events

the ClusterListener proceeds to fireLeftClusterAndWaitToComplete. This is not guaranteed to wait for the leftCluster handling on other nodes, which possibly results in interleaved execution of the various leftCluster/joinedCluster handlers.

```
2021.10.14 09:56:41 WARN  [hz.openfire.event-2]: org.jivesoftware.openfire.plugin.util.cache.ClusterListener - Recovering from split-brain; firing leftCluster()/joinedCluster() events
2021.10.14 09:56:41 DEBUG [hz.openfire.priority-generic-operation.thread-0]: com.hazelcast.internal.partition.InternalPartitionService - [172.60.0.20]:5701 [openfire] [3.12.5] Applied partition state update with version: 1808
2021.10.14 09:56:41 INFO  [hz.openfire.cached.thread-4]: com.hazelcast.nio.tcp.TcpIpConnector - [172.60.0.20]:5701 [openfire] [3.12.5] Connecting to /172.60.0.10:5701, timeout: 10000, bind-any: true
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-4]: com.hazelcast.nio.tcp.TcpIpEndpointManager - [172.60.0.20]:5701 [openfire] [3.12.5] Established socket connection between /172.60.0.20:40609 and /172.60.0.10:5701
2021.10.14 09:56:41 INFO  [hz.openfire.IO.thread-in-0]: com.hazelcast.nio.tcp.TcpIpConnection - [172.60.0.20]:5701 [openfire] [3.12.5] Initialized new cluster connection between /172.60.0.20:5701 and /172.60.0.10:55053
2021.10.14 09:56:41 DEBUG [hz.openfire.event-2]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Firing leftCluster() event
2021.10.14 09:56:41 INFO  [hz.openfire.event-2]: org.jivesoftware.openfire.cluster.ClusterManager - Firing left cluster event for this node
2021.10.14 09:56:41 DEBUG [hz.openfire.event-2]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Waiting for leftCluster() event to be called [timeout=30 seconds]
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-2]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'join' presence stanzas on the local cluster node for 0 occupant(s) of cluster node b6a7aa53-e796-4d23-b2b1-b80650fe7181 that is now part of our cluster.
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-2]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'join' presence stanzas on the local cluster node for 0 occupant(s) of cluster node b6a7aa53-e796-4d23-b2b1-b80650fe7181 that is now part of our cluster.
2021.10.14 09:56:41 INFO  [ClusterManager events dispatcher]: org.jivesoftware.util.cache.CacheFactory - Clustering stopped; cache migration complete
2021.10.14 09:56:41 INFO  [hz.openfire.IO.thread-in-1]: com.hazelcast.nio.tcp.TcpIpConnection - [172.60.0.20]:5701 [openfire] [3.12.5] Initialized new cluster connection between /172.60.0.20:40609 and /172.60.0.10:5701
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Restoring cache content for cache 'Routing Servers Cache' by adding all outgoing server routes that are connected to the local cluster node.
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Restoring cache content for cache 'Routing Components Cache' by adding all component routes that are connected to the local cluster node.
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Restoring cache content for cache 'Routing Users Cache', 'Routing AnonymousUsers Cache' and 'Routing User Sessions' by adding all client routes that are connected to the local cluster node.
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Adding client route user2@xmpp.localhost.example/jsxc-fc5ea58c
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.spi.RoutingTableImpl - Adding client route ClientRoute{nodeID=a7206884-1c97-4915-a969-68d5c33dce18, available=true} to users cache under key user2@xmpp.localhost.example/jsxc-fc5ea58c
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.OccupantManager - Replaced occupants with arguments null Occupant{roomName='muc1', nickname='user1', realJID=user1@xmpp.localhost.example/jsxc-7d2c6fd3, lastActive=2021-10-14T09:56:41.642849Z} 73230fd6-f29d-488e-aae4-775607a41246
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'join' presence stanzas on the local cluster node for 1 occupant(s) of cluster node 73230fd6-f29d-488e-aae4-775607a41246 that is now part of our cluster.
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.MUCPersistenceManager - Attempting to load room 'muc1' from the database.
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.OccupantManager - Replaced occupants with arguments Occupant{roomName='muc1', nickname='user1', realJID=user1@xmpp.localhost.example/jsxc-7d2c6fd3, lastActive=2021-10-14T09:56:41.642849Z} null 73230fd6-f29d-488e-aae4-775607a41246
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.OccupantManager - Replaced occupants with arguments null Occupant{roomName='muc1', nickname='user1', realJID=user1@xmpp.localhost.example/jsxc-7d2c6fd3, lastActive=2021-10-14T09:56:41.652447Z} 73230fd6-f29d-488e-aae4-775607a41246
2021.10.14 09:56:41 INFO  [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.OccupantManager - We received a copy of local MUC occupants from node 73230fd6-f29d-488e-aae4-775607a41246, but we already had this information. This hints at a possible inefficient sharing of data across the cluster.
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'join' presence stanzas on the local cluster node for 1 occupant(s) of cluster node 73230fd6-f29d-488e-aae4-775607a41246 that is now part of our cluster.
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Cluster event: service conference.xmpp.localhost.example left a cluster - going to restore 0 rooms
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.OccupantManager - Reset occupants because we left the cluster
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Restoring cache content for cache 'MUC Service 'conference' Rooms' after we left the cluster, by adding all MUC Rooms that are known to the local node.
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.group.GroupJID - Parsing JID from string: admin@xmpp.localhost.example
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.FMUCHandler - (room: 'muc1@conference.xmpp.localhost.example'): Changing outbound join configuration. Existing: null, New: null
2021.10.14 09:56:41 WARN  [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - A remote cluster node (73230fd6-f29d-488e-aae4-775607a41246) tells us that user user1@xmpp.localhost.example/jsxc-7d2c6fd3 is supposed to be an occupant (using nickname 'user1') of a room named 'muc1' but the data in the cluster cache does not indicate that this is true.
2021.10.14 09:56:41 WARN  [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - A remote cluster node (73230fd6-f29d-488e-aae4-775607a41246) tells us that user user1@xmpp.localhost.example/jsxc-7d2c6fd3 is supposed to be an occupant (using nickname 'user1') of a room named 'muc1' but the data in the cluster cache does not indicate that this is true.
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Finished sending 'join' presence stanzas on the local cluster node.
2021.10.14 09:56:41 DEBUG [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Finished sending 'join' presence stanzas on the local cluster node.
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'leave' presence stanzas on the local cluster node for 1 occupant(s) of one or more cluster nodes that are no longer part of our cluster.
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Stanza now being sent: <presence type="unavailable" to="user2@xmpp.localhost.example/jsxc-fc5ea58c" from="muc1@conference.xmpp.localhost.example/user1"><x xmlns="http://jabber.org/protocol/muc#user"><item role="none" jid="user1@xmpp.localhost.example/jsxc-7d2c6fd3"/></x></presence>
2021.10.14 09:56:41 WARN  [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Unable to find MUCRole for recipient 'user2@xmpp.localhost.example/jsxc-fc5ea58c' in room muc1@conference.xmpp.localhost.example while broadcasting 'leave' presence for occupants on disconnected cluster node(s).
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Finished sending 'leave' presence stanzas on the local cluster node.
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Cluster event: service conference.xmpp.localhost.example left a cluster - going to restore 2 rooms
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.OccupantManager - Reset occupants because we left the cluster
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Restoring cache content for cache 'MUC Service 'conference' Rooms' after we left the cluster, by adding all MUC Rooms that are known to the local node.
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Going to send 'leave' presence stanzas on the local cluster node for 0 occupant(s) of one or more cluster nodes that are no longer part of our cluster.
2021.10.14 09:56:41 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - This node (a7206884-1c97-4915-a969-68d5c33dce18/xmpp2.localhost.example) has left the cluster
2021.10.14 09:56:41 INFO  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterMonitor - Sending message to admins: The local node ('xmpp2.localhost.example') has left the cluster - this node no longer has any resilience (enabled=true)
2021.10.14 09:56:41 DEBUG [TaskEngine-pool-12]: org.jivesoftware.openfire.XMPPServer - Sending message to admin [jid=admin@xmpp.localhost.example, message=The local node ('xmpp2.localhost.example') has left the cluster - this node no longer has any resilience]
2021.10.14 09:56:41 DEBUG [TaskEngine-pool-12]: org.jivesoftware.openfire.spi.RoutingTableImpl - Unable to route packet. No session is available so store offline. <message from="xmpp.localhost.example" to="admin@xmpp.localhost.example"><body>The local node ('xmpp2.localhost.example') has left the cluster - this node no longer has any resilience</body></message>
2021.10.14 09:56:41 DEBUG [TaskEngine-pool-12]: org.jivesoftware.openfire.spi.RoutingTableImpl - Failed to route packet to JID: admin@xmpp.localhost.example packet: <message from="xmpp.localhost.example" to="admin@xmpp.localhost.example"><body>The local node ('xmpp2.localhost.example') has left the cluster - this node no longer has any resilience</body></message>
2021.10.14 09:56:41 DEBUG [TaskEngine-pool-12]: org.jivesoftware.openfire.MessageRouter - Message sent to unreachable address: <message from="xmpp.localhost.example" to="admin@xmpp.localhost.example"><body>The local node ('xmpp2.localhost.example') has left the cluster - this node no longer has any resilience</body></message>
2021.10.14 09:56:41 DEBUG [ClusterManager events dispatcher]: org.jivesoftware.util.cache.CacheFactory - CacheFactory is not listening for cluster events anymore because it left the cluster
2021.10.14 09:56:41 DEBUG [hz.openfire.event-2]: org.jivesoftware.openfire.plugin.util.cache.ClusterListener - Firing joinedCluster() event
```

Since creating this issue, Emiel and I have found that the logs that we used to illustrate this bug are actually caused by a different issue (namely: igniterealtime/Openfire#1886). However, I believe that the issue description itself might still be valid: the invocation of the 'left' event that is part of the 'leave-and-rejoin' dance of the split-brain recovery mechanism might have kicked off asynchronous processes (maybe on the other nodes?) that are still executing by the time the 'join' part of that dance starts. This has the potential to cause loads of confusion.
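To make the suspected race a bit more concrete, here is a minimal conceptual sketch (not the actual plugin code; the class, method names and latch are purely illustrative). The point is that the recovery path only synchronizes with the local leftCluster() handling before it moves on to the 'join' part:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Purely illustrative: mimics the shape of the recovery path, not the plugin's real implementation.
public class SplitBrainRecoverySketch {

    private final CountDownLatch localLeftClusterHandled = new CountDownLatch(1);

    // Hypothetical callback, invoked when *this* node's leftCluster() handling completes.
    public void onLocalLeftClusterHandled() {
        localLeftClusterHandled.countDown();
    }

    public void recoverFromSplitBrain() throws InterruptedException {
        fireLeftCluster(); // starts asynchronous leftCluster() handling, locally and on remote nodes

        // Only the *local* handling is awaited here; nothing synchronizes with the other nodes.
        localLeftClusterHandled.await(30, TimeUnit.SECONDS);

        fireJoinedCluster(); // may now interleave with remote leftCluster() work that is still in flight
    }

    // Placeholders standing in for ClusterManager.fireLeftCluster() / fireJoinedCluster(true).
    private void fireLeftCluster() { }
    private void fireJoinedCluster() { }
}
```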

Basically we continue the discussion of #23. Specifically, from @GregDThomas:

Currently, the "remote node has joined cluster" event is triggered by a Hazelcast event (the memberAdded method) - over which Openfire/HZ plugin has little control. I wonder if changing that to an Openfire specific message would help; the remote node, after joining the cluster, would send a message to all the other nodes to say "I'm here". In the case of split-brain, this wouldn't happen until after the tidy-up has happened.

I can imagine we stop doing "other node joined the cluster" handling based on the Hazelcast memberAdded event. So no more ClusterManager.fireJoinedCluster(nodeID.toByteArray(), true) from the plugin ClusterListener.

When a node joins the cluster, it first performs all of its own local joinedCluster handling. Only after that is done does the plugin announce "hey, I am a new node and I joined" to the other nodes. For this, we create a new ClusterTask that triggers the ClusterManager.fireJoinedCluster(nodeID.toByteArray(), true) event after all.

This way we can hopefully guarantee that, in the split-brain scenario, other nodes will not start sending us their occupants before we have finished clearing out the old occupant registrations.
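A rough sketch of what such a task could look like (the class name and the serialization details are made up for illustration; ClusterTask, CacheFactory.doClusterTask(...) and ClusterManager.fireJoinedCluster(byte[], boolean) are the existing Openfire APIs I have in mind):

```java
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

import org.jivesoftware.openfire.cluster.ClusterManager;
import org.jivesoftware.util.cache.ClusterTask;

// Hypothetical task: broadcast by a newly joined node *after* it has finished its own
// local joinedCluster() handling, instead of relying on the Hazelcast memberAdded event.
public class NewNodeJoinedTask implements ClusterTask<Void> {

    private byte[] originatingNodeID;

    public NewNodeJoinedTask() {
        // No-arg constructor required for deserialization.
    }

    public NewNodeJoinedTask(final byte[] originatingNodeID) {
        this.originatingNodeID = originatingNodeID;
    }

    @Override
    public void run() {
        // Executed on every *other* cluster node: only now do they act on the new node having joined.
        ClusterManager.fireJoinedCluster(originatingNodeID, true);
    }

    @Override
    public Void getResult() {
        return null;
    }

    @Override
    public void writeExternal(final ObjectOutput out) throws IOException {
        out.writeInt(originatingNodeID.length);
        out.write(originatingNodeID);
    }

    @Override
    public void readExternal(final ObjectInput in) throws IOException, ClassNotFoundException {
        originatingNodeID = new byte[in.readInt()];
        in.readFully(originatingNodeID);
    }
}
```

The joining node would then, as the very last step of its own joinedCluster handling, do something like CacheFactory.doClusterTask(new NewNodeJoinedTask(XMPPServer.getInstance().getNodeID().toByteArray())) to announce itself to the rest of the cluster.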

Does this sound like a feasible approach?

Something along those lines seems sensible to me. @GregDThomas thoughts on this?

Some smaller concerns:

  • do we need to wait for the local member to have finished all of its joinedCluster() handling? That might introduce lengthy delays. If that's needed, so be it - but maybe it's enough to wait for split-brain resolution?
  • does this mechanism guarantee that we'll have exactly one (no more, no less) "senior cluster node" again when joinedCluster() and joinedCluster(byte[] nodeId) are fired? This is currently not the case, which is annoying.
  • does it make sense to include another new call that reports back (to ... the joining node?) that we've finished processing all join code (and that the cluster is therefore deemed stable again)?

Yes, this all seems to hang together. Looking at @guusdk's questions:

do we need to wait for the local member to have finished all of its joinedCluster() handling?

I suspect not. As you say, it should be sufficient to do the bare minimum of processing, before declaring availability. The hard part will be defining the minimum!

does this mechanism guarantee that we'll have exactly one (no more, no less) "senior cluster node" again

I think not. By my thinking, in a split brain, both nodes are "senior", just in two different clusters. When the two clusters merge, both will remain senior until one leaves the cluster and re-joins as junior.

Though having written that, there may be a way around it. When the cluster re-forms, both Node1 and Node2 think they are senior. When Node2 sees Node1 joining, it could assume (as it's senior) that Node1 is junior. Likewise, when Node1 sees Node2, it assumes that Node2 is junior. So both nodes have exactly one "senior" node - they just don't agree. This is only transitory, though. Immediately, the split-brain processing kicks in. Node2 (say) leaves the cluster (becomes unclustered). Node1 remains the senior - and now only - node. Node2 rejoins the cluster, and everyone agrees that Node1 is senior and Node2 is junior.
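For what it's worth, a minimal sketch of how a node could decide that it considers itself senior, assuming the Hazelcast 3.x convention that getCluster().getMembers() iterates over members oldest-first (class name is hypothetical). It also illustrates why both sides of a split brain answer 'yes' until the merge has completed:

```java
import java.util.Iterator;

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.Member;

public final class SeniorityCheck {

    private SeniorityCheck() {
    }

    // During a split brain, each sub-cluster has its own "oldest" member, so the local
    // node on either side can legitimately return true here until the merge (and the
    // subsequent leave/re-join) has settled.
    public static boolean isSeniorClusterMember(final HazelcastInstance hazelcast) {
        final Iterator<Member> members = hazelcast.getCluster().getMembers().iterator();
        return members.hasNext() && members.next().localMember();
    }
}
```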

does it make sense to include another new call that reports back

I can't see what value that would add, but maybe there is some I can't see right now.