igniterealtime/openfire-hazelcast-plugin

Cleanup should not start before event handlers have finished

guusdk opened this issue · 1 comments

When the plugin detects that a remote cluster node leaves the cluster, it fires off an event to reflect this (which invokes leftCluster(nodeId)) on all listeners. The plugin then cleans up things. The corresponding code can mostly be found in

// Trigger event that a node left the cluster
ClusterManager.fireLeftCluster(nodeID.toByteArray());
// Clean up directed presences sent from entities hosted in the leaving node to local entities
// Clean up directed presences sent to entities hosted in the leaving node from local entities
cleanupDirectedPresences(nodeID);
if (!seniorClusterMember && isSeniorClusterMember()) {
seniorClusterMember = true;
ClusterManager.fireMarkedAsSeniorClusterMember();
}
cleanupNode(nodeID);
// Remove traces of directed presences sent from local entities to handlers that no longer exist.
// At this point c2s sessions are gone from the routing table so we can identify expired sessions
XMPPServer.getInstance().getPresenceUpdateHandler().removedExpiredPresences();

There is a race condition here, that is caused by the invocation of the event listeners (in the first line) to be asynchronous. It causes the execution of that invocation to be parallel to the execution of the cleanup. This can cause problems for implementations of the event listener, if they operate on the same data. As the clean-up modifies pretty low-level state (routing table, session manager), the chance of indirect interference is pretty high.

It would probably be good to separate the two, to make sure that there's no interference. I assume (but haven not verified) that the existing order of execution (first event listeners, then cleanup) is OK (although maybe it's worth investigating if that needs reversal). In any case, the first should have had time to finish execution before the second is started. Some kind of callback implementation could be used to achieve this.

This probably isn't needed if we fix #50 as the callback would be used to invoke code that has by then be migrated to Openfire.