igniterealtime/openfire-hazelcast-plugin

Cluster split implementation expects cached data to be fully replicated.

guusdk opened this issue · 16 comments

This plugin introduces an implementation of org.jivesoftware.util.cache.Cache<K,V> that is backed by a Hazelcast-provided distributed data structure. Much of the clustering-related functionality depends on data in caches of this implementation being shared amongst cluster nodes. For example: one node puts something in the cache, and another node can access that data from a locally instantiated cache with the same name. The data is synchronized behind the scenes by Hazelcast.
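
To illustrate the contract (a rough sketch, not actual Openfire code; the cache name, key and value are made up):

```java
import org.jivesoftware.util.cache.Cache;
import org.jivesoftware.util.cache.CacheFactory;

// On cluster node A: store an entry in a named, clustered cache. When clustering is
// enabled, CacheFactory returns the Hazelcast-backed Cache implementation.
Cache<String, String> cacheOnNodeA = CacheFactory.createCache("Example Cache");
cacheOnNodeA.put("user@example.org/resource", "node-A");

// On cluster node B: a locally instantiated cache with the same name exposes the entry
// written by node A - Hazelcast fetches it transparently if it isn't held locally.
Cache<String, String> cacheOnNodeB = CacheFactory.createCache("Example Cache");
String value = cacheOnNodeB.get("user@example.org/resource"); // "node-A"
```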

Pretty much all caches, as well as the default, as defined in https://github.com/igniterealtime/openfire-hazelcast-plugin/blob/master/classes/hazelcast-cache-config.xml, use a Hazelcast Map as the data structure that backs our Cache implementation. All of them seem to use a backup-count of 1. This means that data added by a node will be replicated to one other node. This is largely hidden from all usages of the Cache, as the data will be available on all nodes, even if the cluster is larger than 2 nodes: nodes that access data they do not hold locally will obtain it through Hazelcast-provided magic.
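
Expressed programmatically, the equivalent of that configuration is roughly the following (illustrative sketch only; the plugin configures this through the XML file):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;

// backup-count of 1: each entry lives on its owner node plus exactly one backup node,
// regardless of how many nodes the cluster contains.
Config config = new Config();
MapConfig mapConfig = config.getMapConfig("default");
mapConfig.setBackupCount(1);      // one synchronous backup
mapConfig.setAsyncBackupCount(0); // no additional asynchronous backups
```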

Although during normal run-time the data is accessible to all cluster nodes (as described above), there does not seem to be a guarantee that all data is, at all times, readily available on every cluster node when the cluster is larger than two nodes. It is probably reasonable to assume that in larger clusters this is almost guaranteed not to be the case, as by default the data will live on one node and be backed up to another. I'm not sure whether there are at times more copies, but it's probably safe to assume that Hazelcast will eventually retain data on only two nodes.

Much of the code in Openfire that is executed when a cluster node drops out of the cluster (notably, implementations of org.jivesoftware.openfire.cluster.ClusterEventListener) is written with the expectation that all data is available on the local cluster node, for at least a number of (and possibly all) caches. This seems to be an error.

To illustrate: the following was observed (repeatedly) during testing. On a three-node cluster, with a different client/user connected to each node, each subscribed to the presence of all other users, the senior node was disconnected from the cluster (I'm unsure if seniority is important). It is important to realize that at that point, Hazelcast won't be able to look up cache entries 'online'. Whatever it has on the local node is all it can work with - all other data is lost. The expected behavior would be that the client connected to the local node receives a presence unavailable for its two contacts, by virtue of the routing table being cleaned up, having recognized that those two routes are no longer available. In practice, we did not always see this happen. Often, we'd get a presence unavailable for only one contact, instead of both.

We believe that what's going on here is that the disconnected node iterates over (or otherwise depends on - see Remark A below) the routing table to send the presence unavailables for the clients on the now unreachable cluster nodes. As there is no guarantee that all data exists on all cluster nodes, this might go wrong.

(I've provided a simplified description of the scenario that was used during testing: the actual test scenario involved MUC. The 'offline' presence that is expected would be picked up by the conference service, to be broadcast to all local occupants. I believe that this nuance is not important.)

A confusing characteristic of the implementation is that there seems to be overlap in the implementation of org.jivesoftware.openfire.plugin.util.cache.ClusterListener in the Hazelcast plugin (notably its cleanup routines) and the implementation of the org.jivesoftware.openfire.cluster.ClusterEventListener interface in various parts of the Openfire code base (such as RoutingTableImpl).

The issue described here will probably not be very apparent in a test environment that does not consist of at least three cluster nodes. The default backup count of 1 will effectively cause replication "to the entire cluster" if the cluster consists of no more than two nodes.

Remark A: While penning this text, I have started to wonder if the problem described above (not guaranteed to have cluster data after a sudden disconnect) is what the lookupJIDList method in Hazelcast's ClusterListener class is trying to work around. That implementation is using a Hazelcast EntryListener (as implemented in the S2SCacheListener in that same ClusterListener definition) to keep track of all JIDs in many caches. If these EntryListeners get fired whenever data is modified on any cluster node, then this could ensure that at least some data-modification (it only seems to store JIDs) is guaranteed to be registered on all nodes as soon as that occurs on any node. If that's done synchronously, even better, but I have not checked yet. With this, and having identified the apparent duplication of code in Hazelcast's ClusterListener and some Openfire implementations of ClusterEventListener, combined with the fact that for some Caches, it'd make sense to me to update them 'last' (or at the very least in a predictable order), I wonder if a sensible course of action would be to remove code from Openfire, and have (only) the Hazelcast plugin be responsible for things like maintaining the RoutingTable state.
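
To make Remark A more concrete: the pattern that I'm referring to is roughly the following (a simplified sketch, not the actual S2SCacheListener code; the class, cache name and types are illustrative, using Hazelcast 4.x imports):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import com.hazelcast.map.listener.EntryAddedListener;
import com.hazelcast.map.listener.EntryRemovedListener;

public class LocalJidTracker {

    // Purely local set of keys (JIDs) that mirrors the clustered map, so that this
    // information remains available even after a sudden disconnect from the cluster.
    private final Set<String> locallyKnownJids = ConcurrentHashMap.newKeySet();

    public LocalJidTracker(final HazelcastInstance hazelcast, final String cacheName) {
        final IMap<String, byte[]> clusteredCache = hazelcast.getMap(cacheName);
        clusteredCache.addEntryListener(
            (EntryAddedListener<String, byte[]>) event -> locallyKnownJids.add(event.getKey()), false);
        clusteredCache.addEntryListener(
            (EntryRemovedListener<String, byte[]>) event -> locallyKnownJids.remove(event.getKey()), false);
    }

    public Set<String> getLocallyKnownJids() {
        return locallyKnownJids;
    }
}
```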

Time constraints made me press 'submit' before I had time to review my text. I'll do that later. If there are any inconsistencies, my apologies. I've now updated the text. It might still be the result of me rambling, but at least, it's less inconsistent. :)

If we do choose to change things around, then we might find help in using a different Hazelcast distributed data structure than the Map that we're now using everywhere. Maybe some caches should instead be using a ReplicatedMap? As opposed to Maps, ReplicatedMaps will make sure that each cluster node has the data. There are some obvious trade-offs, of course (see the description in the link).
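
Usage-wise, the swap looks deceptively simple (a sketch with made-up names; whether our Cache wrapper can be retrofitted this easily is exactly the question):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.replicatedmap.ReplicatedMap;

// A ReplicatedMap keeps a full copy of every entry on every cluster member, so a node that
// suddenly drops out of the cluster still holds all of the data locally. The trade-offs:
// replication is asynchronous, and (unlike an IMap) there is no per-entry locking.
HazelcastInstance hazelcast = Hazelcast.newHazelcastInstance();
ReplicatedMap<String, String> routes = hazelcast.getReplicatedMap("Example Routing Cache");
routes.put("user@example.org/resource", "node-A");
String owner = routes.get("user@example.org/resource"); // served from the local replica
```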

We might want to consider using a ReplicatedMap instead of a Map to replace the lookupJIDList-based fix. This would probably introduce a change in the order of events, as replication of a ReplicatedMap is documented to be asynchronous. The lookupJIDList-based fix depends on Hazelcast's EntryEventListener. These are documented to 'happen after' the operation that triggers the event, but there's also a warning in the documentation that reads: "Listeners have to offload all blocking operations to another thread (pool)." which suggests that the events themselves are fired from a shared thread. I'm not sure if that means they're synchronous with the invocation that changed the underlying map itself.

I wonder if the unexpected behavior that I described in the issue description is a result of there being duplicated code in RoutingTableImpl, which causes the code in this plugin that performs the cleanup to fail.

If that's the case, then this issue should probably be moved to Openfire.

Yes, it does seem that with a three-node cluster, not every node will know about all data, which means that if a node drops out of the cluster, problems will arise.

A ReplicatedMap does look like the right structure to use (I assume (hah!) it should be a simple swap-out).

Re. your comment about duplicated code in RoutingTableImpl: I've long thought that there is too much logic in the clustering plugin. IMHO the plugin should solely deal with the artefacts of clustering - and rely on core Openfire to sort out what to do with data when a node joins/leaves. In fact ... #50, which has its first birthday tomorrow!

@GregDThomas any thoughts on using ReplicatedMap vs "manually" ensuring that the required data is available on all nodes, by registering EntryEventListener instances (like the Hazelcast plugin does now)?

I'm generally of the approach "If it ain't broke, don't fix it". If the EntryEventListener mechanism is working, then leave it, but it does sound like it's replicating what ReplicatedMap achieves out of the box. I note that ReplicatedMap is tagged @since 3.2, which, looking at the change log, means it wasn't around for nearly half of the plugin's lifetime - probably why it was never used in the first place.

Not using EntryEventListener probably reduces the amount of code that we have to maintain ourselves (the implementation in the Hazelcast plugin is quite hard to understand), but the listener-based approach does lend itself to maintaining only the state that we're interested in: the Hazelcast plugin only keeps track of JIDs. The caches often contain more data, which would require additional resources to keep in sync on all nodes. Then again, Hazelcast probably does a more optimized job of that than we can do ourselves?

A concern that I have is that timing might be different. I think that with EntryEventListener, we can be fairly certain that an event is triggered as soon as the cache entry is updated on a node. The ReplicatedMap might be more asynchronous. In theory, this could lead to a scenario where:

  • something gets sent to a locally connected client, based on some kind of action occurring on a remote cluster node;
  • the corresponding remote cluster node having added corresponding state to a cache;
  • the cache replication not having occurred before a cluster disconnect occurs;
  • the local (now disconnected) cluster node not being able to 'clean up', because it doesn't see the relevant state in the cache.

That said:

  • There's probably also a chance of this happening with EntryEventListener-based approaches.
  • In case of a cluster failing, to what extent can we guarantee anything in the first place?

I think we need to recognise that if a node leaves a cluster (split brain or not) there is a risk that some data will be lost. There is always a delay between (say) a user logging in, the local node updating its local data, and that data being replicated across the other nodes. The amount of data lost depends on exactly when the problem occurs.

Split-brain is an interesting one. The false-senior node has, and will always have, the wrong view of data from the rest of the cluster. I don't think there's an easy way to recover from that (even with ReplicatedMap - what happens if the data held by the false-senior is inconsistent with the rest of the cluster? Someone has to be wrong). Hence #51, which basically prevents the situation from happening (the 'false-senior' node essentially disconnects all clients, throws away all its data, and waits until it rejoins the cluster).

Split-brain: One headache at a time please :)

Moving 'cleanup' from Hazelcast plugin to Openfire makes a lot of sense to me.

Another concern that I have: if we explicitly ask for a cache entry (maybe under lock), would that guarantee that the ReplicatedMap is updated with the latest data for that entry, or would the local node get old state?


https://docs.hazelcast.com/imdg/4.2/data-structures/replicated-map.html says

the Replicated Map behaves like an eventually consistent system

which sounds to me like, if node 1 writes and node 2 reads after node 1 writes, there will be a period of time before node 2 sees what node 1 has written. I suspect IMap is the same, tbh. Sounds like you're pushing clustering harder than it has been pushed before!

Let me try to explain my concern differently: a ReplicatedMap (which does not inherit from IMap) does not seem to support locking at all. In Openfire code, we almost exclusively obtain a lock on a cache entry before 'doing anything', to try to guarantee consistency between the actions taken and the cache state. I think that with ReplicatedMap, this is impossible?
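
The pattern that I'm referring to is, roughly (a sketch of typical usage; the cache name and key are made up):

```java
import java.util.concurrent.locks.Lock;
import org.jivesoftware.util.cache.Cache;
import org.jivesoftware.util.cache.CacheFactory;

// Typical Openfire usage: take a cluster-wide lock on the cache entry, then read,
// modify and write it back, while no other thread or node can touch the same entry.
Cache<String, String> cache = CacheFactory.createCache("Example Cache");
Lock lock = CacheFactory.getLock("user@example.org", cache);
lock.lock();
try {
    String value = cache.get("user@example.org");
    cache.put("user@example.org", value == null ? "initial" : value + "-updated");
} finally {
    lock.unlock();
}
```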

Ah ISWYM. Yes, with no lock on a ReplicatedMap the current lock/do stuff/unlock mechanism would require thinking about.

First thought:
Have an IMap just for locks.
Cache.lock(key) calls lockCache.lock(replicatedMap.name + "#" + key)
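
Something along these lines (a very rough sketch of the idea; all names invented):

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import com.hazelcast.replicatedmap.ReplicatedMap;

// Sketch: keep the data in a ReplicatedMap, but take locks on a separate IMap, keyed by
// "<map name>#<entry key>", since a ReplicatedMap offers no locking of its own.
public class LockableReplicatedCache {
    private final ReplicatedMap<String, String> data;
    private final IMap<String, Object> locks;

    public LockableReplicatedCache(final HazelcastInstance hazelcast, final String name) {
        this.data = hazelcast.getReplicatedMap(name);
        this.locks = hazelcast.getMap("cache-locks");
    }

    public void lock(final String key) {
        locks.lock(data.getName() + "#" + key);
    }

    public void unlock(final String key) {
        locks.unlock(data.getName() + "#" + key);
    }
}
```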

I don't think that there's any guarantee that replication in the cache would have actually happened before the lock has been released (and/or before another thread/node acquires a lock for the same key again).

Oh, good point. Perhaps we're stuck with an IMap that is "manually" fully populated using a listener. Or perhaps more simply, increase the IMap backup-count from 1 to its maximum value of 6 (assuming a cluster has fewer than 7 nodes).

I'm not sure that the backup-count does what we expect.
https://docs.hazelcast.com/imdg/3.12/data-structures/map.html#enabling-backup-reads
This implies that we need to do something explicit in order to read from another node's backup. I suspect this is different from reading the DistributedObject.
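
For reference, the setting that section describes looks roughly like this (illustrative; as noted there, it only helps on members that actually hold a backup copy of the entry):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;

// Raising the backup count only creates more copies; serving reads from those copies
// appears to additionally require enabling read-backup-data on the map.
Config config = new Config();
MapConfig mapConfig = config.getMapConfig("default");
mapConfig.setBackupCount(6);       // maximum number of synchronous backups
mapConfig.setReadBackupData(true); // allow get() to be answered from a local backup copy
```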

Assuming that a cluster doesn't have more than 6 nodes isn't a hard requirement that I'd like to introduce.

Replacing the entire Cache with a solution that fully depends on the MapEntryListener mechanism feels like applying too much duct tape. We would effectively be building a distributed data structure from Hazelcast events, even though Hazelcast almost explicitly doesn't offer that implementation. I fear that they're not offering it for a good reason, one that we'd find out about the hard way. Also (and more concretely) I fear that very similar synchronization issues could exist. Although I've not verified this, I'd not be surprised if the event listeners are fired outside of any lock that's being held.

@Fishbowler is probably completely right with regard to backups. We thought we'd seen improvements by moving to ReplicatedMap in a test setup, but it turns out that we didn't actually change to ReplicatedMap at all: changing the type of the data structure configuration doesn't mean that the data structure that is used changes type - the code still instantiates the cache as an IMap, not a ReplicatedMap instance. The only thing that we changed was to remove the configuration of the Map, replacing it with configuration for a ReplicatedMap by the same name - configuration that isn't actually being used.

Changing to ReplicatedMap would require implementation changes, which would either require all caches to switch type, or require additional API to let Map-based and ReplicatedMap-based caches coexist.

All in all, I think the most pragmatic option is to move forward with what we have now (Hazelcast Maps, combined with selected bits of the cache being stored in a local data structure that's populated with an EntryListener). That seems like the best solution to me for now, together with moving that logic from this plugin to Openfire, to reduce duplication (#50).