Failure to recover from cluster restart
GregDThomas opened this issue · 3 comments
[Transferred from https://issues.igniterealtime.org/projects/HZ/issues/HZ-3]
As reported in https://discourse.igniterealtime.org/t/openfire-cluster-unable-to-recover-from-nodes-crashing/76594, the following steps leads to NullPointer- and ClassCasting exceptions.
- Send a message from client A (connected to node A) to client B (connected to node B)
- Client B receives the message
- Send a SIGTERM to the OpenFire process running on node A
- Restart OpenFire on node A
- Reconnect client A
- Send a message from client A (connected to node A) to client B (connected to node B)
- Client B receives the message
- Send a SIGTERM to the OpenFire process running on node B
- Restart OpenFire on node B
- Reconnect client B
- Send a message from client A (connected to node A) to client B (connected to node B)
Logs from Node B:
2016.08.23 07:24:12 INFO [TaskEngine-pool-2]: com.hazelcast.core.LifecycleService - [10.12.3.44]:5701 [openfire] [3.5.1] Address[10.12.3.44]:5701 is STARTED
2016.08.23 07:24:12 WARN [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterManager - Null value is not allowed!
java.lang.NullPointerException: Null value is not allowed!
at com.hazelcast.util.Preconditions.checkNotNull(Preconditions.java:41)
at com.hazelcast.map.impl.proxy.MapProxySupport.putAllInternal(MapProxySupport.java:862)
at com.hazelcast.map.impl.proxy.MapProxyImpl.putAll(MapProxyImpl.java:309)
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache.putAll(ClusteredCache.java:129)
at org.jivesoftware.util.cache.CacheFactory.joinedCluster(CacheFactory.java:738)
at org.jivesoftware.openfire.cluster.ClusterManager$2.run(ClusterManager.java:95)
2016.08.23 07:24:35 ERROR [socket_c2s-thread-3]: org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed
java.lang.ClassCastException: org.jivesoftware.util.cache.DefaultCache cannot be cast to org.jivesoftware.openfire.plugin.util.cache.ClusteredCache
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.getLock(ClusteredCacheFactory.java:444)
at org.jivesoftware.util.cache.CacheFactory.getLock(CacheFactory.java:415)
at org.jivesoftware.openfire.spi.RoutingTableImpl.getRoutes(RoutingTableImpl.java:823)
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToBareJID(RoutingTableImpl.java:539)
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:308)
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:244)
at org.jivesoftware.openfire.MessageRouter.route(MessageRouter.java:139)
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:83)
at org.jivesoftware.openfire.net.StanzaHandler.processMessage(StanzaHandler.java:378)
at org.jivesoftware.openfire.net.ClientStanzaHandler.processMessage(ClientStanzaHandler.java:113)
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:232)
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:199)
at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:181)
at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690)
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407)
at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236)
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
at org.apache.mina.core.filterchain.IoFilterEvent.fire(IoFilterEvent.java:74)
at org.apache.mina.core.session.IoEvent.run(IoEvent.java:63)
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTask(OrderedThreadPoolExecutor.java:769)
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTasks(OrderedThreadPoolExecutor.java:761)
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.run(OrderedThreadPoolExecutor.java:703)
at java.lang.Thread.run(Thread.java:745)
2016.08.23 07:24:35 ERROR [socket_c2s-thread-3]: org.jivesoftware.openfire.nio.ConnectionHandler - Closing connection due to error while processing message: <message id="b10f9e39-4e5f-413f-9b13-2e5463a5b09d" to="user20@xmpp" type="chat"><body>bar</body></message>
[Comment from Nathan Neulinger @ https://issues.igniterealtime.org/browse/HZ-3?focusedCommentId=25103]
FYI, in case it helps you any - I found that a "fast" restart of openfire causes this every time, but if you leave it down for a while (over 30 seconds by default I think) between 3-4, 8-9 - it is more generally able to come back online.
I thought I had opened a jira issue on this, but can't seem to find it now.
[Comment from Nathan Neulinger @ https://issues.igniterealtime.org/browse/HZ-3?focusedCommentId=25104]
Ah, it was in the community, not JIRA: https://discourse.igniterealtime.org/t/openfire-hazelcast-cant-handle-fast-restart-of-node-but-ok-if-slow/62775
Believe this has been fixed by numerous improvements in core Openfire
https://issues.igniterealtime.org/browse/OF-1707
https://issues.igniterealtime.org/browse/OF-1701
https://issues.igniterealtime.org/browse/OF-1700