igniterealtime/openfire-hazelcast-plugin

Failure to recover from cluster restart

GregDThomas opened this issue · 3 comments

[Transferred from https://issues.igniterealtime.org/projects/HZ/issues/HZ-3]

As reported in https://discourse.igniterealtime.org/t/openfire-cluster-unable-to-recover-from-nodes-crashing/76594, the following steps leads to NullPointer- and ClassCasting exceptions.

  1. Send a message from client A (connected to node A) to client B (connected to node B)
  2. Client B receives the message
  3. Send a SIGTERM to the OpenFire process running on node A
  4. Restart OpenFire on node A
  5. Reconnect client A
  6. Send a message from client A (connected to node A) to client B (connected to node B)
  7. Client B receives the message
  8. Send a SIGTERM to the OpenFire process running on node B
  9. Restart OpenFire on node B
  10. Reconnect client B
  11. Send a message from client A (connected to node A) to client B (connected to node B)

Logs from Node B:

2016.08.23 07:24:12 INFO  [TaskEngine-pool-2]: com.hazelcast.core.LifecycleService - [10.12.3.44]:5701 [openfire] [3.5.1] Address[10.12.3.44]:5701 is STARTED
2016.08.23 07:24:12 WARN  [ClusterManager events dispatcher]: org.jivesoftware.openfire.cluster.ClusterManager - Null value is not allowed!
java.lang.NullPointerException: Null value is not allowed!
       	at com.hazelcast.util.Preconditions.checkNotNull(Preconditions.java:41)
       	at com.hazelcast.map.impl.proxy.MapProxySupport.putAllInternal(MapProxySupport.java:862)
       	at com.hazelcast.map.impl.proxy.MapProxyImpl.putAll(MapProxyImpl.java:309)
       	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache.putAll(ClusteredCache.java:129)
       	at org.jivesoftware.util.cache.CacheFactory.joinedCluster(CacheFactory.java:738)
       	at org.jivesoftware.openfire.cluster.ClusterManager$2.run(ClusterManager.java:95)
2016.08.23 07:24:35 ERROR [socket_c2s-thread-3]: org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed
java.lang.ClassCastException: org.jivesoftware.util.cache.DefaultCache cannot be cast to org.jivesoftware.openfire.plugin.util.cache.ClusteredCache
       	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.getLock(ClusteredCacheFactory.java:444)
       	at org.jivesoftware.util.cache.CacheFactory.getLock(CacheFactory.java:415)
       	at org.jivesoftware.openfire.spi.RoutingTableImpl.getRoutes(RoutingTableImpl.java:823)
       	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToBareJID(RoutingTableImpl.java:539)
       	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:308)
       	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:244)
       	at org.jivesoftware.openfire.MessageRouter.route(MessageRouter.java:139)
       	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:83)
       	at org.jivesoftware.openfire.net.StanzaHandler.processMessage(StanzaHandler.java:378)
       	at org.jivesoftware.openfire.net.ClientStanzaHandler.processMessage(ClientStanzaHandler.java:113)
       	at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:232)
       	at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:199)
       	at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:181)
       	at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690)
       	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
       	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
       	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
       	at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
       	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
       	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
       	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
       	at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407)
       	at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236)
       	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
       	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
       	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
       	at org.apache.mina.core.filterchain.IoFilterEvent.fire(IoFilterEvent.java:74)
       	at org.apache.mina.core.session.IoEvent.run(IoEvent.java:63)
       	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTask(OrderedThreadPoolExecutor.java:769)
       	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTasks(OrderedThreadPoolExecutor.java:761)
       	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.run(OrderedThreadPoolExecutor.java:703)
       	at java.lang.Thread.run(Thread.java:745)
2016.08.23 07:24:35 ERROR [socket_c2s-thread-3]: org.jivesoftware.openfire.nio.ConnectionHandler - Closing connection due to error while processing message: <message id="b10f9e39-4e5f-413f-9b13-2e5463a5b09d" to="user20@xmpp" type="chat"><body>bar</body></message>

[Comment from Nathan Neulinger @ https://issues.igniterealtime.org/browse/HZ-3?focusedCommentId=25103]

FYI, in case it helps you any - I found that a "fast" restart of openfire causes this every time, but if you leave it down for a while (over 30 seconds by default I think) between 3-4, 8-9 - it is more generally able to come back online.

I thought I had opened a jira issue on this, but can't seem to find it now.