igniterealtime/openfire-hazelcast-plugin

Cluster task error caused by conflicting logins

helloworldtech1024 opened this issue · 6 comments

When I repeatedly force conflicting logins with the same account (e.g. zhangsan@qq.com/abc) on two clients (strophe.js), a cluster error occurs.
The error does not only affect the conflicting account: it also leaves the Openfire node paralyzed, so that all messages sent and received by that node become abnormal.
This exception is easy to reproduce and occurs every time. The error is:

2022.10.17 13:46:54 org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Failed to execute cluster task within org.jivesoftware.util.SystemProperty@46d41f17 seconds
java.util.concurrent.TimeoutException: MemberCallableTaskOperation failed to complete within 30 SECONDS. Invocation{op=com.hazelcast.executor.impl.operations.MemberCallableTaskOperation{serviceName='hz:impl:executorService', identityHash=1687865074, partitionId=-1, replicaIndex=0, callId=35822684, invocationTime=1665985584809 (2022-10-17 13:46:24.809), waitTimeout=-1, callTimeout=30000, name=openfire::cluster::executor}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=30000, firstInvocationTimeMs=1665985584840, firstInvocationTime='2022-10-17 13:46:24.840', lastHeartbeatMillis=1665985610034, lastHeartbeatTime='2022-10-17 13:46:50.034', target=[10.201.1.12]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}
  at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newTimeoutException(InvocationFuture.java:68) ~[hazelcast-3.12.5.jar!/:?]
  at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:202) ~[hazelcast-3.12.5.jar!/:?]
  at com.hazelcast.util.executor.DelegatingFuture.get(DelegatingFuture.java:88) ~[hazelcast-3.12.5.jar!/:?]
  at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doSynchronousClusterTask(ClusteredCacheFactory.java:459) [hazelcast-2.5.0.jar!/:?]
  at org.jivesoftware.util.cache.CacheFactory.doSynchronousClusterTask(CacheFactory.java:736) [xmppserver-4.6.7.jar:4.6.7]
  at org.jivesoftware.openfire.plugin.session.RemoteSession.doSynchronousClusterTask(RemoteSession.java:194) [hazelcast-2.5.0.jar!/:?]
  at org.jivesoftware.openfire.plugin.session.RemoteSession.isClosed(RemoteSession.java:138) [hazelcast-2.5.0.jar!/:?]
  at org.jivesoftware.openfire.plugin.session.RemoteSessionTask.run(RemoteSessionTask.java:97) [hazelcast-2.5.0.jar!/:?]
  at org.jivesoftware.openfire.plugin.session.ClientSessionTask.run(ClientSessionTask.java:70) [hazelcast-2.5.0.jar!/:?]
  at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.0.jar!/:?]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_121]
  at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
  at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
  at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
  at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
  at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]
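
For context on the trace above: the plugin submits session-related tasks to the cluster member that owns the session and blocks on the result with a fixed timeout. A minimal sketch of that pattern, with simplified names rather than the actual plugin code, looks roughly like this; the executor name and the 30-second timeout are taken from the log above.

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.Member;

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SynchronousClusterTaskSketch {

    // Hypothetical task standing in for ClientSessionTask/RemoteSessionTask:
    // "ask the node that owns the session whether it is closed".
    static class IsSessionClosedTask implements Callable<Boolean>, Serializable {
        @Override
        public Boolean call() {
            // On the remote node this would look up the local session and
            // return session.isClosed(); here it is just a placeholder.
            return Boolean.FALSE;
        }
    }

    static Boolean runOnMember(HazelcastInstance hazelcast, Member target) throws Exception {
        // The executor name matches the one in the log ("openfire::cluster::executor").
        IExecutorService executor = hazelcast.getExecutorService("openfire::cluster::executor");
        Future<Boolean> result = executor.submitToMember(new IsSessionClosedTask(), target);

        // The calling thread blocks here with a configurable timeout
        // (30 seconds by default). If the target member does not answer in
        // time -- busy, blocked, or unreachable -- this get() throws the
        // TimeoutException shown in the stack trace above.
        return result.get(30, TimeUnit.SECONDS);
    }
}
```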

Thank you for your report. I have some additional questions:

Which version of Openfire are you using?

Which version of the Hazelcast plugin are you using?

Does this problem also occur with other clients? Does it occur with clients that do not use HTTP-Bind/BOSH or websockets?

What is the setting of the "Resource Policy" that you use (see screenshot)?

Which version of Openfire are you using?
Both 【4.6.7】 and 【4.7.3】; the error occurs on both versions.

Which version of the Hazelcast plugin are you using?
【2.5.0】 and 【2.6.0】, corresponding to the Openfire versions above.

Does this problem also occur with other clients?
No other client has been tried yet; I'll try 【smack】 later (see the sketch after these answers).

Does it occur with clients that do not use HTTP-Bind/BOSH or websockets?
I use websockets.

What is the setting of the "Resource Policy" that you use (see screenshot)?
I set 【Always kick】. The errors seemed to still occur after changing the setting to 【Assign kick value = 5】, but I can't remember clearly.
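
Since Smack was mentioned as the next client to try, here is a hypothetical sketch of what a minimal conflicting-login test could look like from Smack. The domain, credentials and resource below are placeholders loosely based on the report, not a confirmed reproduction.

```java
import org.jivesoftware.smack.tcp.XMPPTCPConnection;
import org.jivesoftware.smack.tcp.XMPPTCPConnectionConfiguration;
import org.jxmpp.jid.parts.Resourcepart;

public class ConflictingLoginSketch {

    static XMPPTCPConnection login() throws Exception {
        XMPPTCPConnectionConfiguration config = XMPPTCPConnectionConfiguration.builder()
                .setXmppDomain("10.201.2.88")                    // placeholder domain
                .setHost("10.201.2.88")                          // placeholder Openfire node
                .setPort(5222)
                .setUsernameAndPassword("c02023020", "secret")   // placeholder credentials
                .setResource(Resourcepart.from("sdk"))           // same resource on both clients
                .build();

        XMPPTCPConnection connection = new XMPPTCPConnection(config);
        connection.connect().login();
        return connection;
    }

    public static void main(String[] args) throws Exception {
        // First session logs in normally; the second login with the identical
        // full JID should trigger the server's "Resource Policy" handling.
        XMPPTCPConnection first = login();
        XMPPTCPConnection second = login();

        Thread.sleep(5000); // give the server time to apply the policy
        System.out.println("first session still connected: " + first.isConnected());
        second.disconnect();
    }
}
```

Pointing the two logins at different cluster nodes would mirror the clustered case in the report more closely.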

Which version of Openfire are you using? Which version of the Hazelcast plugin are you using? Does this problem also occur with other clients? What is the setting of the "Resource Policy" that you use?
See above.

Does it occur with clients that do not use HTTP-Bind/BOSH or websockets?
I use websockets: http://ip:7070/ws/

I am having trouble reproducing this problem. I have a cluster of three Openfire nodes. I am using strophe clients that log in to two different cluster nodes at the same time, using the same username, password and resource. The last client to log in always seems to kick the previous client, which is intended. I do not see stack traces in the log file.

I recorded a video that reproduces this exception:
https://www.bilibili.com/video/BV1ee4y1m7kf/?vd_source=7df86661fecef0bfbae11b1b8d74bc9c

At 00:09, the full JID is 【c02023020@10.201.2.88/sdk】 and everything works fine.
At 00:22, I open a new tab in the browser and log in with the same full JID, so the JS scripts in the two tabs conflict. Because the full JID conflicts, the strophe client is disconnected, but my JS script drives the strophe client to reconnect, so you can see a lot of WS connections in the F12 developer tools.
Then, at 01:37, the server exception appears and the current node becomes unavailable; at 01:49 it also affects the admin console, and all messages sent and received by the node become abnormal.
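
For readers who cannot view the video: the client behaviour it shows is, roughly, "reconnect immediately whenever the session is closed by the conflict". A hypothetical Smack-flavoured sketch of that loop, not the original strophe.js code, could look like this; it assumes Smack 4.4+, where ConnectionListener methods have default implementations.

```java
import org.jivesoftware.smack.ConnectionListener;
import org.jivesoftware.smack.tcp.XMPPTCPConnection;

public class ReconnectLoopSketch {

    static void keepReconnecting(XMPPTCPConnection connection) {
        connection.addConnectionListener(new ConnectionListener() {
            @Override
            public void connectionClosedOnError(Exception e) {
                // The conflicting login in the other tab closed this stream
                // (stream error "conflict"). Reconnect from a separate thread,
                // which in turn kicks the other session, and so on.
                new Thread(() -> {
                    try {
                        connection.connect().login();
                    } catch (Exception reconnectFailure) {
                        // A real client would retry again after a short delay.
                    }
                }).start();
            }
        });
    }
}
```

Smack's built-in ReconnectionManager does this with a back-off; the tight loop above only mirrors the aggressive reconnect that drives the repeated conflicts in the video.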