Chaos: Disconnect zbchaos command fails with runtime error
shahamit opened this issue · 7 comments
Chaos Experiment
We tried the zbchaos disconnect command against a locally installed Zeebe cluster (v8.1.6). The command fails with the runtime error "invalid memory address or nil pointer dereference".
Please find attached screenshots of runs with different flags. All of them lead to the same error. Could you share some insights? Thanks.
Hey @shahamit, zbchaos doesn't support local installations. The expected setup is either a deployment via Helm charts in Kubernetes or the internal setup in our SaaS.
I will try to document this better.
Can you rerun the same with verbosity enabled?
Sorry for the delayed response. It took us some time to get a distributed cluster up on AWS.
We ran this test against a cluster that was under load. The configuration is 2 gateways, 6 brokers, 6 partitions, and a replication factor of 2.
The disconnect command does disconnect the gateway, but this leads to errors on the client and on the gateway. I also didn't follow the verbose output of the disconnect command. It says "Gateway deployment not fully available. Available replicas 2/3". Is this because k8s creates a new gateway replica when the first one gets disconnected?
Overall, it seems the cluster stops functioning if one gateway node gets disconnected, which isn't good. Thoughts?
Benchmarking tool (client side) logs
Gateway Logs
io.camunda.zeebe.gateway - Failed to activate jobs for type benchmark-task-benchmarkStarter1-completed from partition 5
java.net.ConnectException: Failed to connect channel for address dev-zeebe-4.dev-zeebe.default.svc:26501
at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$bootstrapClient$36(NettyMessagingService.java:721) ~[zeebe-atomix-cluster-8.1.6.jar:8.1.6]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:674) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:693) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at java.lang.Thread.run(Unknown Source) ~[?:?]
2023-03-21 13:03:39.177 [ActivateJobsHandler] [gateway-scheduler-zb-actors-3] WARN
io.camunda.zeebe.gateway - Failed to activate jobs for type benchmark-task-benchmarkStarter1-completed from partition 5
java.net.ConnectException: Failed to connect channel for address dev-zeebe-4.dev-zeebe.default.svc:26501
at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$bootstrapClient$36(NettyMessagingService.java:721) ~[zeebe-atomix-cluster-8.1.6.jar:8.1.6]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:674) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:693) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
at java.lang.Thread.run(Unknown Source) ~[?:?]
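To relate the warning above to the cluster configuration (6 brokers, 6 partitions, replication factor 2), here is a minimal sketch of which brokers would host each partition. It assumes a simple round-robin partition distribution and that the pod ordinal in a name like dev-zeebe-4 equals the broker node ID. Both are assumptions made for illustration, not something confirmed in this thread, and `round_robin_layout` is a hypothetical helper, not part of Zeebe or zbchaos.

```python
from collections import Counter


def round_robin_layout(brokers: int, partitions: int, replication_factor: int) -> dict[int, list[int]]:
    """Map each partition ID (1-based) to the broker IDs (0-based) hosting its replicas.

    Assumption: plain round-robin placement, where partition p starts at
    broker (p - 1) % brokers and the remaining replicas go to the next brokers.
    """
    layout = {}
    for partition in range(1, partitions + 1):
        first = (partition - 1) % brokers
        layout[partition] = [(first + i) % brokers for i in range(replication_factor)]
    return layout


# Configuration reported in this issue: 6 brokers, 6 partitions, replication factor 2.
layout = round_robin_layout(brokers=6, partitions=6, replication_factor=2)

# Count how many partition replicas each broker hosts.
replicas_per_broker = Counter(broker for members in layout.values() for broker in members)

for partition, members in layout.items():
    print(f"partition {partition}: brokers {members}")
```

Under these assumptions, partition 5 would be hosted on brokers 4 and 5, which is consistent with the gateway trying to reach dev-zeebe-4 for partition 5, and every broker would host two partition replicas.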
@Zelldon - there are a couple of blockers that we observed when running the chaos tool against a Zeebe cluster under load. One of them is this issue, and the other is the gateway termination logged here.
We observed more failures when executing the restart gateway chaos experiment, but we decided to re-run it only after these logged issues have been analyzed.
Should I move these issues to the Zeebe repo to gain traction, since the problem is not with the experiment itself but with its outcome?
Thanks