zeebe-io/zeebe-chaos

Chaos: Disconnect zbchaos command fails with runtime error

shahamit opened this issue · 7 comments

Chaos Experiment

We tried the zbchaos disconnect command against a locally installed Zeebe cluster (v8.1.6). The command fails with the runtime error "invalid memory address or nil pointer dereference".

The attached screenshots show runs with different flags. All of them lead to the same error. Could you share some insights? Thanks.
Screenshot from 2023-03-14 16-08-38
Screenshot from 2023-03-14 14-53-57

Hey @shahamit, zbchaos doesn't support local installations. The expected setup is either a deployment via Helm charts in Kubernetes or a cluster set up internally in our SaaS.

I will try to document this better.

Can you rerun the same with verbosity enabled?

Sorry for the delayed response. It took us some time to get a distributed cluster up on AWS.

We ran this test against a cluster that was under load. The config is 2 gateways, 6 brokers, 6 partitions, and a replication factor of 2.
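For reference, here is a minimal sketch of how partitions could map to brokers in this setup. It assumes a round-robin placement in which partition ids start at 1, broker ids start at 0, and each partition's replicas sit on consecutive brokers. The function name `replica_brokers` is hypothetical and this is not code from Zeebe or zbchaos.

```python
def replica_brokers(partition_id: int, cluster_size: int, replication_factor: int) -> list[int]:
    """Return the broker ids hosting a partition under round-robin placement.

    Assumption: partition ids start at 1 and broker ids start at 0.
    """
    first = (partition_id - 1) % cluster_size
    return [(first + offset) % cluster_size for offset in range(replication_factor)]


# Setup described in this comment: 6 brokers, 6 partitions, replication factor 2.
for partition in range(1, 7):
    print(f"partition {partition}: brokers {replica_brokers(partition, 6, 2)}")
```

Under this assumption, partition 5 would be hosted on brokers 4 and 5, which would be consistent with the gateway log naming dev-zeebe-4 for partition 5. The actual placement should be checked against the cluster topology.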

The disconnect command does disconnect the gateway, but this leads to errors on the client and on the gateway. I also didn't follow the verbose output of the disconnect command. It says "Gateway deployment not fully available. Available replicas 2/3". Is this because k8s creates a new gateway replica when the first one gets disconnected?

Overall, it seems the cluster stops functioning if one gateway node gets disconnected, which isn't good. Thoughts?

Disconnect command output
ksnip_20230321-183849 (1)

Benchmarking tool (client side) logs
ksnip_20230321-183548 (1)

Gateway Logs

io.camunda.zeebe.gateway - Failed to activate jobs for type benchmark-task-benchmarkStarter1-completed from partition 5
java.net.ConnectException: Failed to connect channel for address dev-zeebe-4.dev-zeebe.default.svc:26501
        at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$bootstrapClient$36(NettyMessagingService.java:721) ~[zeebe-atomix-cluster-8.1.6.jar:8.1.6]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:674) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:693) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at java.lang.Thread.run(Unknown Source) ~[?:?]
2023-03-21 13:03:39.177 [ActivateJobsHandler] [gateway-scheduler-zb-actors-3] WARN
      io.camunda.zeebe.gateway - Failed to activate jobs for type benchmark-task-benchmarkStarter1-completed from partition 5
java.net.ConnectException: Failed to connect channel for address dev-zeebe-4.dev-zeebe.default.svc:26501
        at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$bootstrapClient$36(NettyMessagingService.java:721) ~[zeebe-atomix-cluster-8.1.6.jar:8.1.6]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:674) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:693) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[netty-transport-classes-epoll-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.82.Final.jar:4.1.82.Final]
        at java.lang.Thread.run(Unknown Source) ~[?:?]
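
To triage logs like the ones above, a small script can count which partitions and broker addresses show up in the failures. This is a rough sketch that assumes the two message formats seen in this log excerpt. The function name `summarize` and the regular expressions are illustrative, not part of any Zeebe tooling.

```python
import re
from collections import Counter

# Patterns based only on the two message shapes visible in the excerpt above.
ACTIVATE_FAILURE = re.compile(
    r"Failed to activate jobs for type (?P<job_type>\S+) from partition (?P<partition>\d+)"
)
CONNECT_FAILURE = re.compile(r"Failed to connect channel for address (?P<address>\S+)")


def summarize(log_text: str) -> tuple[Counter, Counter]:
    """Count failed job activations per partition and failed connects per address."""
    partitions = Counter(int(m["partition"]) for m in ACTIVATE_FAILURE.finditer(log_text))
    addresses = Counter(m["address"] for m in CONNECT_FAILURE.finditer(log_text))
    return partitions, addresses


sample = """\
io.camunda.zeebe.gateway - Failed to activate jobs for type benchmark-task-benchmarkStarter1-completed from partition 5
java.net.ConnectException: Failed to connect channel for address dev-zeebe-4.dev-zeebe.default.svc:26501
io.camunda.zeebe.gateway - Failed to activate jobs for type benchmark-task-benchmarkStarter1-completed from partition 5
java.net.ConnectException: Failed to connect channel for address dev-zeebe-4.dev-zeebe.default.svc:26501
"""

partitions, addresses = summarize(sample)
print("failures per partition:", dict(partitions))
print("failures per address:", dict(addresses))
```

If all failures point at a single partition and address, as in the excerpt, that suggests checking connectivity between the gateway and that one broker first.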

@Zelldon, we observed a couple of blockers when running the chaos tool against a Zeebe cluster under load. One of them is this issue and the other is the gateway termination logged here.

We observed more failures when executing the restart gateway chaos experiment, but we plan to re-execute it once the issues already logged have been analyzed.

Should I move these issues to the zeebe repo to gain traction, since the problem is not with the experiment itself but with its outcome?

Thanks

Hey @shahamit

> Should I move these issues to the zeebe repo to gain traction, since the problem is not with the experiment itself but with its outcome?

Sounds reasonable to me, but let's first collect a bit more information in #336.