Scripts not resilient to gateway restarts

Here the script finds one gateway

zeebe-chaos/chaos-workers/chaos-experiments/scripts/start-instance-on-partition-with-version.sh

Line 19 in aee26dc

pod=$(getGateway)

And then it tries to exec into the gateway

zeebe-chaos/chaos-workers/chaos-experiments/scripts/start-instance-on-partition-with-version.sh

Line 31 in aee26dc

    
processInstanceKey=$(kubectl exec "$pod" -n "$namespace" -- zbctl create instance "$processId" --version "$requiredDeploymentVersion" --insecure)

But between the execution of these two lines, the gateway pod was terminated and a new pod was started to replace it. The script still tried to access the terminated gateway and eventually timed out, failing the experiment.

Might make sense to use the service to be more resilient. Or use the helper retryUntilSuccess as we do here

zeebe-chaos/chaos-workers/chaos-experiments/scripts/start-instance-on-partition-with-version.sh

Line 38 in aee26dc

retryUntilSuccess startInstancesOnPartition

An instance where this happened:

2022-04-21 04:34:21.442 CEST chaos-worker "++ kubectl exec zeebe-gateway-c7fdf4f5c-v7mzz -n 0b25276f-1113-4627-9c17-5b867256e62a-zeebe -- zbctl create instance benchmark --insecure"
Debug 2022-04-21 04:34:21.505 CEST chaos-worker "error: cannot exec into a container in a completed pod; current phase is Failed"

The pod zeebe-gateway-c7fdf4f5c-v7mzz was terminated before this time.
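
One possible reading of the "use the service" suggestion, as a rough sketch: `kubectl exec` accepts a `TYPE/NAME` target such as `svc/<name>` and resolves it to a backing pod when the command runs, so the script would not need to hold on to a pod name. The Service name `zeebe-gateway` is an assumption (guessed from the pod name in the log above), and `createInstanceViaService` is a hypothetical helper, not something from the repository.

```shell
#!/bin/bash
# Assumption: the gateway Service is called "zeebe-gateway"; override via
# the gatewayService variable if the cluster uses a different name.
gatewayService="${gatewayService:-zeebe-gateway}"

# Hypothetical helper: exec through the Service instead of a fixed pod name.
# kubectl picks a pod behind the Service at call time, so a pod that was
# replaced earlier is not referenced by name here.
# Expects namespace, processId and requiredDeploymentVersion to be set,
# as in the existing script.
createInstanceViaService() {
  kubectl exec "svc/$gatewayService" -n "$namespace" -- \
    zbctl create instance "$processId" --version "$requiredDeploymentVersion" --insecure
}
```

This only changes how the target is addressed; whether kubectl always skips a pod that is in the middle of terminating is not something this sketch guarantees, so it would still make sense to wrap the call in retryUntilSuccess.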

"Or use the helper retryUntilSuccess as we do here"

It is already using retryUntilSuccess. The problem is that it keeps retrying to connect to the same terminated gateway.

Yeah, because getGateway is not included in the loop, so the pod name is resolved only once.
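
A minimal sketch of what moving the lookup into the retried step could look like. getGateway and retryUntilSuccess are the script's existing helpers (their definitions are not shown in this thread); `createInstanceOnCurrentGateway` is a hypothetical name used only for illustration, and the real script may structure this differently.

```shell
#!/bin/bash
# Hypothetical helper: resolve the gateway pod and create the instance in the
# same function, so each retry looks up whichever gateway pod exists now
# instead of reusing a pod name captured once at script start.
# Expects namespace, processId and requiredDeploymentVersion to be set,
# and getGateway to be defined (as in the existing script).
createInstanceOnCurrentGateway() {
  local pod
  pod=$(getGateway) || return 1
  # The function's exit status is that of kubectl, so a failed exec
  # (for example against a pod that just went away) triggers another retry.
  processInstanceKey=$(kubectl exec "$pod" -n "$namespace" -- \
    zbctl create instance "$processId" --version "$requiredDeploymentVersion" --insecure)
}
```

It would then be invoked as `retryUntilSuccess createInstanceOnCurrentGateway`, assuming retryUntilSuccess re-runs the given function until it returns success.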

Why don't we execute zbctl on the chaos worker? We have the authenticationDetails for the cluster available in the process variables.

Currently, the script is independent of where and against what it is executed: local, Helm, cloud/SaaS, etc.

I think this is no longer an issue: if we run into a problem, the zbchaos worker will restart and retry later. Gateways are now chosen at random, see #297.