zeebe-io/zeebe-chaos

Automating with zbchaos fails with timeouts on the gateway when deploying processes

Zelldon opened this issue · 11 comments

After restructuring zbchaos and running it against our clusters, it often fails with timeout errors when deploying a process model.

This happened when automating with Testbench, running against SaaS, but also when executing locally and running against a self-managed cluster.

{"level":"debug","logging.googleapis.com/labels":{"clusterId":"","jobKey":"2251799813688688","processInstanceKey":"2251799813688668","title":"CPU stress on an Broker"},"logging.googleapis.com/operation":{"id":"2251799813688688"},"time":"2022-12-10T14:18:30.854571594+01:00","message":"Port forward to zell-chaos-zeebe-gateway-7796d8579f-pdttw"}
{"level":"debug","logging.googleapis.com/labels":{"clusterId":"","jobKey":"2251799813688688","processInstanceKey":"2251799813688668","title":"CPU stress on an Broker"},"logging.googleapis.com/operation":{"id":"2251799813688688"},"time":"2022-12-10T14:18:30.955492888+01:00","message":"Successfully created port forwarding tunnel"}
{"level":"debug","logging.googleapis.com/labels":{"clusterId":"","jobKey":"2251799813688688","processInstanceKey":"2251799813688668","title":"CPU stress on an Broker"},"logging.googleapis.com/operation":{"id":"2251799813688688"},"time":"2022-12-10T14:18:30.955897778+01:00","message":"Deploy file bpmn/one_task.bpmn (size: 2526 bytes)."}
panic: rpc error: code = DeadlineExceeded desc = Time out between gateway and broker: Request ProtocolRequest{id=3029, subject=command-api-1, sender=10.0.5.86:26502, payload=byte[]{length=2675, hash=1668380772}} to zell-chaos-zeebe-0.zell-chaos-zeebe.zell-chaos.svc:26501 timed out in PT15S

Interestingly, when this happens with Testbench and the deployed zbchaos worker, the worker retries every 5 minutes (after the job timeout) and fails again until the experiment itself fails due to timeout. I was also not able to deploy by hand with zbctl, which is a bit suspicious to me. It looks like there is an issue in the gateway or broker when running such experiments.

Details

As mentioned above, I ran the experiment locally against a self-managed cluster and increased the log levels (Zeebe: trace, Atomix: debug).

When the experiment fails, we can see in the gateway log that the request is sent to the broker:


2022-12-10 14:18:30.940 CET

zeebe-gateway
Send request 1519360045 to zell-chaos-zeebe-0.zell-chaos-zeebe.zell-chaos.svc:26501 with topic command-api-1

(At least I hope this is the request ?!)

In the broker logs I see several issues with probing between leader and followers, and later leader changes.

2 - Failed to probe Member{id=0e1d070c-36bb-4023-8b6c-706186e68289, address=zell-chaos-zeebe-0.zell-chaos-zeebe.zell-chaos.svc:26502, properties={}, version=null, timestamp=0, state=null, incarnationNumber=0}

Furthermore, I see that a response can't be sent back (from broker 0, which is our leader at this time):

2022-12-10 14:18:30.654 CET

zell-chaos-zeebe-0

zeebe
Received a reply for message id:[2134] but was unable to locate the request handle

Did we try checking network traffic with Wireshark/tshark? What does it look like?

What we (@npepinpe and I) did so far:

  • We discussed the potential error cases
    • @npepinpe raised the question of whether the command is written to the log at all
    • I checked it with zdb: the deploy command, which failed, was not on the log
      output.json.txt
    • We discussed whether the dispatcher might not write the commands. @npepinpe remembers there is a weird behavior in the CommandAPI https://github.com/camunda/zeebe/blob/main/broker/src/main/java/io/camunda/zeebe/broker/transport/commandapi/CommandApiRequestHandler.java#L115 If the command is not written, it is silently ignored (see the sketch after this list).
    • I added a new log statement and built a Docker image to run the experiments again
    • After setting up the cluster with the new changes and running the experiment again, it failed at the same step. But it looks like it is related to leader changes: maybe we have no leader for partition one at that moment, which causes the deploy command to time out.
    • So with the self-managed cluster it looks like the stress experiment always caused a leader change, which caused the deployment to fail on partition one. I added a pause in between (which you can do in the experiment definition) and now I can execute it against self-managed 💪 I have to verify whether it also now works with SaaS, or whether this is a different issue.
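
A minimal sketch of the failure mode discussed above (hypothetical names, not the actual CommandApiRequestHandler code): if writing the command to the log stream fails, the request is simply dropped without any rejection, so the gateway only ever sees its own DeadlineExceeded timeout.

final class SilentDropSketch {

  interface LogStreamWriter {
    // returns true if the command was appended to the log, false otherwise
    boolean tryWrite(byte[] command);
  }

  static void handleCommand(final LogStreamWriter writer, final byte[] command) {
    if (!writer.tryWrite(command)) {
      // Command not written: silently ignored, no rejection is sent back,
      // so the client never hears about it and eventually times out.
      return;
    }
    // Otherwise a response is sent later, once the engine has processed the command.
  }
}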

🕵️ What I tried in the last few days:

Investigate Trace logs

I had a look at it together with @deepthidevaki. We enabled trace logging everywhere, and I was able to see that old IP addresses were in use for ~20 min. This also explains why other requests on the gateway fail, like job activation for connectors.

Details:

The gateway keeps old IP addresses and it takes a while to renew them (BTW, the DNS TTL is 30 min).
I have run several experiments, including one where I enabled trace logging everywhere.

When the experiment failed I checked the IPs of the broker pods:

[cqjawa zeebe-io/ cluster: ultrachaos ns:9d2a5bfa-2a8d-4e09-b5d0-b2b471b75941-zeebe]$ k get pod zeebe-0 -o jsonpath="{.status.podIP}"
10.56.18.91[cqjawa zeebe-io/ cluster: ultrachaos ns:9d2a5bfa-2a8d-4e09-b5d0-b2b471b75941-zeebe]$ k get pod zeebe-1 -o jsonpath="{.status.podIP}"
10.56.17.53[cqjawa zeebe-io/ cluster: ultrachaos ns:9d2a5bfa-2a8d-4e09-b5d0-b2b471b75941-zeebe]$ k get pod zeebe-2 -o jsonpath="{.status.podIP}"
10.56.39.40[cqjawa zeebe-io/ cluster: ultrachaos ns:9d2a5bfa-2a8d-4e09-b5d0-b2b471b75941-zeebe]$

The leader for partition one is Broker 1:

[cqjawa zeebe-io/ cluster: ultrachaos ns:9d2a5bfa-2a8d-4e09-b5d0-b2b471b75941-zeebe]$ k exec zeebe-gateway-5f77b56896-g7qxx -- zbctl status --insecure
Defaulted container "zeebe-gateway" out of: zeebe-gateway, debugger (ephem)
Cluster size: 3
Partitions count: 3
Replication factor: 3
Gateway version: 8.2.0-SNAPSHOT
Brokers:
  Broker 0 - zeebe-0.zeebe-broker-service.9d2a5bfa-2a8d-4e09-b5d0-b2b471b75941-zeebe.svc.cluster.local:26501
    Version: 8.2.0-SNAPSHOT
    Partition 1 : Follower, Healthy
    Partition 2 : Follower, Healthy
    Partition 3 : Follower, Healthy
  Broker 1 - zeebe-1.zeebe-broker-service.9d2a5bfa-2a8d-4e09-b5d0-b2b471b75941-zeebe.svc.cluster.local:26501
    Version: 8.2.0-SNAPSHOT
    Partition 1 : Leader, Healthy
    Partition 2 : Leader, Healthy
    Partition 3 : Follower, Healthy
  Broker 2 - zeebe-2.zeebe-broker-service.9d2a5bfa-2a8d-4e09-b5d0-b2b471b75941-zeebe.svc.cluster.local:26501
    Version: 8.2.0-SNAPSHOT
    Partition 1 : Follower, Healthy
    Partition 2 : Follower, Healthy
    Partition 3 : Leader, Healthy

We can see in the logs that it is connecting to the wrong IPs for Zeebe 1:
https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_container%22%0Ar[…]301Z;resultsSearch=10.56.17?project=camunda-cloud-240911

Furthermore, I can see several old IPs lurking around in the heap dump of the gateway. I think it is related to that: when the IP is updated again at around 22:22:09, we can also see in the metrics that processing starts again, since traffic is reaching the partition again. The partition is dead for almost 20-30 minutes, which is also reproducible with the chaos experiments.

(screenshots: wrongips, oldips)

Disable some experiments

I tried to disable the multiple-leaders experiment to see whether the run goes through, but it is still failing. It is also interesting that I introduced randomness in choosing the gateway, and even that doesn't resolve the issue, but I guess this makes sense since both gateways will have the same problem.

Integration Tests

I created a new cluster in SaaS (967418c4-dd62-4230-939b-0597897d8685) and tried to run my integration test against that.

In order to do that, I need to switch my k8s context and namespace to the created cluster; then I can run the automated experiment in that context with the integration setup I have created.

It was reproduced 💪 Keep in mind that I was not able to reproduce this in self-managed after fixing the issue with the stress experiment.

Looks like an issue with the SaaS infra setup 🤔

Next

I tried to rerun the experiment, but it looks like my cluster is unhealthy. I need to check whether this is related to my executions.

Try to run it again to see whether it is really reproducible, and ping SRE if that is the case.

Potentially we could work around this for now by not always picking the leader of partition one in the chaos experiments.

Impressive investigation 🚀

I'm surprised we've never had issues with this in prod, where the gateway fails to contact certain brokers because of this 🤔

One workaround could be to use IP addresses instead of hostnames. We want hostnames for configuration purposes (so users don't have to reconfigure everything every time an IP changes), but perhaps nodes could advertise their current IP address in gossip, such that the latest IP is always picked up eventually regardless of DNS issues? 🤔 This might not work with nodes behind reverse proxies though :(

I have run the experiment again and checked whether DNS resolution is an issue:

$ k exec -it zeebe-gateway-6d79bf8db4-56bkl -- zbctl --insecure status
Defaulted container "zeebe-gateway" out of: zeebe-gateway, debugger-ghlwh (ephem)
Cluster size: 3
Partitions count: 3
Replication factor: 3
Gateway version: 8.1.5
Brokers:
  Broker 0 - zeebe-0.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local:26501
    Version: 8.1.5
    Partition 1 : Follower, Healthy
    Partition 2 : Follower, Healthy
    Partition 3 : Follower, Healthy
  Broker 1 - zeebe-1.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local:26501
    Version: 8.1.5
    Partition 1 : Leader, Healthy
    Partition 2 : Follower, Healthy
    Partition 3 : Follower, Healthy
  Broker 2 - zeebe-2.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local:26501
    Version: 8.1.5
    Partition 1 : Follower, Healthy
    Partition 2 : Leader, Healthy
    Partition 3 : Leader, Healthy

Deployment fails again, so we have to check whether we can resolve the leader of partition one, which is Broker 1. In the experiment we port-forwarded to this gateway: Port forward to zeebe-gateway-6d79bf8db4-56bkl

$ kubectl debug -it --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 --target=zeebe-gateway zeebe-gateway-6d79bf8db4-56bkl
Targeting container "zeebe-gateway". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-ghlwh.
If you don't see a command prompt, try pressing enter.
root@zeebe-gateway-6d79bf8db4-56bkl:/# 
root@zeebe-gateway-6d79bf8db4-56bkl:/# nslookup zeebe-1.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local
Server:		10.0.48.10
Address:	10.0.48.10#53

Name:	zeebe-1.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local
Address: 10.56.22.117

Also from the pod status, we can see the same IP:

$ k get pod zeebe-1 -o jsonpath="{.status.podIP}"
10.56.22.117

In order to verify the state of the gateway, I took a heap dump:

$ kubectl debug -it -c debugger --image=eclipse-temurin:17-jdk-focal --target=zeebe-gateway zeebe-gateway-6d79bf8db4-56bkl -- /bin/bash
Targeting container "zeebe-gateway". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
If you don't see a command prompt, try pressing enter.
root@zeebe-gateway-6d79bf8db4-56bkl:/# jps
7 StandaloneGateway
621 Jps
root@zeebe-gateway-6d79bf8db4-56bkl:/# jmap -dump:live,format=b,file=/usr/local/zeebe/data/heap.dump 7
Dumping heap to /usr/local/zeebe/data/heap.dump ...
Heap dump file created [43820610 bytes in 1.222 secs]

Copying the heap dump

$ k cp zeebe-gateway-6d79bf8db4-56bkl:/usr/local/zeebe/data/heap.dump ./heap.dump
Defaulted container "zeebe-gateway" out of: zeebe-gateway, debugger-ghlwh (ephem), debugger (ephem)
tar: Removing leading `/' from member names
[cqjawa debug/ cluster: ultrachaos ns:967418c4-dd62-4230-939b-0597897d8685-zeebe]$ ls -la
total 42804
drwxr-xr-x  2 cqjawa cqjawa     4096 Dec 14 16:02 .
drwx------ 34 cqjawa cqjawa     4096 Dec 14 16:01 ..
-rw-r--r--  1 cqjawa cqjawa 43820610 Dec 14 16:02 heap.dump

Checking the heap dump, we see that some old IPs are still in use while the new one already seems to be available as well: 10.56.22.117 (new) vs. 10.56.22.116 (old, multiple objects).
(screenshot: inet)

A similar issue is camunda/zeebe#8264, which @oleschoenburg mentioned in Slack.

Investigation:

What I have found out so far is that it looks like we have two separate issues.

One is that our infrastructure has some issues with setting up pods via Calico (?). While executing some experiments, I observed that a broker was not able to come up because it had no eth0 interface configured. It was not able to reach any other node, which also means no DNS resolution was possible.

Secondly, we have the issue with the gateways described above not being able to resolve the correct IP. I have investigated this further as well.

Details: Broker doesn't come up

When running some experiments against my cluster (which looked healthy before), the run failed because one pod didn't come up, and it looked like it couldn't reach the others, so I did the following checks.

$ kubectl debug -it --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 --target=zeebe zeebe-2
Targeting container "zeebe". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-jb6mn.
If you don't see a command prompt, try pressing enter.
root@zeebe-2:/# 
root@zeebe-2:/# nslookup zeebe-0.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local:26501
;; connection timed out; no servers could be reached

root@zeebe-2:/# nslook up zeebe-1.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local      
bash: nslook: command not found
root@zeebe-2:/# nslookup zeebe-1.zeebe-broker-service.967418c4-dd62-4230-939b-0597897d8685-zeebe.svc.cluster.local
;; connection timed out; no servers could be reached

I checked whether the nameserver is reachable

$ kubectl get svc --namespace=kube-system
NAME                            TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                        AGE
calico-typha                    ClusterIP   10.0.63.197   <none>        5473/TCP                       198d
default-http-backend            NodePort    10.0.49.249   <none>        80:30937/TCP                   224d
kube-dns                        ClusterIP   10.0.48.10    <none>        53/UDP,53/TCP                  2y132d

It was not:

root@zeebe-2:/# ping -c 3 10.0.48.10
connect: Network is unreachable

When checking the network interfaces, I saw that eth0 was not set up.

root@zeebe-2:/# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

SRE was able to fix this by restarting the Calico nodes. Still, it is unclear why this happened. If we see that again, we should ping SRE.

Details: Gateway can't talk with brokers

After the issue above was resolved, I ran our experiments against the cluster again. This time I saw the gateway issue again: it couldn't talk to the brokers. The client was not able to deploy any processes.

I took a heap dump again and checked whether I could resolve the addresses on the pod. Name resolution was possible in general, but the gateway still had old IPs in the heap.

(screenshots: newips, gw-use-old-ips)

I have investigated a bit further how DNS TTL works in Java and checked our code regarding this issue.

I found several resources, like https://stackoverflow.com/questions/1256556/how-to-make-java-honor-the-dns-caching-timeout, which point out that you can configure the DNS TTL if you want, but if not set it should be around 30 seconds.
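
For reference, a minimal sketch of how the JVM-level DNS cache TTL can be tuned (standard java.security and InetAddress APIs; the values below are illustrative, not what we run in production):

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

final class DnsTtlCheck {

  public static void main(final String[] args) throws UnknownHostException {
    // Cache successful lookups for 30 seconds (roughly the JVM default when no
    // security manager is installed); "-1" would mean cache forever.
    Security.setProperty("networkaddress.cache.ttl", "30");
    // Do not cache failed lookups.
    Security.setProperty("networkaddress.cache.negative.ttl", "0");

    // Within the TTL, repeated lookups are served from the InetAddress cache;
    // afterwards a fresh resolution is triggered.
    final String host = args.length > 0 ? args[0] : "example.org";
    System.out.println(InetAddress.getByName(host).getHostAddress());
  }
}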

I was able to prove that based on the code and heap dump as well.

Taking a look at how the addresses are cached in InetAddress and how the expiry is calculated we can see the following:

(screenshots: defaultCachePolicy, cachedAddresses)

Furthermore, I was able to find the CachedAddresses in the heap dump I took earlier.
(screenshot: expiryaddresses)

Since the expiry time is stored in nanoseconds, we can't just use a normal Unix timestamp to calculate the remaining time. We need to use the same reference as the System.nanoTime() call used when creating the timestamp. I used the ActorClock nanoseconds field as a reference to verify how far in the future this expiry would be.

(screenshot: actorclockns)

As we can see based on the values, the difference is ~12 seconds:

(screenshot: expiry12sec)
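
For posterity, the calculation itself is simple once both values come from the same monotonic clock; a small sketch (the two inputs are the values read from the heap dump, so they are illustrative here):

import java.util.concurrent.TimeUnit;

final class ExpiryCheck {

  public static void main(final String[] args) {
    // expiryNanos: the expiry field of the cached address entry from the heap dump.
    // referenceNanos: a System.nanoTime()-based reading from the same JVM
    // (here taken from the ActorClock nanoseconds field).
    final long expiryNanos = Long.parseLong(args[0]);
    final long referenceNanos = Long.parseLong(args[1]);

    // Only the difference is meaningful; comparing either value to epoch millis makes no sense.
    final long remainingSeconds = TimeUnit.NANOSECONDS.toSeconds(expiryNanos - referenceNanos);
    System.out.println("Cache entry expires in ~" + remainingSeconds + "s");
  }
}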

I have also checked further whether we reuse any IPs or addresses, but we seem to create new Address objects every time, and we also try to resolve the addresses when connecting.

Details: Chaos Experiment changes

I tried to work around the issue by not always choosing the same leader. Previously in our experiments we often just used the leader of partition one; I thought restarting the same node too often might be an issue. It is likely that the same node becomes leader again if it has the longest log 🤷

This didn't help, unfortunately.

Current State - Next

I guess I need to dig a bit deeper. I have also improved the current toString of Address to print the resolved IP (which is quite useful when debugging/tracing such issues). I will rerun QA with a Docker image of that version.
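
Roughly, the idea looks like this (a sketch with illustrative names, not the actual Atomix Address class): include the currently resolved IP next to the configured hostname so trace logs reveal stale resolutions.

import java.net.InetAddress;
import java.net.UnknownHostException;

record BrokerAddress(String host, int port) {

  @Override
  public String toString() {
    String resolved;
    try {
      // Resolve on demand; in the real class this would use the cached resolution.
      resolved = InetAddress.getByName(host).getHostAddress();
    } catch (final UnknownHostException e) {
      resolved = "unresolved";
    }
    return host + ":" + port + " (" + resolved + ")";
  }
}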

Right now I feel it is an infrastructure issue and not a code issue, since it is easy to reproduce in SaaS (I am not sure whether I ever saw it in self-managed; I have to check again).

Just a heads up: I wrote a runbook regarding DNS troubleshooting: https://github.com/camunda-cloud/runbook/pull/154

I investigated further why addresses seem to be reused. I might have an idea: it looks like we open channels and reuse them for a longer period of time. Interestingly, we do not close the channel on timeout, which might cause the issue we currently see. I tried to fix it with camunda/zeebe@2fcf80e

Unfortunately it doesn't resolve the issue yet, but I have some more insights, since I also added more details to the logging as described above.

What I was able to see is that:

When the deploy is sent: Deploy file bpmn/one_task.bpmn (size: 2526 bytes).

We direct it to zeebe-1: Send request 808630568 to zeebe-1.zeebe-broker-service.ef47e566-7c38-42d1-b815-91efbb7ff317-zeebe.svc.cluster.local/:26501 with topic command-api-1

The request seems to fail later: Request 808630568 failed, will not retry!

We seem to resolve the correct address, see here: Activating server protocol version V2 for connection to /10.56.21.41:35654

But somehow the port is wrong? @npepinpe do you know why this could happen? Or is this not relevant?

We are also able to resolve the correct address here: Resolved Address zeebe-broker-service/10.56.21.41:26502 to bootstrap client.

but somehow it doesn't work.

Would be happy for any input.

Just for posterity:

Hey Team,

Small update from my side. Last week I tried to further investigate why the integration between zbchaos and Testbench fails.

TL;DR: As far as I know now, it seems to be related to the setup in SaaS, but something within the gateway doesn't work well.

Recap: What is the problem

  • Automated experiments fail because the deployment commands are timing out. It looks like the gateway can't reach the leader of partition one, or sends the request to the wrong IP.
  • It seems to be related to multiple restarts of the brokers. It looks like the gateway still has old IPs in use. Impact: I think this can also affect normal users if we have multiple restarts due to rescheduling of pods or preemption in general.

🕵️ Investigation summary

No persisted Commands

  • I verified that the commands are NOT written to the broker log. It looks like the commands are not received by the leader.
    During failure:
  • When the experiment execution fails, manual deployments also fail with zbctl, whether via port-forward or over nginx with credentials.
  • In the experiment automation, I used random gateways for port-forwarding, but this didn't help either.
    Recovers:
  • It often seems to recover after ~20-30 minutes; we can see that traffic is accepted again by partition one and we can deploy again.
    Reproducing:
  • I was NOT able to reproduce this issue in our self-managed clusters.
  • I was always able to reproduce it in SaaS clusters. Right now I am reusing one cluster and rerunning experiments to observe the issue.
  • I can use my integration tests in zbchaos to reproduce the problem, which reduces the noise in zeebe-ci and gives a shorter feedback loop.

Old IPs

  • I turned on trace logging in the Zeebe nodes and was able to see that the gateway still used old IPs.
  • Furthermore, the gateway heap contained old IPs.
  • Interestingly, nslookup in a debug container resolved the names to the correct IPs. The broker-service is a headless service and also returns the correct IPs on name resolution.
  • DNS caching in Java (see the details above)

Tried workarounds:

  • Introduced a wait in the stress experiment; this helped in the self-managed environment at the beginning, but not in SaaS.
  • Used different partitions for the restart/terminate experiments. Doesn't help.
  • Restarting the gateway after a failure worked. This indicates to me that it is an issue at the application level or in the configuration.

Other issues:

  • During experimenting, I ran into an issue with Calico where no eth0 interface was configured, which caused the broker to not come up.
  • I wrote a runbook for the DNS / connectivity issues.

Configuration:

  • I started to investigate whether we have differences in the configuration and setup.
  • Helm charts and SaaS both use headless services for the brokers, which resolve to multiple IPs (one per broker pod). Both use the service in the gateway as the cluster connection.

Code:

  • I started to investigate the NettyMessagingService; it looks like we don't close channels on connection issues #294 (comment)

❓ Open Questions:

  • Why does it fail only in SaaS? What is the difference?
  • Different configurations? Different resources?
  • nslookup in the debug container resolves the IP; does the debug container have a different view than the normal container?
  • Do we reuse IPs/connections somehow in the code? Maybe related to the channels not being closed, as mentioned above.
  • Does it fail on the long-running cluster?
  • Is there anything in the Calico or kube-dns logs that I have overlooked?

Thanks to @deepthi and @nicolas for their input. 🙇 I would be happy if someone has some time to spare to look at the NettyMessaging code with me.

Yesterday I worked on adding some new metrics to the NettyMessagingService: https://github.com/camunda/zeebe/tree/zell-zbchaos-investigation
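
The metrics are essentially counters labeled by the resolved remote address, so a stale IP shows up directly in Grafana. A sketch of the idea (illustrative names, not necessarily the ones in the linked branch), using the Prometheus Java client as an example:

import io.prometheus.client.Counter;

final class MessagingMetrics {

  // Counts outgoing requests per resolved remote address.
  static final Counter SENT_REQUESTS =
      Counter.build()
          .namespace("zeebe")
          .name("messaging_sent_requests_total")
          .help("Number of requests sent, labeled by the resolved remote address")
          .labelNames("address")
          .register();

  static void onRequestSent(final String resolvedAddress) {
    SENT_REQUESTS.labels(resolvedAddress).inc();
  }
}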

I have rerun the chaos experiments, and we can clearly see in the metrics that the gateway sends requests to the wrong IP for quite a while!


The executed experiment failed with: 'Should be able to create process instances on partition two'

zbchaos failed the first time around 2022-12-20 08:30:10.171 CET with: Expected to create process instance on partition 2, but timed out after 30s. It retried every 5 minutes until the chaos experiment failed with a timeout.

(screenshot: request-gw)

From 8:28 until 8:40 the gateway is sending to "10.56.25.48" instead of "10.56.25.49"; when the gateway starts to use the correct IP, we see that processing goes up again. It is interesting that the responses don't show an error; this might be worth investigating further.

The Calico node logs show that the IP was set up correctly at ~8:28 (note the UTC timestamps):

2022-12-20 07:28:22.605 [INFO][69] felix/int_dataplane.go 1585: Received *proto.IPSetDeltaUpdate update from calculation graph msg=id:"s:J1TO0ckqzlO41yn5_FpfVzhgxiDF2WNh8uCIpw" added_members:"10.56.25.49/32"
2022-12-20 07:28:22.608 [INFO][68] felix/int_dataplane.go 1585: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"50ab3950-b7e2-4a53-8250-f7201f3196fe-zeebe/zeebe-1" endpoint_id:"eth0" > endpoint:<state:"active" name:"calie904863dde8" profile_ids:"kns.50ab3950-b7e2-4a53-8250-f7201f3196fe-zeebe" profile_ids:"ksa.50ab3950-b7e2-4a53-8250-f7201f3196fe-zeebe.default" ipv4_nets:"10.56.25.49/32" tiers:<name:"default" ingress_policies:"50ab3950-b7e2-4a53-8250-f7201f3196fe-zeebe/knp.default.zeebe-network-policy" > > 

The kube-dns logs also don't show any issues. We only see the following entry from when the cluster was set up:

Could not find endpoints for service "zeebe-broker-service" in namespace "50ab3950-b7e2-4a53-8250-f7201f3196fe-zeebe". DNS records will be created once endpoints show up.

Update:

We can see in the gateway metrics that around this time the failure rate goes up

(screenshot: failure-rate)

\cc @npepinpe

This issue was fixed in Zeebe via camunda/zeebe#11307

We have seen in several experiments that channels were reused even after a long period of unavailability of the other side, meaning that the brokers were already restarted and had new IPs, but the gateways still sent requests to the old IPs without realizing the issue. The problem here is that we implemented the timeout on top, because as far as we can see right now the channels don't provide that feature themselves. The channel response handling was never called, which means no action could be taken there.

Our fix is now to listen for the timeout exception in order to close the channel. This will cause us to create a new channel on the next request, which allows us to reconnect to the correct node/IP.
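
In essence, the pattern is the following (a simplified sketch of the idea, not the actual patch in camunda/zeebe#11307):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;
import io.netty.channel.Channel;

final class TimeoutAwareChannel {

  // If the request future times out, close the Netty channel so the next
  // request opens a fresh connection and re-resolves the broker address
  // instead of reusing the stale IP.
  static <T> CompletableFuture<T> closeOnTimeout(
      final CompletableFuture<T> response, final Channel channel) {
    return response.whenComplete(
        (result, error) -> {
          // The timeout may arrive directly or wrapped in a CompletionException.
          if (error instanceof TimeoutException
              || (error != null && error.getCause() instanceof TimeoutException)) {
            channel.close();
          }
        });
  }
}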

The fix was merged on main and the scheduled QA run succeeded.

(screenshot: main)

After backporting to 8.1, the QA also succeeded again:

(screenshot: bp81)

Self-managed also runs through when executing the IT locally. There is still the question of why it never failed with self-managed.
(screenshot: run)

My thought was that maybe the gateway gets restarted as well in between, since we have seen that restarting the gateway helps. @npepinpe mentioned that it might be related to the IP assignments: in self-managed we get the same IP again, which works in this case, while in SaaS we have many more clusters and it is likely not to get the same address. At this point it is not 100% clear, but I feel it is good for now and we can close this issue.