Does the watch stream handle disconnects in V2?

Question

Does the watch stream handle disconnects in V2?

Closed this issue 2 years ago · 13 comments

First of all, thank you for a super useful and well-implemented library.
I have a question regarding Watch.Stream.
In V1, the function watch_and_stream automatically handled timeouts from the Kubernetes API. Looking at the current implementation, it is a bit unclear whether we should handle this ourselves or whether the library does it for us.

Answer 1 · 2023-02-08T15:36:33.000Z

Hi @Hanspagh
It definitely should. If it doesn't, I'd consider it a bug. Do you have any indication that it doesn't, or do you just want to make sure?

Answer 2 · 2023-02-13T09:00:16.000Z

I do not have anything concrete, just the indication that after a while (about half a day) I no longer seem to get any new events. I am using Kubernetes in Azure, and we have noticed that in Python we need to handle timeouts explicitly, since they would otherwise shut down our watchers. So I was just wondering whether the same could be happening here.

Answer 3 · 2023-02-13T09:43:36.000Z

Half a day? Very strange... But I'm gonna have a look...

Answer 4 · 2023-02-13T10:00:06.000Z

I have enabled debug logging to see if I can get a bit more information. I know the Kubernetes API server sets a semi-random timeout on all watch requests to spread out the load; see the --min-request-timeout flag:
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/

In V1 I could see that you handle the timeout explicitly here, and I couldn't find the corresponding code in V2. That is what got me wondering.

Answer 5 · 2023-02-13T12:36:45.000Z

Hey @Hanspagh, I've tried to reproduce this on a local cluster. See my findings:

Steps

I set up a local cluster using kind.
Using docker exec, I updated /etc/kubernetes/manifests/kube-apiserver.yaml, adding - --min-request-timeout=5 to the container command.
I restarted the Docker container.
I started a watch (with debug statements).

Behaviour

Every 6-7s, the watcher receives a BOOKMARK, followed by a :done. The :done is sent by the Mint adapter to signal that the request ended (in this case, probably because of a timeout). Upon receiving a :done, the watcher resumes the watch.
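
The resume-on-:done behaviour described above can be illustrated with a small, self-contained Python simulation. This is only a sketch and not the library's implementation: `fake_watch`, `resilient_watch`, `EVENT_LOG`, and the event dictionaries are hypothetical names made up for this example, and the fake server simply ends each request after a couple of events, the way a server-side timeout would.

```python
from typing import Any, Dict, Iterator, List

Event = Dict[str, Any]

# Hypothetical event history standing in for the API server (assumption).
EVENT_LOG: List[Event] = [
    {"type": "ADDED", "resource_version": 1},
    {"type": "MODIFIED", "resource_version": 2},
    {"type": "BOOKMARK", "resource_version": 3},
    {"type": "MODIFIED", "resource_version": 4},
    {"type": "DELETED", "resource_version": 5},
]


def fake_watch(since: int, max_events: int = 2) -> Iterator[Event]:
    """Simulate one watch request: yield events newer than `since`, then
    end the request after `max_events` (stand-in for a server-side timeout)."""
    newer = [e for e in EVENT_LOG if e["resource_version"] > since]
    yield from newer[:max_events]


def resilient_watch(start: int = 0) -> Iterator[Event]:
    """Re-issue the watch from the last seen resource version each time a
    request ends, so the consumer sees one continuous stream."""
    last_seen = start
    while True:
        received_any = False
        for event in fake_watch(last_seen):
            received_any = True
            # Bookmarks carry no object change; they only advance the cursor.
            last_seen = event["resource_version"]
            if event["type"] != "BOOKMARK":
                yield event
        if not received_any:
            # Simulation only: stop once the log is exhausted.
            # A real watcher would keep re-issuing the request.
            break


if __name__ == "__main__":
    for event in resilient_watch():
        print(event["type"], event["resource_version"])
```

The consumer never sees the request boundaries or the bookmark; it only sees the four object events, in order.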

Conclusion

I have to assume the watcher works as expected.

Now what...

I'm still curious however about the reason why you stop receiving events. Does the debug logging shed any light?

"I do not seem to be getting any new events."

Is the process running the stream still alive at that point?

"I am using Kubernetes in azure"

In what version?

Answer 6 · 2023-02-13T14:18:50.000Z

Thank you for investigating this. I will report back once this happens again; it should not take more than a day.

I am using Flow on top of the stream to capture updates over a time window, so maybe something in there is a cause of the problems.

Answer 7 · 2023-02-13T16:12:33.000Z

Could it be the api-server going away? Because of maintenance or so... but that would not occur on a daily basis I guess...

Answer 8 · 2023-02-15T09:57:52.000Z

I enabled debug logging, and the streams have now been running fine for 48 hours. This might have been an Azure Kubernetes thing after all. Sorry for the inconvenience. I will comment or reopen if I find out what caused this.

Answer 9 · 2023-02-15T19:56:19.000Z

I'm gonna have to re-open this. It was there directly in front of me and I did not see it. But when the server goes away (in my test I can simulate this by restarting the docker container running the cluster), k8s doesn't always recognise this. Since the connection is cached it tries to make requests using the same (closed) connection. Or worse, the stream just stays open (like in your case).
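
To make the "stream just stays open" failure mode concrete, here is a minimal Python sketch of one generic way to notice a silent stream: an idle deadline on reads. This is an assumption for illustration only, not a description of how the library detects or fixes the problem. `read_with_deadline` and `consume_until_silent` are hypothetical names, and a `queue.Queue` stands in for the network connection.

```python
import queue
from typing import Iterator, List


def read_with_deadline(events: "queue.Queue[str]", idle_timeout: float) -> Iterator[str]:
    """Yield events as they arrive; raise TimeoutError if nothing arrives
    within `idle_timeout` seconds (the stream is open but silent)."""
    while True:
        try:
            yield events.get(timeout=idle_timeout)
        except queue.Empty:
            raise TimeoutError("no event within idle deadline") from None


def consume_until_silent(events: "queue.Queue[str]", idle_timeout: float = 0.05) -> List[str]:
    """Collect events until the stream goes silent, then return what was seen.
    At that point a caller would discard the connection and start a new watch."""
    received: List[str] = []
    try:
        for event in read_with_deadline(events, idle_timeout):
            received.append(event)
    except TimeoutError:
        pass  # treat the silent stream as dead instead of waiting forever
    return received


if __name__ == "__main__":
    q: "queue.Queue[str]" = queue.Queue()
    q.put("ADDED")
    q.put("MODIFIED")
    # After these two events the "server" goes away without closing the stream.
    print(consume_until_silent(q))
```

Without the deadline, the loop would block on the third read indefinitely, which matches the symptom of a watcher that is alive but receives no events.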

Answer 10 · 2023-02-15T22:40:36.000Z

Ahh, interesting, I guess that could have happened for my cluster.

Answer 11 · 2023-02-16T14:43:32.000Z

I'm pretty sure about it. I'm trying to fix this, but I'm struggling and currently have limited time. But it's an important requirement for Bonny.

Answer 12 · 2023-02-16T15:36:25.000Z

OK I think I have working code. Starting a PR soon.

Answer 13 · 2023-02-16T20:20:03.000Z

This should be fixed in 2.0.3. @Hanspagh could you please keep me posted? Thanks!