[bug]: k8s.KubernetesException: too old resource version: xxxx (xxxx) on re-connect
rpearsondev opened this issue · 6 comments
Describe the bug
When the operator has been running for a while and the watch of a resource has failed it correctly tries to re-connect. However when it tries to re-start the watch of the entity, it can return the error k8s.KubernetesException: too old resource version: xxxx (xxxx). This is because the operator has been in a re-connecting state long enought that the currently stored resource version (in memory int the opertor) is too far behind the resource version of the resource.
To reproduce
It is difficult to reproduce because it relies on a transient error to initiate it, however here is the profile of the issue.
There has to be an initial failure in the watch of a resource. In my case it was "The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)".
Then on each successive re-connection attempt it fails with "too old resource version: 68358497 (68368462)"
here is the logs filtered to only what is relevant:
timestamp [UTC] | message | error |
---|---|---|
7/16/2024, 1:05:49.052 PM | Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358100. | |
7/16/2024, 1:06:02.432 PM | Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358497. | |
7/16/2024, 1:06:44.608 PM | Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68357572. | |
7/16/2024, 1:06:49.407 PM | Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358100. | |
7/16/2024, 1:07:02.387 PM | Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68358497. | |
7/16/2024, 1:07:43.839 PM | Received watch event "Bookmark" for "Hidden/(null)", last observed resource version: 68368670. | |
7/16/2024, 1:07:47.938 PM | Watcher for Hidden was terminated and is reconnecting. | |
7/16/2024, 1:07:47.943 PM | Watcher for Hidden was terminated and is reconnecting. | |
7/16/2024, 1:07:47.943 PM | Watcher for Hidden was terminated and is reconnecting. | |
7/16/2024, 1:07:47.949 PM | There was an error while watching the resource "ObscureNamespace.V1EntityR". | The response ended prematurely while waiting for the next frame from the server. (ResponseEnded) |
7/16/2024, 1:07:47.949 PM | There were 1 errors / retries in the watcher. Wait 2.456s before next attempt to connect. | |
7/16/2024, 1:07:48.030 PM | There was an error while watching the resource "ObscureNamespace.V1EntityPM". | too old resource version: 68358497 (68368462) |
7/16/2024, 1:07:48.030 PM | There were 1 errors / retries in the watcher. Wait 2.197s before next attempt to connect. | |
7/16/2024, 1:07:48.097 PM | There was an error while watching the resource "ObscureNamespace.V1EntitySRB". | too old resource version: 68358844 (68369473) |
7/16/2024, 1:07:48.097 PM | There were 1 errors / retries in the watcher. Wait 2.823s before next attempt to connect. | |
7/16/2024, 1:07:50.230 PM | Watcher for V1EntityPM was terminated and is reconnecting. | |
7/16/2024, 1:07:50.232 PM | There was an error while watching the resource "ObscureNamespace.V1EntityPM". | too old resource version: 68358497 (68368462) |
7/16/2024, 1:07:50.232 PM | There were 2 errors / retries in the watcher. Wait 4.481s before next attempt to connect. | |
7/16/2024, 1:07:50.403 PM | Watcher for V1EntityR was terminated and is reconnecting. | |
7/16/2024, 1:07:50.405 PM | There was an error while watching the resource "ObscureNamespace.V1EntityR". | too old resource version: 68358100 (68368631) |
7/16/2024, 1:07:50.405 PM | There were 2 errors / retries in the watcher. Wait 4.085s before next attempt to connect. | |
7/16/2024, 1:07:50.918 PM | Watcher for V1EntitySRB was terminated and is reconnecting. | |
7/16/2024, 1:07:50.920 PM | There was an error while watching the resource "ObscureNamespace.V1EntitySRB". | too old resource version: 68358844 (68369473) |
7/16/2024, 1:07:50.920 PM | There were 2 errors / retries in the watcher. Wait 4.945s before next attempt to connect. | |
7/16/2024, 1:07:54.490 PM | Watcher for V1EntityR was terminated and is reconnecting. | |
7/16/2024, 1:07:54.493 PM | There was an error while watching the resource "ObscureNamespace.V1EntityR". | too old resource version: 68358100 (68368631) |
7/16/2024, 1:07:54.493 PM | There were 3 errors / retries in the watcher. Wait 8.04s before next attempt to connect. | |
7/16/2024, 1:07:54.714 PM | Watcher for V1EntityPM was terminated and is reconnecting. | |
7/16/2024, 1:07:54.716 PM | There was an error while watching the resource "ObscureNamespace.V1EntityPM". | too old resource version: 68358497 (68368462) |
7/16/2024, 1:07:54.716 PM | There were 3 errors / retries in the watcher. Wait 8.351s before next attempt to connect. | |
7/16/2024, 1:07:55.866 PM | Watcher for V1EntitySRB was terminated and is reconnecting. | |
7/16/2024, 1:07:55.869 PM | There was an error while watching the resource "ObscureNamespace.V1EntitySRB". | too old resource version: 68358844 (68369473) |
7/16/2024, 1:07:55.869 PM | There were 3 errors / retries in the watcher. Wait 8.888s before next attempt to connect. | |
7/16/2024, 1:08:02.532 PM | Watcher for V1EntityR was terminated and is reconnecting. | |
7/16/2024, 1:08:02.534 PM | There was an error while watching the resource "ObscureNamespace.V1EntityR". | too old resource version: 68358100 (68368631) |
7/16/2024, 1:08:02.534 PM | There were 4 errors / retries in the watcher. Wait 16.606s before next attempt to connect. | |
7/16/2024, 1:08:03.066 PM | Watcher for V1EntityPM was terminated and is reconnecting. | |
7/16/2024, 1:08:03.068 PM | There was an error while watching the resource "ObscureNamespace.V1EntityPM". | too old resource version: 68358497 (68368462) |
7/16/2024, 1:08:03.068 PM | There were 4 errors / retries in the watcher. Wait 16.1s before next attempt to connect. | |
7/16/2024, 1:08:04.756 PM | Watcher for V1EntitySRB was terminated and is reconnecting. | |
7/16/2024, 1:08:04.758 PM | There was an error while watching the resource "ObscureNamespace.V1EntitySRB". | too old resource version: 68358844 (68369473) |
7/16/2024, 1:08:04.758 PM | There were 4 errors / retries in the watcher. Wait 16.796s before next attempt to connect. | |
7/16/2024, 1:08:19.143 PM | Watcher for V1EntityR was terminated and is reconnecting. | |
7/16/2024, 1:08:19.148 PM | There was an error while watching the resource "ObscureNamespace.V1EntityR". | too old resource version: 68358100 (68368631) |
7/16/2024, 1:08:19.148 PM | There were 5 errors / retries in the watcher. Wait 32.544s before next attempt to connect. | |
7/16/2024, 1:08:19.168 PM | Watcher for V1EntityPM was terminated and is reconnecting. | |
7/16/2024, 1:08:19.169 PM | There was an error while watching the resource "ObscureNamespace.V1EntityPM". | too old resource version: 68358497 (68368462) |
7/16/2024, 1:08:19.169 PM | There were 5 errors / retries in the watcher. Wait 32.427s before next attempt to connect. | |
7/16/2024, 1:08:21.555 PM | Watcher for V1EntitySRB was terminated and is reconnecting. | |
7/16/2024, 1:08:21.557 PM | There was an error while watching the resource "ObscureNamespace.V1EntitySRB". | too old resource version: 68358844 (68369473) |
7/16/2024, 1:08:21.557 PM | There were 5 errors / retries in the watcher. Wait 32.521s before next attempt to connect. | |
7/16/2024, 1:08:51.596 PM | Watcher for V1EntityPM was terminated and is reconnecting. |
It continues in this way forever. Untill the pod is restarted.
Expected behavior
On error an error the watcher should start again (regardless of if the currentVersion is too old) so that the operator doesnt stop reconciling CR's
Screenshots
No response
Additional Context
We are using the KubeOps.Operator 9.1.2 nuget package.
I have created a PR with a suggested fix: #792
Yes, although it did retry, but the version is too far off, plus you won't catch up with missing event:
k8s.KubernetesException: too old resource version: 8271646 (8271663)
at KubeOps.KubernetesClient.KubernetesClient.<>c__26`1.<WatchAsync>b__26_0(Exception ex)
at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+MoveNext()
at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
at KubeOps.KubernetesClient.KubernetesClient.WatchAsync[TEntity](String namespace, String resourceVersion, String labelSelector, Nullable`1 allowWatchBookmarks, CancellationToken cancellationToken)+MoveNext()
at KubeOps.KubernetesClient.KubernetesClient.WatchAsync[TEntity](String namespace, String resourceVersion, String labelSelector, Nullable`1 allowWatchBookmarks, CancellationToken cancellationToken)+MoveNext()
at KubeOps.KubernetesClient.KubernetesClient.WatchAsync[TEntity](String namespace, String resourceVersion, String labelSelector, Nullable`1 allowWatchBookmarks, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)
at KubeOps.Operator.Watcher.ResourceWatcher`1.WatchClientEventsAsync(CancellationToken stoppingToken)
warn: KubeOps.Operator.Watcher.ResourceWatcher[0]
There were 163 errors / retries in the watcher. Wait 32.255s before next attempt to connect.
KubeOps.Operator.Watcher.ResourceWatcher: Warning: There were 163 errors / retries in the watcher. Wait 32.255s before next attempt to connect.
I've also encountered this issue in my environment.
+1
@buehler Encountering the same - seems to build up over time - resolved for us by restarting the pods, which is obviously not ideal.
We are also seing this issue at all our environments