buehler/dotnet-operator-sdk

System.IO.IOException: The request was aborted in the KubeOps.Operator.Kubernetes.ResourceWatcher


Describe the bug
The app stops watching designated resources after encountering the System.IO.IOException in the KubeOps.Operator.Kubernetes.ResourceWatcher.

To Reproduce
It happens after running a pod for a while. Conditions are not clear to me.

Expected behavior
Pod handles transient exceptions and continues to watch designated resources.

Additional context
The callstack:

fail: KubeOps.Operator.Kubernetes.ResourceWatcher[0]
      There was an error while watching the resource "MyResource".
      System.IO.IOException: The request was aborted.
       ---> System.IO.IOException: The response ended prematurely while waiting for the next frame from the server.
         at System.Net.Http.Http2Connection.<ReadFrameAsync>g__ThrowMissingFrame|57_1()
         at System.Net.Http.Http2Connection.ReadFrameAsync(Boolean initialFrame)
         at System.Net.Http.Http2Connection.ProcessIncomingFramesAsync()
         --- End of inner exception stack trace ---
         at System.Net.Http.Http2Connection.ThrowRequestAborted(Exception innerException)
         at System.Net.Http.Http2Connection.Http2Stream.CheckResponseBodyState()
         at System.Net.Http.Http2Connection.Http2Stream.TryReadFromBuffer(Span`1 buffer, Boolean partOfSyncRead)
         at System.Net.Http.Http2Connection.Http2Stream.ReadDataAsync(Memory`1 buffer, HttpResponseMessage responseMessage, CancellationToken cancellationToken)
         at k8s.LineSeparatedHttpContent.CancelableStream.ReadAsync(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
         at System.IO.StreamReader.ReadBufferAsync(CancellationToken cancellationToken)
         at System.IO.StreamReader.ReadLineAsyncInternal()
         at k8s.Watcher`1.<>c.<CreateWatchEventEnumerator>b__21_1[TR](Task`1 t)
         at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
         at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
      --- End of stack trace from previous location ---
         at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
      --- End of stack trace from previous location ---
         at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+MoveNext()
         at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
         at k8s.Watcher`1.WatcherLoop(CancellationToken cancellationToken)
         at k8s.Watcher`1.WatcherLoop(CancellationToken cancellationToken)

Edit:
KubeOps 6.6.0
k8s 1.21.14-gke.700

It is happening for me as well.
In my case I'm running:
KubeOps 6.0.18
K8s 1.22.9

I've double-checked and this issue is only happening on my AKS clusters. Please take a look at this issue at the official k8s dotnet client repo: Watch Method stops watching. A workaround is suggested in that thread; I'll try it myself and come back with the results.

@alexander-klang Is your environment by any chance running on AKS as well?

@Crespalvaro I'm running on GKE (also added version info to the original post). Checked the workaround - please let us know if it works!

Hi @alexander-klang.

I've been doing multiple tests regarding the subject, and unfortunately I cannot come up with a definitive answer. However, I will still post the results of my research here so maybe we can arrive at a solution together.

First of all, the hack suggested on the official C# client GitHub is only a mitigation strategy; it won't completely fix this issue, and it didn't for me. I've also seen that this hack was integrated in version 7.0.13 and removed in version 7.1.x, apparently in favour of a better strategy using k8s.Autorest as a replacement for Microsoft.Rest. Thus, I don't think we should keep looking further in that direction.

I've also played a bit with the timeout configuration and the reconcile requeue interval; it seems to have helped, although I'm still facing the issue.
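For reference, a rough sketch of what such tuning could look like. The property names (ErrorBackoffStrategy, WatcherMaxRetrySeconds) are the ones referenced later in this thread; whether they are exposed exactly like this on the settings object passed to AddKubernetesOperator depends on the KubeOps version, so treat this as an assumption to verify:

// Sketch only: verify these settings exist in your KubeOps version before using this.
services.AddKubernetesOperator(settings =>
{
    // Cap any single reconnect delay at 5 minutes.
    settings.WatcherMaxRetrySeconds = 300;

    // Exponential backoff with a clamped exponent so the delay can never
    // overflow TimeSpan, on top of the ceiling above.
    settings.ErrorBackoffStrategy = attempt =>
        TimeSpan.FromSeconds(Math.Pow(2, Math.Min(attempt, 8)));
});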

Finally, I dug into this repository's code and found out that it already deals not only with the server closing the connection but also with any exception thrown, in ResourceWatcher.OnException. This disposes the k8s Watcher and recreates it. I'm confused, because OnException reconnects with a backoff strategy and OnClose does essentially the same. The only difference I see in the code is that _reconnectHandler awaits WatchResource, whereas OnClose calls RestartWatcher without awaiting it, so it won't wait, even though RestartWatcher itself also awaits WatchResource.
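To make that distinction concrete, here is a generic illustration of awaited vs. fire-and-forget calls (not the actual library code; WatchResource is a stand-in):

using System;
using System.Threading.Tasks;

class AwaitDemo
{
    // Stand-in for WatchResource: an async operation that fails.
    static async Task WatchResource()
    {
        await Task.Yield();
        throw new InvalidOperationException("watch failed");
    }

    static async Task Main()
    {
        // Fire-and-forget (the OnClose-style call): the caller neither waits
        // for completion nor observes the exception here.
        _ = WatchResource();

        // Awaited (the reconnect-handler-style call): the caller waits and
        // the exception surfaces at the await.
        try { await WatchResource(); }
        catch (Exception e) { Console.WriteLine(e.Message); }
    }
}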

Below are some traces of the two cases (keep in mind the traces were captured in Grafana for in-depth tracking, so they are in reverse chronological order):

Server Close Connection

2022-11-02 16:23:40 | Execute/Reconcile event "Reconcile" on resource "*****".
2022-11-02 16:23:40 | dbug: KubeOps.Operator.Controller.ManagedResourceController[0]
2022-11-02 16:23:40 | Mapped watch event to "Reconcile" for "*****"
2022-11-02 16:23:40 | trce: KubeOps.Operator.Controller.ManagedResourceController[0]
2022-11-02 16:23:40 | Mapping cache result "Other" for "******".
2022-11-02 16:23:40 | trce: KubeOps.Operator.Controller.ManagedResourceController[0]
2022-11-02 16:23:40 | Mapping watcher event "Added" for "*******".
2022-11-02 16:23:40 | trce: KubeOps.Operator.Controller.ManagedResourceController[0]
2022-11-02 16:23:40 | Received watch event "Added" for "*******".
2022-11-02 16:23:40 | trce: KubeOps.Operator.Kubernetes.ResourceWatcher[0]
2022-11-02 16:23:40 | Restarting resource watcher for type "******".
2022-11-02 16:23:40 | trce: KubeOps.Operator.Kubernetes.ResourceWatcher[0]
2022-11-02 16:23:40 | The server closed the connection. Trying to reconnect.

Connection Reset Exception Caught

2022-11-01 09:19:44 | Connection id "0HMLR4QKDK1P0" accepted.
2022-11-01 09:19:44 | dbug: Microsoft.AspNetCore.Server.Kestrel[39]
2022-11-01 09:19:40 | Received watch event "Added" for "******".
2022-11-01 09:19:40 | trce: KubeOps.Operator.Kubernetes.ResourceWatcher[0]
2022-11-01 09:19:37 | Trying to reconnect with exponential backoff 00:00:02.9860000.
2022-11-01 09:19:37 | info: KubeOps.Operator.Kubernetes.ResourceWatcher[0]
2022-11-01 09:19:37 | at k8s.Watcher`1.WatcherLoop(CancellationToken cancellationToken)
2022-11-01 09:19:37 | at k8s.Watcher`1.WatcherLoop(CancellationToken cancellationToken)
2022-11-01 09:19:37 | at k8s.Watcher`1.CreateWatchEventEnumerator(Func`1 streamReaderCreator, Action`1 onError, CancellationToken cancellationToken)+MoveNext()
2022-11-01 09:19:37 | --- End of stack trace from previous location ---
2022-11-01 09:19:37 | at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
2022-11-01 09:19:37 | --- End of stack trace from previous location ---
2022-11-01 09:19:37 | at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
2022-11-01 09:19:37 | at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
2022-11-01 09:19:37 | at k8s.Watcher`1.<>c.<CreateWatchEventEnumerator>b__21_1[TR](Task`1 t)
2022-11-01 09:19:37 | at System.IO.StreamReader.ReadLineAsyncInternal()
2022-11-01 09:19:37 | at System.IO.StreamReader.ReadBufferAsync(CancellationToken cancellationToken)
2022-11-01 09:19:37 | at k8s.LineSeparatedHttpContent.CancelableStream.ReadAsync(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
2022-11-01 09:19:37 | at System.Net.Http.Http2Connection.Http2Stream.ReadDataAsync(Memory`1 buffer, HttpResponseMessage responseMessage, CancellationToken cancellationToken)
2022-11-01 09:19:37 | at System.Net.Http.Http2Connection.Http2Stream.TryReadFromBuffer(Span`1 buffer, Boolean partOfSyncRead)
2022-11-01 09:19:37 | at System.Net.Http.Http2Connection.Http2Stream.CheckResponseBodyState()
2022-11-01 09:19:37 | at System.Net.Http.Http2Connection.ThrowRequestAborted(Exception innerException)
2022-11-01 09:19:37 | --- End of inner exception stack trace ---
2022-11-01 09:19:37 | at System.Net.Http.Http2Connection.ProcessIncomingFramesAsync()
2022-11-01 09:19:37 | at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer)
2022-11-01 09:19:37 | at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
2022-11-01 09:19:37 | at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
2022-11-01 09:19:37 | --- End of inner exception stack trace ---
2022-11-01 09:19:37 | ---> System.Net.Sockets.SocketException (104): Connection reset by peer
2022-11-01 09:19:37 | ---> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer.
2022-11-01 09:19:37 | System.IO.IOException: The request was aborted.
2022-11-01 09:19:37 | There was an error while watching the resource "*******".
2022-11-01 09:19:37 | fail: KubeOps.Operator.Kubernetes.ResourceWatcher[0]

As you can see, OnClose successfully recovers, whereas OnException successfully restarts the watcher and catches a new "Received watch event "Added"" without propagating it to the EventQueue event mapping.

It is frustrating, because it seems like we are properly dealing with the error, but somehow there is something else that remains broken, and I could not identify it. @buehler Does this investigation give you an idea of what could be wrong?

I hope this is helpful and that I have explained myself in an understandable way.

We observe the same problem in our operator with KubeOps 5.4.1 as well as the latest version, KubeOps 6.6.1, on Kubernetes 1.22. It is really difficult to debug because of its sporadic nature, but it is a real showstopper for us, as the users of our operator rely on its stability without manual intervention such as restarts. Is there any progress on the issue? Thank you very much to all who have contributed to the investigation already!

This is virtually impossible to retrace. When Kubernetes (or its API) closes the connection, there is nothing you can do. However, maybe we should actively let the server crash after some retries and trigger a redeployment of the pod... but that seems to be rather harsh treatment...

I wonder how this is done, e.g., in the Golang Operator SDK? I agree it sounds like harsh treatment, but it would still be better than having an operator that no longer reconciles... especially in cases like ours, where we provide the operator to be used by others on their Kubernetes clusters.

@anekdoti The best solution for us was to implement readiness and liveness probes. Basically, when the server closes the connection, the watcher remains broken and reconciles stop occurring.

Everything in my prior investigation helps mitigate the occurrences; on top of that, these health checks will automatically restart the operator whenever it happens, fixing the issue.

An implementation example could be as follows (it is, in fact, the one I implemented): set a timestamp on the status of your custom resources on every reconcile; the health checks then fetch the resources and check whether they have been idle (no updates) for more than x time (in my case 10 minutes). At this point you can play with failureThreshold (max retries), periodSeconds (execution frequency), etc. Check the docs if needed: Liveness, Readiness Probes.
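A minimal sketch of such a liveness check, simplified to an in-process heartbeat instead of reading the timestamp back from the resource status; all class and member names here are illustrative, not part of KubeOps:

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Singleton updated from the controller's ReconcileAsync via heartbeat.Beat().
public class ReconcileHeartbeat
{
    private long _lastReconcileTicks = DateTimeOffset.UtcNow.UtcTicks;

    public void Beat() =>
        Interlocked.Exchange(ref _lastReconcileTicks, DateTimeOffset.UtcNow.UtcTicks);

    public TimeSpan Idle =>
        DateTimeOffset.UtcNow - new DateTimeOffset(Interlocked.Read(ref _lastReconcileTicks), TimeSpan.Zero);
}

// Standard ASP.NET Core health check that fails once no reconcile has happened
// for 10 minutes, so the kubelet's livenessProbe restarts the pod.
public class ReconcileLivenessCheck : IHealthCheck
{
    private readonly ReconcileHeartbeat _heartbeat;

    public ReconcileLivenessCheck(ReconcileHeartbeat heartbeat) => _heartbeat = heartbeat;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default) =>
        Task.FromResult(_heartbeat.Idle > TimeSpan.FromMinutes(10)
            ? HealthCheckResult.Unhealthy($"No reconcile for {_heartbeat.Idle}.")
            : HealthCheckResult.Healthy());
}

Register the heartbeat as a singleton and the check via services.AddHealthChecks().AddCheck<ReconcileLivenessCheck>("reconcile"), call heartbeat.Beat() at the end of ReconcileAsync, expose the health endpoint, and point the deployment's livenessProbe at it, with failureThreshold and periodSeconds tuned to your tolerance.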

I hope this helps you and anyone facing this issue now or in the future.

Hi,
I think that I have really found out what "terminates" the reconnect subscription. More precisely, the publisher (the IObserver<TimeSpan> part of the ISubject<TimeSpan>) stops emitting new values into the pipeline, and the IObservable<TimeSpan> remains idle.
The resetting logic of the _reconnectAttempts variable resets it only in the case of successful reconnections.

Occasionally (or is it rather exactly after 40 consecutively failed reconnection attempts?) _reconnectAttempts reaches the threshold of 40 retries, and then, as depicted below, an unhandled System.OverflowException is thrown from the _settings.ErrorBackoffStrategy (ExponentialBackoffStrategy) delegate:

(screenshot: the unhandled System.OverflowException thrown from the ExponentialBackoffStrategy delegate)
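The threshold of 40 is consistent with an exponential strategy on the order of 2^attempts seconds (an assumption about the shape of the default strategy): TimeSpan.MaxValue is roughly 9.2e11 seconds, so 2^39 seconds still fits while 2^40 seconds does not.

// 2^39 s fits into a TimeSpan (~6.4 million days); 2^40 s exceeds
// TimeSpan.MaxValue (~9.2e11 s) and throws System.OverflowException.
Console.WriteLine(TimeSpan.FromSeconds(Math.Pow(2, 39)));
Console.WriteLine(TimeSpan.FromSeconds(Math.Pow(2, 40))); // throws OverflowException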

After this, _reconnectHandler.OnNext(backoff) is not called anymore to retrigger a new Observable.Timer with the backoff.
Here is an illustrative fix:

const int MaxRetriesAttempts = 39;

++_reconnectAttempts;

_logger.LogError(e, @"There was an error while watching the resource ""{resource}"".", typeof(TEntity));
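// Cap the attempt count passed to the backoff strategy so the exponential delay cannot overflow TimeSpan.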
var backoff = _settings.ErrorBackoffStrategy(_reconnectAttempts > MaxRetriesAttempts ? MaxRetriesAttempts : _reconnectAttempts);
if (backoff.TotalSeconds > _settings.WatcherMaxRetrySeconds)
{
    backoff = TimeSpan.FromSeconds(_settings.WatcherMaxRetrySeconds);
}

Of course, adjust it to best fit the library. Would it make sense to also surround some parts of the OnException method with a try/catch block to ensure that OnNext is called?

try
{
    //...
}
catch (Exception exception)
{
    _logger.LogError(exception, "Failed...");
}
finally
{
    _reconnectHandler.OnNext(backoff);
}
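
Putting the two suggestions together, OnException could take roughly the following shape (a hypothetical sketch based on the fields mentioned in this thread, not the actual library code):

private void OnException(Exception e)
{
    // Dispose the broken watcher; it gets recreated on reconnect.
    _watcher?.Dispose();
    _watcher = null;

    ++_reconnectAttempts;
    _logger.LogError(e, @"There was an error while watching the resource ""{resource}"".", typeof(TEntity));

    // Fall back to the maximum delay if computing the backoff fails.
    var backoff = TimeSpan.FromSeconds(_settings.WatcherMaxRetrySeconds);
    try
    {
        // Cap the attempts so the exponential strategy cannot overflow TimeSpan.
        var candidate = _settings.ErrorBackoffStrategy(Math.Min(_reconnectAttempts, MaxRetriesAttempts));
        if (candidate.TotalSeconds <= _settings.WatcherMaxRetrySeconds)
        {
            backoff = candidate;
        }
    }
    catch (Exception exception)
    {
        _logger.LogError(exception, "Failed to compute the backoff.");
    }
    finally
    {
        // Always re-emit into the reconnect pipeline so the subscription never stalls.
        _reconnectHandler.OnNext(backoff);
    }
}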

The end result of the issue described above is that the operator stops watching resources of type TEntity.

Thank you in advance!

Hello @Crespalvaro and @tomasfabian !

First, thank you very much, @Crespalvaro, for sharing your workaround. We were actually about to implement our interpretation of it when @tomasfabian shared his analysis with me (thank you!). We had a call about it, and I think @tomasfabian is on to something - the problem described is so subtle that I could easily imagine how it did not get caught earlier.

What do you think (also @buehler)?

Wow..!
Thank you all for the investigative work!

I think the issue with the backoff strategy seems legit.

Hi @buehler @anekdoti @alexander-klang @Crespalvaro and everyone else!

I prepared a PR which should fix the issue regarding the stopped reconciliations.

Thank you @anekdoti for leading me toward the provided fix during our call!