dotnet/orleans

Silo won't recover

pablo-salad opened this issue · 10 comments

Context:

  • We are running a heterogeneous Orleans cluster with a total of 4 silos (2 silos of each configuration).
  • The cluster has been active for approximately 18 hours.
  • We are using the latest NuGet packages for all Orleans and OrleansContrib dependencies.
  • The membership table is hosted in Kubernetes and is accurate, showing 4 active silos and some old, defunct silos (still within the DefunctSiloExpiration period). A minimal configuration sketch follows this list.
  • Silo 10.216.2.133 began exhibiting failure behavior, where:
    • Messages originating from this silo fail.
      Orleans.Runtime.OrleansException: Current directory at S10.216.2.133:11111:86745024 is not stable to perform the lookup for grainId user/05f621896f744c099f6136809969d981 (it maps to S10.216.3.135:11111:86731531, which is not a valid silo). Retry later.
    • Messages destined for this silo time out.
      Response did not arrive on time in 00:00:30 for message: Request [S10.216.3.136:11111:86745026 sys.svc.dir.cache-validator/10.216.3.136:11111@86745026]->[S10.216.2.133:11111:86745024 sys.svc.dir.cache-validator/10.216.2.133:11111@86745024] Orleans.Runtime.IRemoteGrainDirectory.LookUpMany(System.Collections.Generic.List`1[System.ValueTuple`2[Orleans.Runtime.GrainId,System.Int32]]) #1865509. About to break its promise.
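
For reference, a minimal hosting sketch of the setup described above. It assumes the OrleansContrib Orleans.Clustering.Kubernetes provider (UseKubeMembership) for the membership table; the cluster ID, service ID, and ports are placeholders and may not match the actual deployment.

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

// Sketch of a silo host with Kubernetes-backed membership.
// Assumes the OrleansContrib Orleans.Clustering.Kubernetes package
// (UseKubeMembership); IDs and ports are placeholders.
await Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        silo.Configure<ClusterOptions>(options =>
        {
            options.ClusterId = "my-cluster";   // placeholder
            options.ServiceId = "my-service";   // placeholder
        });

        // Membership table stored in Kubernetes.
        silo.UseKubeMembership();

        // 11111/30000 are the conventional Orleans silo and gateway ports.
        silo.ConfigureEndpoints(siloPort: 11111, gatewayPort: 30000);
    })
    .RunConsoleAsync();
```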

Steps Taken:

  • After identifying the issue with silo 10.216.2.133, we stopped the silo.
  • Once the silo was stopped, the remaining silos picked up the ring change, and the cluster resumed normal operations immediately.

Questions/Concerns:

  • How did the cluster end up in this state?
    • We observed the "retry later" error message in the logs for the failing silo but are unsure why the silo did not recover on its own.
  • Is there something missing in our health checks?
    • We want to know if there is a configuration or mechanism we are missing that could detect and prevent a failing silo from continuing to attempt message processing (see the sketch after this list).
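
One way to surface this kind of degradation is to expose the silo's own health state through ASP.NET Core health checks by polling the IHealthCheckParticipant instances that Orleans registers in the silo's service container. This is only a sketch: the class name is illustrative, the field is not guarded against concurrent checks, and the health-check registration and Kubernetes probe wiring are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Orleans.Runtime;

// Sketch: report the silo as Degraded when any Orleans health-check
// participant reports a problem since the last check.
public sealed class SiloHealthCheck : IHealthCheck
{
    private readonly IEnumerable<IHealthCheckParticipant> _participants;
    private DateTime _lastCheckTime = DateTime.UtcNow;

    public SiloHealthCheck(IEnumerable<IHealthCheckParticipant> participants)
        => _participants = participants;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var since = _lastCheckTime;
        _lastCheckTime = DateTime.UtcNow;

        var reasons = new List<string>();
        foreach (var participant in _participants)
        {
            if (!participant.CheckHealth(since, out var reason))
            {
                reasons.Add(reason);
            }
        }

        return Task.FromResult(reasons.Count == 0
            ? HealthCheckResult.Healthy()
            : HealthCheckResult.Degraded(string.Join("; ", reasons)));
    }
}
```

Registered via services.AddHealthChecks().AddCheck<SiloHealthCheck>("silo") and exposed to a Kubernetes liveness or readiness probe, a check like this lets the orchestrator take a silo that reports itself degraded out of rotation or restart it.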

Do you see any logs mentioning LocalSiloHealthMonitor? If so, they may be useful. Are you able to post your configuration, particularly ClusterMembershipOptions? It's logged during startup.

We have some messages like this:
.NET Thread Pool is exhibiting delays of 1.0037756s. This can indicate .NET Thread Pool starvation, very long .NET GC pauses, or other runtime or machine pauses.

The only non-default value in ClusterMembershipOptions is DefunctSiloExpiration = 1 day.
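
In code, that corresponds to something like the sketch below; every other liveness setting (probe counts, timeouts, voting) stays at its default. The helper method is illustrative.

```csharp
using System;
using Orleans.Configuration;
using Orleans.Hosting;

// Sketch: the single non-default membership setting mentioned above,
// applied to an existing ISiloBuilder.
static void ConfigureMembership(ISiloBuilder silo) =>
    silo.Configure<ClusterMembershipOptions>(options =>
    {
        // Keep defunct silo entries in the membership table for one day.
        options.DefunctSiloExpiration = TimeSpan.FromDays(1);
    });
```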

Ok, that Thread Pool delay is an indication that application performance is severely degraded for some reason. Profiling or a memory dump might help indicate the cause. Can you tell me more about your scenario? Does your app log at a very high rate? Where are you running your app? If Kubernetes, what resources are provided to the pods, and do you have CPU limits set?
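
As a first, low-cost signal before profiling or taking a dump, the runtime's thread pool counters can be sampled in-process (or observed externally with dotnet-counters). A minimal sketch; the interval and console output are purely illustrative, and in a silo you would log through ILogger instead.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: periodically sample .NET thread pool statistics to spot
// starvation (a growing work-item queue, thread count climbing).
public static class ThreadPoolSampler
{
    public static async Task RunAsync(CancellationToken cancellationToken)
    {
        try
        {
            while (true)
            {
                Console.WriteLine(
                    $"threads={ThreadPool.ThreadCount} " +
                    $"queued={ThreadPool.PendingWorkItemCount} " +
                    $"completed={ThreadPool.CompletedWorkItemCount}");

                await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
            }
        }
        catch (OperationCanceledException)
        {
            // Sampling stopped via the cancellation token.
        }
    }
}
```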