OrleansContrib/Orleans.Clustering.Kubernetes

Pod crash on startup in StatefulSet with Consul Clustering


Version 3.4.3.

siloBuilder
     .UseConsulClustering(opt =>
     {
         opt.Address = new Uri(AppConfig.Orleans.ConsulUrl);
         opt.AclClientToken = AppConfig.Orleans.AclClientToken;
     })
     .UseKubernetesHosting();

I configured the labels and environment variables for my pod according to the docs.

          - name: ORLEANS_SERVICE_ID #Required by Orleans 
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['orleans/serviceId']
          - name: ORLEANS_CLUSTER_ID #Required by Orleans 
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['orleans/clusterId']
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['statefulset.kubernetes.io/pod-name']
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP

I am running Orleans in a Kubernetes StatefulSet. My CI tool deploys the StatefulSet, and the pod then crashes on startup.

System.AggregateException: One or more errors occurred. (Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184])
 ---> Orleans.Runtime.MembershipService.OrleansClusterConnectivityCheckFailedException: Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184]
   at Orleans.Runtime.MembershipService.MembershipAgent.ValidateInitialConnectivity()
   at Orleans.Runtime.MembershipService.MembershipAgent.BecomeActive()
   at Orleans.Runtime.MembershipService.MembershipAgent.<>c__DisplayClass26_0.<<Orleans-ILifecycleParticipant<Orleans-Runtime-ISiloLifecycle>-Participate>g__OnBecomeActiveStart|6>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
   at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
   at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloHost.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
   at UBS.OrleansServer.EntryPoint.Start() in /app/UBS/OrleansServer/EntryPoint.cs:line 102
   --- End of inner exception stack trace ---

I then set the StatefulSet's replica count to 3, and all pods crashed on startup. This happens even with an empty Consul, where no keys/values exist before the pods in the StatefulSet start.

fail: Orleans.Runtime.MembershipService.MembershipAgent[100661]
      Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Silos which did not respond successfully are: [S10.18.123.235:11111:361177868]. Will continue attempting to validate connectivity until 06/12/2021 07:19:33. Attempt #7

After restarting over and over, the pods finally stabilized and all started up. Please see the RESTARTS column below.

NAME                                 READY   STATUS             RESTARTS   AGE
ubs-job-dev-0                        1/1     Running            4          17m
ubs-job-dev-1                        1/1     Running            4          16m
ubs-job-dev-2                        1/1     Running            3          16m

The log reports 7 silos.

ProcessTableUpdate (called from TryUpdateMyStatusGlobalOnce) membership table: 7 silos, 3 are Active, 4 are Dead, Version=<33, 31015>. All silos: [SiloAddress=S10.18.123.246:11111:361178481 SiloName=ubs-job-dev-0 Status=Active, SiloAddress=S10.18.123.199:11111:361178519 SiloName=ubs-job-dev-1 Status=Active, SiloAddress=S10.18.117.114:11111:361178416 SiloName=ubs-job-dev-2 Status=Active, SiloAddress=S10.18.117.114:11111:361178292 SiloName=ubs-job-dev-2 Status=Dead, SiloAddress=S10.18.123.199:11111:361178366 SiloName=ubs-job-dev-1 Status=Dead, SiloAddress=S10.18.123.235:11111:361177868 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.246:11111:361178329 SiloName=ubs-job-dev-0 Status=Dead]
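
To make the count easier to read, here is a minimal sketch that groups the "All silos" entries from the log line above by SiloName. It is only a log-reading helper, not part of Orleans; the regular expression and the variable names (`dump`, `entry`, `summary`) are my own, and it assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The "All silos" list copied from the ProcessTableUpdate log line above.
string dump = string.Join(", ",
    "SiloAddress=S10.18.123.246:11111:361178481 SiloName=ubs-job-dev-0 Status=Active",
    "SiloAddress=S10.18.123.199:11111:361178519 SiloName=ubs-job-dev-1 Status=Active",
    "SiloAddress=S10.18.117.114:11111:361178416 SiloName=ubs-job-dev-2 Status=Active",
    "SiloAddress=S10.18.117.114:11111:361178292 SiloName=ubs-job-dev-2 Status=Dead",
    "SiloAddress=S10.18.123.199:11111:361178366 SiloName=ubs-job-dev-1 Status=Dead",
    "SiloAddress=S10.18.123.235:11111:361177868 SiloName=ubs-job-dev-0 Status=Dead",
    "SiloAddress=S10.18.123.246:11111:361178329 SiloName=ubs-job-dev-0 Status=Dead");

// One match per membership entry; only the pod name and status are needed here.
var entry = new Regex(@"SiloName=(?<name>\S+) Status=(?<status>\w+)");

// Pod name -> (active, dead) entry counts.
var summary = entry.Matches(dump)
    .Cast<Match>()
    .GroupBy(m => m.Groups["name"].Value)
    .ToDictionary(
        g => g.Key,
        g => (Active: g.Count(m => m.Groups["status"].Value == "Active"),
              Dead: g.Count(m => m.Groups["status"].Value == "Dead")));

foreach (var pod in summary.OrderBy(p => p.Key, StringComparer.Ordinal))
{
    Console.WriteLine($"{pod.Key}: {pod.Value.Active} active, {pod.Value.Dead} dead");
}
// Prints:
// ubs-job-dev-0: 1 active, 2 dead
// ubs-job-dev-1: 1 active, 1 dead
// ubs-job-dev-2: 1 active, 1 dead
```

So the 7 entries are 3 pod names, each with one Active entry plus one or two Dead entries left behind by earlier starts.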

And this is how it looks in Consul:
image

There are only 3 pods in this StatefulSet, yet the log reports 7 silos. The SiloName is the pod name. Unlike in a ReplicaSet, the pod name in a StatefulSet does not change after a pod restart. It seems a pod cannot reach the others on startup, so it crashes. The StatefulSet then restarts the crashed pod, and the newly started pod, which has the same pod name, is seen as a new silo.
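
A rough sketch of why the same pod name shows up as several silos: each address in the log has the form `S<ip>:<port>:<number>`, and the trailing number differs on every start. The code below assumes that number is the silo generation, encoded as seconds since 2010-01-01 UTC. That epoch is my assumption about Orleans internals and is not stated anywhere in this thread, although the decoded times do line up with the 06/12/2021 timestamp in the log. The `Decode` helper is hypothetical and handles IPv4 addresses only.

```csharp
using System;
using System.Globalization;

// ASSUMPTION (not stated in this thread): the generation counts seconds since 2010-01-01 UTC.
var epoch = new DateTime(2010, 1, 1, 0, 0, 0, DateTimeKind.Utc);

// The three membership entries recorded for pod ubs-job-dev-0 in the log above.
string[] entries =
{
    "S10.18.123.235:11111:361177868", // Status=Dead
    "S10.18.123.246:11111:361178329", // Status=Dead
    "S10.18.123.246:11111:361178481", // Status=Active
};

foreach (string address in entries)
{
    var (endpoint, startedUtc) = Decode(address, epoch);
    string started = startedUtc.ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);
    Console.WriteLine($"{endpoint} started {started} UTC");
}
// Prints:
// 10.18.123.235:11111 started 2021-06-12 07:11:08 UTC
// 10.18.123.246:11111 started 2021-06-12 07:18:49 UTC
// 10.18.123.246:11111 started 2021-06-12 07:21:21 UTC

// Hypothetical helper: splits "S<ip>:<port>:<generation>" (IPv4 only) and
// converts the generation to a UTC start time under the epoch assumption above.
static (string Endpoint, DateTime StartedUtc) Decode(string siloAddress, DateTime epoch)
{
    string[] parts = siloAddress.TrimStart('S').Split(':');
    int generation = int.Parse(parts[2], CultureInfo.InvariantCulture);
    return ($"{parts[0]}:{parts[1]}", epoch.AddSeconds(generation));
}
```

Under that assumption, ubs-job-dev-0 started three times within about ten minutes, and each start got its own membership entry even where the IP and port were unchanged.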

Are you using K8s membership via the UseKubeMembership() extension method? It looks like in the examples above you are only using official Orleans libraries such as Microsoft.Orleans.OrleansConsulUtils and Microsoft.Orleans.Hosting.Kubernetes. If so, you need to report this issue to the official Orleans project, i.e. https://github.com/dotnet/orleans