Pod crash on startup in StatefulSet with Consul Clustering
Orleans version 3.4.3.
siloBuilder
    .UseConsulClustering(opt =>
    {
        opt.Address = new Uri(AppConfig.Orleans.ConsulUrl);
        opt.AclClientToken = AppConfig.Orleans.AclClientToken;
    })
    .UseKubernetesHosting();
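For context, that snippet sits in an otherwise standard generic host. A simplified sketch of the full setup, assuming Orleans 3.4.x with Microsoft.Orleans.OrleansConsulUtils and Microsoft.Orleans.Hosting.Kubernetes (configuration is inlined from hypothetical CONSUL_URL / CONSUL_ACL_TOKEN environment variables here rather than our AppConfig type):

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

public static class Program
{
    public static Task Main() =>
        Host.CreateDefaultBuilder()
            .UseOrleans(siloBuilder => siloBuilder
                // Consul stores the cluster membership table.
                .UseConsulClustering(opt =>
                {
                    opt.Address = new Uri(Environment.GetEnvironmentVariable("CONSUL_URL")!);
                    opt.AclClientToken = Environment.GetEnvironmentVariable("CONSUL_ACL_TOKEN");
                })
                // Reads POD_NAME, POD_NAMESPACE, POD_IP, ORLEANS_SERVICE_ID and
                // ORLEANS_CLUSTER_ID from the environment (see the pod spec below).
                .UseKubernetesHosting())
            .RunConsoleAsync();
}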
I configured the labels and environment variables for my pod according to the docs:
- name: ORLEANS_SERVICE_ID # Required by Orleans
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['orleans/serviceId']
- name: ORLEANS_CLUSTER_ID # Required by Orleans
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['orleans/clusterId']
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['statefulset.kubernetes.io/pod-name']
- name: POD_NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
- name: POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
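To rule out a wiring problem in the pod spec, a quick startup check (my own helper, not from the Orleans docs) can confirm that the five variables UseKubernetesHosting() depends on are actually populated inside the container:

using System;

static class PodEnvironmentCheck
{
    // Logs the environment variables that UseKubernetesHosting() relies on,
    // so a missing fieldRef mapping is visible before the silo tries to join.
    public static void LogOrleansVariables()
    {
        var names = new[]
        {
            "ORLEANS_SERVICE_ID", "ORLEANS_CLUSTER_ID",
            "POD_NAME", "POD_NAMESPACE", "POD_IP"
        };
        foreach (var name in names)
        {
            Console.WriteLine($"{name} = {Environment.GetEnvironmentVariable(name) ?? "<missing>"}");
        }
    }
}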
I run Orleans in a Kubernetes StatefulSet. My CI tool deploys the StatefulSet, and the silo crashes on startup:
System.AggregateException: One or more errors occurred. (Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184])
---> Orleans.Runtime.MembershipService.OrleansClusterConnectivityCheckFailedException: Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184]
at Orleans.Runtime.MembershipService.MembershipAgent.ValidateInitialConnectivity()
at Orleans.Runtime.MembershipService.MembershipAgent.BecomeActive()
at Orleans.Runtime.MembershipService.MembershipAgent.<>c__DisplayClass26_0.<<Orleans-ILifecycleParticipant<Orleans-Runtime-ISiloLifecycle>-Participate>g__OnBecomeActiveStart|6>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
at Orleans.Hosting.SiloHost.StartAsync(CancellationToken cancellationToken)
at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at UBS.OrleansServer.EntryPoint.Start() in /app/UBS/OrleansServer/EntryPoint.cs:line 102
--- End of inner exception stack trace ---
I tried setting the StatefulSet's replicas to 3; all pods crashed on startup. This happens even with an empty Consul, where no key/values exist before the StatefulSet's pods start:
fail: Orleans.Runtime.MembershipService.MembershipAgent[100661]
Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Silos which did not respond successfully are: [S10.18.123.235:11111:361177868]. Will continue attempting to validate connectivity until 06/12/2021 07:19:33. Attempt #7
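One knob that targets exactly this failing probe is ClusterMembershipOptions.ValidateInitialConnectivity in Orleans 3.x. Disabling it is a sketch of a possible workaround, not a verified fix; it lets a silo join even when stale 'Active' entries from earlier pod incarnations are unreachable, at the cost of weaker join-time checks:

using Orleans.Configuration;

// Sketch: applied inside the same siloBuilder configuration shown above.
// Skips the initial connectivity validation that produces the
// "Failed to get ping responses" error.
siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
    options.ValidateInitialConnectivity = false;
});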
After the pods restarted over and over, they finally stabilized and all started up. See the RESTARTS column below.
NAME            READY   STATUS    RESTARTS   AGE
ubs-job-dev-0   1/1     Running   4          17m
ubs-job-dev-1   1/1     Running   4          16m
ubs-job-dev-2   1/1     Running   3          16m
The log says 7 silos:
ProcessTableUpdate (called from TryUpdateMyStatusGlobalOnce) membership table: 7 silos, 3 are Active, 4 are Dead, Version=<33, 31015>. All silos: [SiloAddress=S10.18.123.246:11111:361178481 SiloName=ubs-job-dev-0 Status=Active, SiloAddress=S10.18.123.199:11111:361178519 SiloName=ubs-job-dev-1 Status=Active, SiloAddress=S10.18.117.114:11111:361178416 SiloName=ubs-job-dev-2 Status=Active, SiloAddress=S10.18.117.114:11111:361178292 SiloName=ubs-job-dev-2 Status=Dead, SiloAddress=S10.18.123.199:11111:361178366 SiloName=ubs-job-dev-1 Status=Dead, SiloAddress=S10.18.123.235:11111:361177868 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.246:11111:361178329 SiloName=ubs-job-dev-0 Status=Dead]
And this is how it looks in Consul (screenshot omitted):
There are only 3 pods in this StatefulSet, while the log says 7 silos. The SiloName is the pod name; unlike with a ReplicaSet, a pod's name in a StatefulSet does not change after a restart. It seems a pod cannot see the others on startup, so it crashes; the StatefulSet restarts the crashed pod, and the newly started pod with the same pod name is recorded as a new silo.
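Since the dead rows come from earlier incarnations of the same pods, shortening how long defunct entries linger might also help. A sketch using ClusterMembershipOptions from Orleans 3.x, with illustrative values (not recommendations) and assuming the Consul membership provider supports defunct-entry cleanup:

using System;
using Orleans.Configuration;

// Sketch: expire and clean up dead membership entries sooner, so repeatedly
// restarted pods do not accumulate defunct rows in the membership table.
siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
    options.DefunctSiloExpiration = TimeSpan.FromHours(1);
    options.DefunctSiloCleanupPeriod = TimeSpan.FromMinutes(10);
});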
Are you using K8s membership via the UseKubeMembership() extension method? It looks like in the examples above you are only using official Orleans libraries, such as Microsoft.Orleans.OrleansConsulUtils and Microsoft.Orleans.Hosting.Kubernetes. If so, you need to report this issue to the official Orleans project, i.e. https://github.com/dotnet/orleans.