Create ability to do zero downtime deployments when using externalTrafficPolicy: Local
Closed this issue · 46 comments
I am using externalTrafficPolicy set to Local for a LoadBalancer Service for an ingress controller on GKE.
Right now, when a pod gets terminated, it is immediately removed from the NodePort service, which stops traffic from routing to the pod (step 5 at https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
The problem is that the GCP Load Balancer doesn't update itself immediately, so it continues to send traffic to the NodePort even though Kubernetes has already removed the pod from the NodePort as part of the termination process. This results in timeouts and an inability to do zero downtime deployments when a node no longer has an active application residing on it when externalTrafficPolicy is set to Local.
I'd like to see an option where we can use Local, but allow for zero-downtime deployments.
I'm wondering if there could be a configurable option to wait until a preStop hook has finished (or grace period hits) before removing the pod from the NodePort service? With something like this, we could make a preStop hook that can make health checks fail but have the pod continue to serve traffic normally. The preStop hook could then sleep for a certain amount of time while the load balancers gracefully stop sending traffic because the health checks start to fail. Once the preStop hook completes, then it removes the pod from the NodePort. This would allow for graceful draining of outgoing pods.
Or maybe the answer is a pre-PreStop hook that can run before termination officially begins?
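The preStop idea above can be approximated with what Kubernetes offers today. A rough sketch (the sleep duration, probe port, and image are assumptions for illustration, not a real Kubernetes option):

```yaml
# Sketch: stop passing the LB health check first while continuing to serve,
# then sleep so the cloud LB drains the node before SIGTERM is delivered.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-controller           # hypothetical name
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120   # must exceed the preStop sleep
      containers:
      - name: controller
        image: example/ingress:latest      # placeholder image
        readinessProbe:
          httpGet: {path: /healthz, port: 10254}   # assumed probe endpoint
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              # Assumes the controller exposes a way to fail /healthz on
              # demand; otherwise a plain sleep still buys the LB time to
              # fail health checks and drain before shutdown begins.
              command: ["/bin/sh", "-c", "sleep 60"]
```

Note this only delays SIGTERM to the container; as described above, the pod is still removed from the NodePort endpoint set immediately, which is the part this issue asks to change.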
/sig network cloud-provider
/kind feature
/area provider/gcp
/triage unresolved
Comment /remove-triage unresolved when the issue is assessed and confirmed.
I am a bot run by vllry.
Having the same issue. Any graceful termination in the pod itself will not help, since the port at the node level stops accepting traffic.
Same issue here. I tried to fix it a few months ago but gave up in the end and set externalTrafficPolicy to Cluster (losing the client IP, but gaining zero-downtime deploys). The issue is that the Google LBs don't get a signal BEFORE a pod is terminated, so the node is only deregistered from the LB after the non-configurable health check timeout. I've found no way so far to remove the node from the LB before terminating the pod.
This is a problem with 2-hop load-balancing - the end-of-life handling of pods doesn't really have a way to describe "upstream" dependencies and sequencing. The endpoints controller sees the pod as terminating and immediately removes it from the set. Kube-proxy has no choice but to also remove it from the list of available backends. As you described, the upstream LB hasn't received the news yet.
For HTTP apps, if you use Ingress and VPC-Native LB (on GCP) you will bypass this second hop (kube-proxy) and the LB goes directly to the pod. During the terminationGracePeriod, the pod will be removed from the LB.
For apps that use Service, this remains a problem. I'd like this to be possible. It probably needs a KEP to cover the details, but maybe something like:
- Instead of removing a terminating endpoint, move it to NotReadyAddresses
- In kube-proxy, if a service has no viable ready endpoints, but has not-ready addresses, use those
For "local" services that would still use the terminating endpoint. Presumably, upstream LBs would be deconfigured and incoming traffic would taper off.
That doesn't seem egregiously complicated to me, but I bet there are corner cases. Here's one - NotReady covers both startup and teardown. Would we want LBs to go to not-yet-initialized backends at the beginning of life? It seems wrong but not terribly so. Here's another - it will add a lot of endpoints writes as we have to process both states.
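For reference, the split being proposed lives on the core Endpoints object, which already has a notReadyAddresses field per subset. Moving a terminating pod there might look roughly like this (service name and IPs are invented for illustration):

```yaml
# Endpoints object sketch: ready backends stay in addresses, while the
# proposal would park terminating pods in notReadyAddresses instead of
# dropping them, so kube-proxy could fall back to them if nothing is ready.
apiVersion: v1
kind: Endpoints
metadata:
  name: my-service          # hypothetical Service name
subsets:
- addresses:                # ready backends
  - ip: 10.0.1.4
  notReadyAddresses:        # today: starting pods; proposed: also terminating
  - ip: 10.0.1.5
  ports:
  - port: 80
```

As noted above, the catch is that notReadyAddresses currently conflates startup and teardown, so kube-proxy could not distinguish a draining pod from one that has never been ready.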
I don't have much to add here, agree that it would likely take a KEP to completely solve this. One may already be in the works that could help. Although this primarily has IPVS in mind, the proposed API change here could likely help solve this issue as well: kubernetes/enhancements#1607.
I am also dealing with this same issue.
The approach I had in mind was another option for externalTrafficPolicy. My thought was: what if, in addition to Local and Cluster, there was something like a LocalBeforeCluster or ClusterButPreferLocal, where the behavior would always prefer a local pod and retain the ability to preserve the client IP when the traffic is local, but as a fallback would bounce the traffic to other pods in the cluster if the local one isn't available?
With this "solution" the load balancer would never really detect a problem during normal ingress controller updates, but it would if there were a more serious problem going on at the node.
I'm sure there's some sort of gotcha in how the service routing works that might make it impossible to have it both ways, but figured I would mention this angle as well.
This is not a solution to the root cause, but I've been working on an approach that is proving to work out pretty nicely for me. I have the same exact use case as you. I want to be able to update my ingress controllers without experiencing any downtime due to load balancer health check lag.
In my case I'm using the nginx-ingress controller, and I'm using the stable/nginx-ingress helm chart to deploy it. Some key settings are:
- a DaemonSet
- a Service with type: NodePort and externalTrafficPolicy: Local
My solution is to run TWO DaemonSets. The two DaemonSets have matching selector/labels, so that they are ALL fronted by the same Service.
With this setup, all I have to do is just ensure that redeployments happen to only one DaemonSet at a time to avoid a situation where a particular node fails to have any running ingress-controller pods.
To facilitate this, what I've prototyped are some changes to the helm chart that make it possible for me to have two separate helm releases. The first helm release deploys a daemonset, service, and configmap. The second helm release deploys another daemonset, but skips the service and configmap and ensures that the daemonset's labels match those of the first release.
I've also played around with the idea of a single helm release that has the extra daemonset. However, with that approach, I think it would be more difficult to do ingress controller upgrades, since the two-release approach allows me to update the helm chart on ONE before the OTHER.
Regarding that last statement: I do think the single helm chart is viable, because a pod disruption budget that applies to both daemonsets simultaneously could be used to ensure that never more than one pod is out at a time.
Edit: I don't know if this will work the way I was hoping. It seems PodDisruptionBudgets don't work with DaemonSets the way I thought they would.
Ideally, to make my double-daemonset approach work, I would like a PDB whose selectors match both daemonsets but limit maxUnavailable to 1, forcing a redeployment to impact only one pod at a time.
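The double-DaemonSet layout described above can be sketched as follows (the names ingress-a/ingress-b and the label are hypothetical; this is a hand-written sketch, not the actual chart output):

```yaml
# Two DaemonSets share the same pod labels, so one Service fronts both.
# Roll them one at a time so every node always keeps a ready pod.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-a                 # the second DaemonSet, ingress-b,
spec:                             # is identical except for its name
  selector:
    matchLabels: {app: ingress-shared}
  template:
    metadata:
      labels: {app: ingress-shared}
    spec:
      containers:
      - name: controller
        image: example/nginx-ingress:latest   # placeholder image
---
apiVersion: v1
kind: Service
metadata:
  name: ingress
spec:
  type: NodePort
  externalTrafficPolicy: Local
  selector: {app: ingress-shared}  # matches pods from BOTH DaemonSets
  ports:
  - port: 80
    targetPort: 80
```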
/assign
Working on a proposal that also aligns with kubernetes/enhancements#1607
FYI I opened a WIP PR to get a better idea of the work involved #89780. Going to continue discussing in the KEP PR kubernetes/enhancements#1607
Agree this is super important - externalTrafficPolicy: local is broken for any reasonable current service that wants to avoid disruption, so I'm supportive of this.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
This issue also causes a problem just using HPA for ingress-nginx with externalTrafficPolicy: Local and the GCP Load Balancer. If there is a spike in traffic that causes the deployment to scale up, it means we will very often see traffic timing out once the traffic dies down and HPA scales down those pods, assuming it happened to be the last pod on a node. The healthCheckNodePort response correctly reports an error status code as soon as the pod begins terminating, but the actual port serving traffic closes immediately even though there is a long preStop delay before the pod actually terminates.
As a mitigation, I've tried running both a Deployment and DaemonSet to ensure at least one pod on each node; the problem is that the DaemonSet pod can be even more dangerous, since it can drop the connection in the middle of a long-running request if the node itself is being terminated.
Any alternative strategies would be welcome here -- the biggest concern is being able to retain the remote IP (for geolocation), and being able to potentially scale the ingress controller dramatically based on spiky traffic. So far, the only viable option seems to be to move to Google's L7 proxy, but that would mean the loss of several features we currently depend on. In the meantime, our only choice seems to be to massively overprovision the deployment and disable HPA.
I'm pretty sure this also applies to us on AWS (and maybe other providers), so should we also label per /area provider/aws?
PR for the fix is here #96371
I observe these issues using both AWS (ELB + NLB) solutions with externalTrafficPolicy: Local.
I think I and many other people are suffering from a similar problem (the LB keeps sending traffic to deleted pods), even though it is type: ClusterIP and ALB target-type: ip.
@andrewsykim Would your PR also solve the following similar issues, even when it is not externalTrafficPolicy: Local?
kubernetes/kubernetes
- Pods receive traffic from load balancer whilst in terminating state for >60s #96858
- Pod lifecycle, termination can be improved around LBs and grace period #89263
- Pods in Terminating status receive incoming requests #88236
- Connection refused during rolling upgrade of deployment #86280
- Document recommended way to not fail requests during rolling update #20473
- and more ...
kubernetes-sigs/aws-load-balancer-controller
- 400/502/504 errors while doing rollout restart or rolling update kubernetes-sigs/aws-load-balancer-controller#1065
- ALB sending requests to pods after ingress controller deregisters them leading to 504s kubernetes-sigs/aws-load-balancer-controller#1064
- 502/503 During deploys and/or pod termination kubernetes-sigs/aws-load-balancer-controller#814
- and more ...
edit: I've made this package to solve the problem that I've explained: https://github.com/foriequal0/pod-graceful-drain
It works only with aws-load-balancer-controller and target-type: ip Services for now, but I'm willing to support other load balancers.
@thockin @robscott nothing against "terminating" endpoints, provided we can get that state from the endpoints object itself, the pod gets no termination signal during that time, and at best a new pod is already scheduled.
We would not need to add preStop lifecycle hooks as we describe in https://opensource.zalando.com/skipper/kubernetes/ingress-backends/#pod-lifecycle-hooks.
As an ingress controller we run in hostNetwork, because this makes the whole thing very stable from the (cloud) load balancer's point of view. I think it's the most appropriate way to run ingress controllers anyway: LBs can work with IP:port pairs they know from their own network, without NAT.
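A minimal sketch of the hostNetwork setup described above (the DaemonSet name, image, and port are placeholders, not the actual Zalando deployment):

```yaml
# Ingress controller pods bind directly in the node's network namespace,
# so cloud LBs health-check and route to node IP:port without any NAT hop.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-hostnet           # hypothetical name
spec:
  selector:
    matchLabels: {app: ingress-hostnet}
  template:
    metadata:
      labels: {app: ingress-hostnet}
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet  # keep cluster DNS with hostNetwork
      containers:
      - name: proxy
        image: example/skipper:latest     # placeholder image
        ports:
        - containerPort: 9999
          hostPort: 9999                  # the port the LB targets directly
```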
I just want to clarify that this seems to be a change in the endpoints object and not only a change in endpointslices, as far as I understand https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1672-tracking-terminating-endpoints, which writes about endpointslice changes.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
FYI an alpha feature to fix this issue was merged for v1.22 (#97238); I would appreciate it if anyone can try it out and test it. The feature gate is called ProxyTerminatingEndpoints and you only need to enable it on kube-proxy.
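Enabling the gate might look like this in the kube-proxy configuration (a sketch; the file location and surrounding fields vary by distribution):

```yaml
# KubeProxyConfiguration fragment enabling the alpha feature gate.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  ProxyTerminatingEndpoints: true
```

Alternatively, clusters that pass flags to kube-proxy directly can use --feature-gates=ProxyTerminatingEndpoints=true.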
ClusterIP has a similar problem; it seems the PR mentioned in this issue only solves LoadBalancer or NodePort Services?
ClusterIP has a similar problem; it seems the PR mentioned in this issue only solves LoadBalancer or NodePort Services?
Yes, but there have been some discussions about extending this functionality to internal traffic (clusterIP).
Same issue here.
I just want to add a ValidatingWebhookConfiguration: the hook always returns false until we set the weight of this endpoint to zero in the LB, and then it returns true.
In this way, it's a pre-preStop hook.
ProxyTerminatingEndpoints is now supported; see https://kubernetes.io/docs/concepts/services-networking/service/#external-traffic-policy
@qixiaobo thanks for the info.
There is however some confusion between the docs related to ProxyTerminatingEndpoints and KEP-1669 (Proxy Terminating Endpoints).
Quote from the docs https://kubernetes.io/docs/concepts/services-networking/service/#external-traffic-policy:
If there are local endpoints and all of those are terminating, then the kube-proxy ignores any external traffic policy of Local. Instead, whilst the node-local endpoints remain as all terminating, the kube-proxy forwards traffic for that Service to healthy endpoints elsewhere, as if the external traffic policy were set to Cluster.
So traffic is sent to endpoints on other nodes if all local endpoints are in terminating state.
Quote from KEP-1669 (https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1669-proxy-terminating-endpoints#example-all-endpoints-terminating-on-a-node-when-traffic-policy-is-local):
When the traffic policy is "Local" and all endpoints are terminating within a single node, then traffic should be routed to any terminating endpoint that is ready on that node.
So traffic is sent to endpoints on the same node even if all local endpoints are in terminating state.
Could you please clarify this if I misunderstand it?
I'm a bit confused about this below:
When the traffic policy is "Local" and all endpoints are terminating within a single node, then traffic should be routed to any terminating endpoint that is ready on that node.
I cannot figure out what "any terminating endpoint that is ready on that node" means.
Maybe we should read the code for this question.
maybe like this?

```
NAME                     READY   STATUS        RESTARTS   AGE
nginx-58b98bb74d-xb6vj   1/1     Terminating   0          9m5s
```
I found the methods described below (this is summarized by myself; please point out any mistakes):

|               | ready | terminating |
|---------------|-------|-------------|
| isReady       | TRUE  | FALSE       |
| isServing     | TRUE  | -           |
| isTerminating | -     | TRUE        |
```go
// Pre-scan the endpoints, to figure out which type of endpoint Local
// traffic policy will use, and also to see if there are any usable
// endpoints anywhere in the cluster.
var hasLocalReadyEndpoints, hasLocalServingTerminatingEndpoints bool
for _, ep := range endpoints {
	if ep.IsReady() {
		hasAnyEndpoints = true
		if ep.GetIsLocal() {
			hasLocalReadyEndpoints = true
		}
	} else if ep.IsServing() && ep.IsTerminating() && utilfeature.DefaultFeatureGate.Enabled(features.ProxyTerminatingEndpoints) {
		hasAnyEndpoints = true
		if ep.GetIsLocal() {
			hasLocalServingTerminatingEndpoints = true
		}
	}
}

if hasLocalReadyEndpoints {
	localEndpoints = filterEndpoints(endpoints, func(ep Endpoint) bool {
		return ep.GetIsLocal() && ep.IsReady()
	})
} else if hasLocalServingTerminatingEndpoints {
	useServingTerminatingEndpoints = true
	localEndpoints = filterEndpoints(endpoints, func(ep Endpoint) bool {
		return ep.GetIsLocal() && ep.IsServing() && ep.IsTerminating()
	})
}

if !svcInfo.UsesClusterEndpoints() {
	allReachableEndpoints = localEndpoints
	return
}
```
So @saihide is right.
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase
When a Pod is being deleted, it is shown as Terminating by some kubectl commands. This Terminating status is not one of the Pod phases. A Pod is granted a term to terminate gracefully, which defaults to 30 seconds.
We must keep in mind that when a pod is in Terminating status, it may still be serving traffic.
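To make the three states above concrete: they are published as conditions on EndpointSlice endpoints. A terminating-but-still-serving endpoint might look roughly like this (the slice name and IP are invented for illustration):

```yaml
# EndpointSlice fragment: a pod that is terminating but still serving.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: ingress-abc12          # hypothetical name
addressType: IPv4
endpoints:
- addresses: ["10.0.1.5"]      # example pod IP
  conditions:
    ready: false               # terminating pods are never "ready"
    serving: true              # still passing its readiness probe
    terminating: true          # the pod's deletionTimestamp is set
ports:
- port: 80
```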
I believe we have found a solution that works for us. Please let me know if my understanding is lacking here.
We run externalTrafficPolicy: Local on EKS with alb-ingress-controller 1.1.
We have a five-minute sleep at the top of our preStop hook, with a longer terminationGracePeriodSeconds.
I set up a locust load test as an easy way to count 502s. So I'm flooding 2-3 pods (with 4 puma processes each) with 40+ requests per second.
Upon terminating a pod, or any action that incurs a rolling restart of the deployment, we see immediate 502s, for about 10-15 seconds, and some intermittently thereafter.
After adding alb.ingress.kubernetes.io/target-type: ip to the ingress (since the default type is "instance"), we saw a night-and-day difference: literally zero 502s on pod terminations.
This also has the nice side effect of much cleaner target groups and a reduced Total Targets Per Listener (which is an AWS limit that we've hit). Now, instead of every node in the k8s cluster being a target (of which 95% fail health checks due to externalTrafficPolicy: Local), only the actual pod IPs are targets.
@WillPlatnick described the issue precisely:
Right now, when a pod gets terminated, it is immediately removed from the NodePort service, which stops traffic from routing to the pod (step 5 at https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
The problem is that the GCP Load Balancer doesnβt update itself immediately, so it continues to send traffic to the NodePort even though Kubernetes has already removed the pod from the NodePort as part of the termination process. This results in timeouts and an inability to do zero downtime deployments when a node no longer has an active application residing on it when externalTrafficPolicy is set to Local.
Because weβre now sending traffic straight to the pod IP, it doesnβt matter that the podβs endpoint is immediately removed from the Serviceβs list of endpoints; The ALB can still send traffic straight to the pod. Now, we can have the desired drain behavior, where in-flight requests can continue, and the pod is removed from the ALB target list before the web server is shut down.
The downside would be that, presumably, this skips k8s traffic routing. So, if you're using IPVS to do different load-balancing strategies, I can only imagine this skips that. Luckily, we are able to set the ALB to use Least Outstanding Requests instead of Round Robin, using alb.ingress.kubernetes.io/target-group-attributes: load_balancing.algorithm.type=least_outstanding_requests.
Thank you to @eddspencer for pointing out target-type on this issue: #96858 (comment)
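The two annotations described above can be combined on a single Ingress. A sketch using the annotation names quoted in this thread (the Ingress/Service names and rule are hypothetical):

```yaml
# Route the ALB straight to pod IPs and use least-outstanding-requests,
# bypassing the NodePort hop that races with pod termination.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web                          # hypothetical name
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/target-group-attributes: load_balancing.algorithm.type=least_outstanding_requests
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web                # hypothetical backend Service
            port: {number: 80}
```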
FYI Kubernetes v1.26 promotes the ProxyTerminatingEndpoints feature to Beta, enabled by default, which should address the issues raised here. It would be greatly appreciated if folks can try the new release and help verify that you're seeing the expected behavior.
cc @ionutbalutoiu re: #114052
/close
Fixed by the ProxyTerminatingEndpoints feature
https://gist.github.com/aojea/cd72e17b7238114a35cb9c82bf2324cb
@aojea: Closing this issue.
In response to this:
/close
Fixed by the ProxyTerminatingEndpoints feature
https://gist.github.com/aojea/cd72e17b7238114a35cb9c82bf2324cb
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.