kubernetes/kubernetes

Create ability to do zero downtime deployments when using externalTrafficPolicy: Local

Closed this issue · 46 comments

I am using externalTrafficPolicy set to Local for my LoadBalancer service for an ingress controller on GKE.

Right now, when a pod gets terminated, it is immediately removed from the NodePort service, which stops traffic from routing to the pod (step 5 at https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).

The problem is that the GCP Load Balancer doesn't update itself immediately, so it continues to send traffic to the NodePort even though Kubernetes has already removed the pod from the NodePort as part of the termination process. This results in timeouts and an inability to do zero downtime deployments when a node no longer has an active application residing on it when externalTrafficPolicy is set to Local.

I'd like to see an option where we can use Local, but allow for zero-downtime deployments.

I'm wondering if there could be a configurable option to wait until a preStop hook has finished (or the grace period is hit) before removing the pod from the NodePort service? With something like this, we could make a preStop hook that causes health checks to fail while the pod continues to serve traffic normally. The preStop hook could then sleep for a certain amount of time while the load balancers gracefully stop sending traffic because the health checks start to fail. Once the preStop hook completes, the pod is removed from the NodePort. This would allow for graceful draining of outgoing pods.

Or maybe the answer is a pre-PreStop hook that can run before termination officially begins?
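To make the preStop part concrete, here is a minimal sketch (container name, image, and timings are purely illustrative). As described above, the sleep alone does not help today with externalTrafficPolicy: Local, because kube-proxy drops the endpoint as soon as termination starts; this is the pattern the proposal would build on:

spec:
  terminationGracePeriodSeconds: 120        # must be longer than the preStop sleep
  containers:
    - name: ingress-controller              # hypothetical container name
      image: example.com/ingress-controller # hypothetical image
      lifecycle:
        preStop:
          exec:
            # Keep the process running while upstream LB health checks fail and
            # traffic tapers off; SIGTERM is only sent after this completes.
            command: ["sleep", "90"]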

/sig network cloud-provider
/kind feature
/area provider/gcp

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

Having the same issue. Any graceful termination in the pod itself will not help, since the port at the node level stops accepting traffic.

Same issue here. I tried to fix it a few months ago but gave up in the end and set externalTrafficPolicy to Cluster (losing the client IP, but deploys are zero-downtime). The issue is that the Google LBs don't get a signal BEFORE a pod is terminated, so the node is only deregistered from the LB after the non-configurable health check timeout. I have found no way so far to remove the node from the LB before terminating the pod.

This is a problem with 2-hop load-balancing - the end-of-life handling of pods doesn't really have a way to describe "upstream" dependencies and sequencing. The endpoints controller sees the pod as terminating and immediately removes it from the set. Kube-proxy has no choice but to also remove it from the list of available backends. As you described, the upstream LB hasn't received the news yet.

For HTTP apps, if you use Ingress and a VPC-native LB (on GCP), you bypass this second hop (kube-proxy) and the LB goes directly to the pod. During the terminationGracePeriod, the pod will be removed from the LB.

For apps that use Service, this remains a problem. I'd like this to be possible. It probably needs a KEP to cover the details, but maybe something like:

  • Instead of removing a terminating endpoint, move it to NotReadyAddresses
  • In kube-proxy, if a service has no viable ready endpoints, but has not-ready addresses, use those

For "local" services that would still use the terminating endpoint. Presumably, upstream LBs would be deconfigured and incoming traffic would taper off.

That doesn't seem egregiously complicated to me, but I bet there are corner cases. Here's one - NotReady covers both startup and teardown. Would we want LBs to go to not-yet-initialized backends at the beginning of life? It seems wrong but not terribly so. Here's another - it will add a lot of endpoints writes as we have to process both states.

@freehan @robscott

I don't have much to add here, agree that it would likely take a KEP to completely solve this. One may already be in the works that could help. Although this primarily has IPVS in mind, the proposed API change here could likely help solve this issue as well: kubernetes/enhancements#1607.

I am also dealing with this same issue.

The approach I had in mind was another option for externalTrafficPolicy. My thought was: what if, in addition to Local and Cluster, there was something like LocalBeforeCluster or ClusterButPreferLocal, where the behavior would always prefer a local pod and retain the ability to preserve the client IP when the traffic is local, but as a fallback would bounce the traffic to other pods in the cluster if no local one is available.

With this "solution" the load balancer would never really detect a problem during normal ingress controller updates, but it would if there were a more serious problem going on at the node.

I'm sure there's some sort of gotcha in how the service routing works that might make it impossible to have it both ways, but figured I would mention this angle as well.

@WillPlatnick

This is not a solution to the root cause, but I've been working on an approach that is proving to work out pretty nicely for me. I have the same exact use case as you. I want to be able to update my ingress controllers without experiencing any downtime due to load balancer health check lag.

In my case I'm using the nginx-ingress controller, deployed with the stable/nginx-ingress helm chart. Some key settings (the resulting Service is sketched after the list):

  • DaemonSet
  • Service with type: NodePort and externalTrafficPolicy: Local
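To make those settings concrete, the Service ends up looking roughly like this (a sketch; names and ports are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress              # illustrative name
spec:
  type: NodePort
  externalTrafficPolicy: Local     # preserve client IP, route only to pods on this node
  selector:
    app: nginx-ingress             # shared by the DaemonSet pods (and by both DaemonSets below)
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443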

My solution is to run TWO DaemonSets. The two DaemonSets have matching selector/labels, so that they are ALL fronted by the same Service.

With this setup, all I have to do is just ensure that redeployments happen to only one DaemonSet at a time to avoid a situation where a particular node fails to have any running ingress-controller pods.

To facilitate this, what I've prototyped are some changes to the helm chart that make it possible to have two separate helm releases. The first release deploys a daemonset, service, and configmap. The second release deploys another daemonset, but skips the service and configmap and ensures that its daemonset's labels match those of the first release.

I've also played around with the idea of a single helm release that has the extra daemonset. However, with that approach I think it would be more difficult to do ingress controller upgrades, since the two-release approach allows me to update the helm chart on ONE before the OTHER.

Regarding that last statement I made above: I do think the single helm chart is viable, because I think a pod disruption budget that applies to both daemonsets simultaneously could be used to ensure that no more than one pod is ever out at a time.

Edit: I don't know if this will work the way I was hoping. Seems like PodDisruptionBudgets don't work with DaemonSets the way I thought they would.

Ideally, to make my double-daemonset approach work, I would like to have a PDB whose selectors match both daemonsets but limit the max unavailable to 1, forcing a redeployment to only impact one pod at a time.
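A minimal sketch of that PDB, assuming both DaemonSets label their pods app: nginx-ingress (the label is illustrative). One caveat: DaemonSet rolling updates delete pods directly rather than going through the eviction API, so a PDB like this only guards node drains and evictions, which matches the edit above:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-ingress-pdb          # illustrative name
spec:
  maxUnavailable: 1                # never take out more than one ingress pod at a time
  selector:
    matchLabels:
      app: nginx-ingress           # matches pods from BOTH DaemonSets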

/assign

Working on a proposal that also aligns with kubernetes/enhancements#1607

FYI I opened a WIP PR to get a better idea of the work involved #89780. Going to continue discussing in the KEP PR kubernetes/enhancements#1607

Agree this is super important - externalTrafficPolicy: local is broken for any reasonable current service that wants to avoid disruption, so I'm supportive of this.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

This issue also causes a problem when just using HPA for ingress-nginx with externalTrafficPolicy: Local and the GCP Load Balancer. If a spike in traffic causes the deployment to scale up, we very often see traffic timing out once the traffic dies down and HPA scales those pods back down, if one of them happened to be the last pod on its node. The healthCheckNodePort response correctly reports an error status code as soon as the pod begins terminating, but the actual port serving traffic closes immediately, even though there is a long preStop delay before the pod actually terminates.

As a mitigation, I've tried running both a Deployment and DaemonSet to ensure at least one pod on each node; the problem is that the DaemonSet pod can be even more dangerous, since it can drop the connection in the middle of a long-running request if the node itself is being terminated.

Any alternative strategies would be welcome here -- the biggest concern is being able to retain the remote IP (for geolocation), and being able to potentially scale the ingress controller dramatically based on spiky traffic. So far, the only viable option seems to be to move to Google's L7 proxy, but that would mean the loss of several features we currently depend on. In the meantime, our only choice seems to be to massively overprovision the deployment and disable HPA.

I'm pretty sure this also applies to us on AWS (and maybe other providers) so should we also label per /area provider/aws ?

PR for the fix is here #96371

I'm pretty sure this also applies to us on AWS (and maybe other providers) so should we also label per /area provider/aws ?

I observe these issues using both AWS (ELB + NLB) solutions with externalTrafficPolicy: Local.

I think I, and many others, are suffering from a similar problem (the LB keeps sending traffic to deleted pods), even though the service is type: ClusterIP and the ALB target-type is ip.

@andrewsykim Would your PR also solve similar issues when it is not externalTrafficPolicy: Local?

edit: I've made this package to solve the problem that I've explained: https://github.com/foriequal0/pod-graceful-drain
It works only with the aws-load-balancer-controller and target-type: ip Services for now, but I'm willing to support other load balancers.

@thockin @robscott Nothing against "terminating" endpoints, as long as we can get that state from the endpoints object itself, the pod receives no termination signal during that window, and ideally a new pod has already been scheduled.
We would not need to add preStop lifecycle hooks as described in https://opensource.zalando.com/skipper/kubernetes/ingress-backends/#pod-lifecycle-hooks.
As an ingress controller we run in hostNetwork, because this makes the whole thing very stable from the (cloud) load balancer's point of view. I think it's the most appropriate way to run ingress controllers anyway: LBs can work with an IP:port they know from their own network, without NAT.
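For anyone wanting to replicate that, host networking is just a pod-spec setting; a minimal fragment (names and port are illustrative), with dnsPolicy adjusted so cluster DNS keeps working for host-network pods:

spec:
  hostNetwork: true                       # pod binds directly on the node's interfaces
  dnsPolicy: ClusterFirstWithHostNet      # keep resolving cluster services from the host network
  containers:
    - name: ingress-proxy                 # hypothetical container name
      image: example.com/ingress-proxy    # hypothetical image
      ports:
        - containerPort: 9999             # illustrative port, reachable as nodeIP:9999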

I just want to clarify that this seems to be a change to the Endpoints object, and not only to EndpointSlices, as far as I understand https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1672-tracking-terminating-endpoints, which describes EndpointSlice changes.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

/remove-lifecycle stale

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

FYI an alpha feature to fix this issue was merged for v1.22 (#97238); I would appreciate it if anyone can try it out and test it. The feature gate is called ProxyTerminatingEndpoints and you only need to enable it on kube-proxy.
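For anyone who wants to try it: the gate can be enabled with the kube-proxy flag --feature-gates=ProxyTerminatingEndpoints=true, or in the kube-proxy configuration file, roughly:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  ProxyTerminatingEndpoints: true   # alpha in v1.22, opt-in on kube-proxy only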

ClusterIP has a similar problem; it seems the PR mentioned in this issue only solves LoadBalancer and NodePort Services?

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

rdxmb commented

/remove-lifecycle stale

ClusterIP has a similar problem; it seems the PR mentioned in this issue only solves LoadBalancer and NodePort Services?

Yes, but there have been some discussions about extending this functionality to internal traffic (clusterIP).


Same issue here.
I just want to add a ValidatingWebhookConfiguration.
The hook would always return false until we set the weight of this endpoint in the LB to zero, after which it returns true.
In this way, it's a pre-preStop hook.
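If I follow the idea, that is a webhook registered on pod DELETE operations, roughly like the sketch below (the webhook service, names, and the weight-zero check are hypothetical; this is not an existing controller):

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: lb-drain-gate                  # hypothetical name
webhooks:
  - name: drain.gate.example.com       # hypothetical webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore              # don't block deletions if the webhook is down
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: kube-system         # hypothetical location of the webhook server
        name: lb-drain-gate
        path: /validate
# The webhook server would deny the DELETE until the endpoint's weight in the LB
# is zero, then allow it -- effectively the "pre-preStop" hook described above.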

r0bj commented

@qixiaobo thanks for info.
There is however some confusion between docs related to ProxyTerminatingEndpoints and KEP-1669 (Proxy Terminating Endpoints).
Quote from the docs https://kubernetes.io/docs/concepts/services-networking/service/#external-traffic-policy:

If there are local endpoints and all of those are terminating, then the kube-proxy ignores any external traffic policy of Local. Instead, whilst the node-local endpoints remain as all terminating, the kube-proxy forwards traffic for that Service to healthy endpoints elsewhere, as if the external traffic policy were set to Cluster.

So traffic is sent to endpoints on other nodes if all local endpoints are in a terminating state.

Quote from KEP-1669 (https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1669-proxy-terminating-endpoints#example-all-endpoints-terminating-on-a-node-when-traffic-policy-is-local):

When the traffic policy is "Local" and all endpoints are terminating within a single node, then traffic should be routed to any terminating endpoint that is ready on that node.

So traffic is sent to endpoints on the same node even if all local endpoints are in a terminating state.

Could you please clarify this if I misunderstand it?

@qixiaobo thanks for info. There is however some confusion between docs related to ProxyTerminatingEndpoints and KEP-1669 (Proxy Terminating Endpoints). Quote from the docs https://kubernetes.io/docs/concepts/services-networking/service/#external-traffic-policy:

If there are local endpoints and all of those are terminating, then the kube-proxy ignores any external traffic policy of Local. Instead, whilst the node-local endpoints remain as all terminating, the kube-proxy forwards traffic for that Service to healthy endpoints elsewhere, as if the external traffic policy were set to Cluster.

So traffic is sent to endpoints on other nodes if all local endpoints are in a terminating state.

Quote from KEP-1669 (https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1669-proxy-terminating-endpoints#example-all-endpoints-terminating-on-a-node-when-traffic-policy-is-local):

When the traffic policy is "Local" and all endpoints are terminating within a single node, then traffic should be routed to any terminating endpoint that is ready on that node.

So traffic is sent to endpoints on the same node even if all local endpoints are in a terminating state.

Could you please clarify this if I misunderstand it?

I'm a bit confused about this part:

When the traffic policy is "Local" and all endpoints are terminating within a single node, then traffic should be routed to any terminating endpoint that is ready on that node.

I cannot figure out what "any terminating endpoint that is ready on that node" means.
Maybe we should read the code to answer this question.


I'm a bit confused about this part:

When the traffic policy is "Local" and all endpoints are terminating within a single node, then traffic should be routed to any terminating endpoint that is ready on that node.

I cannot figure out what "any terminating endpoint that is ready on that node" means.
Maybe we should read the code to answer this question.

maybe like this?

NAME                       READY   STATUS                RESTARTS   AGE
nginx-58b98bb74d-xb6vj     1/1     Terminating            0          9m5s

@r0bj

I found the methods described below (this is my own summary; please point out any mistakes):

                 ready   terminating
  isReady        TRUE    FALSE
  isServing      TRUE    -
  isTerminating  -       TRUE

https://github.com/kubernetes/kubernetes/blame/699aeb735ff34d81d79817bf613203ac58e6edc3/pkg/proxy/topology.go#L26

// Pre-scan the endpoints, to figure out which type of endpoint Local
// traffic policy will use, and also to see if there are any usable
// endpoints anywhere in the cluster.
var hasLocalReadyEndpoints, hasLocalServingTerminatingEndpoints bool
for _, ep := range endpoints {
    if ep.IsReady() {
        hasAnyEndpoints = true
        if ep.GetIsLocal() {
            hasLocalReadyEndpoints = true
        }
    } else if ep.IsServing() && ep.IsTerminating() && utilfeature.DefaultFeatureGate.Enabled(features.ProxyTerminatingEndpoints) {
        hasAnyEndpoints = true
        if ep.GetIsLocal() {
            hasLocalServingTerminatingEndpoints = true
        }
    }
}

if hasLocalReadyEndpoints {
    localEndpoints = filterEndpoints(endpoints, func(ep Endpoint) bool {
        return ep.GetIsLocal() && ep.IsReady()
    })
} else if hasLocalServingTerminatingEndpoints {
    useServingTerminatingEndpoints = true
    localEndpoints = filterEndpoints(endpoints, func(ep Endpoint) bool {
        return ep.GetIsLocal() && ep.IsServing() && ep.IsTerminating()
    })
}

if !svcInfo.UsesClusterEndpoints() {
    allReachableEndpoints = localEndpoints
    return
}
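Those three methods map onto the conditions carried on EndpointSlice endpoints. Purely as an illustration (names and addresses are hypothetical), a pod that is terminating but still passing its readiness probe is published roughly like this:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: nginx-abc12                        # hypothetical name
  labels:
    kubernetes.io/service-name: nginx      # owning Service
addressType: IPv4
ports:
  - name: http
    port: 80
    protocol: TCP
endpoints:
  - addresses: ["10.0.1.5"]                # hypothetical pod IP
    nodeName: node-a                       # hypothetical node
    conditions:
      ready: false        # terminating endpoints are never reported ready
      serving: true       # still passing its readiness probe
      terminating: true   # deletionTimestamp is set on the pod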

So @saihide is right.
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase

When a Pod is being deleted, it is shown as Terminating by some kubectl commands. This Terminating status is not one of the Pod phases. A Pod is granted a term to terminate gracefully, which defaults to 30 seconds.
We must keep in mind that when a pod is in Terminating status, it may still be in service.

I believe we have found a solution that works for us. Please let me know if my understanding is lacking here.

We run externalTrafficPolicy: Local on EKS with alb-ingress-controller 1.1.
We have a five minute sleep at the top of our preStop hook, with a longer terminationGracePeriodSeconds.

I set up a locust load test as an easy way to count 502s. So I'm flooding 2-3 pods (with 4 puma processes each) with 40+ requests per second.

Upon terminating a pod, or any action that incurs a rolling restart of the deployment, we see immediate 502s for about 10-15 seconds, and some intermittently thereafter.

After adding alb.ingress.kubernetes.io/target-type: ip to the ingress (since the default type is "instance"), we saw a night and day difference - literally zero 502s on pod terminations.

This also has the nice side effect of much cleaner target groups, and a reduced Total Targets Per Listener (which is an AWS limit that we've hit). Now, instead of every node in the k8s cluster being a target (of which 95% fail health checks due to externalTrafficPolicy: Local), only the actual pod IPs are targets.

@WillPlatnick described the issue precisely:

Right now, when a pod gets terminated, it is immediately removed from the NodePort service, which stops traffic from routing to the pod (step 5 at https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
The problem is that the GCP Load Balancer doesn’t update itself immediately, so it continues to send traffic to the NodePort even though Kubernetes has already removed the pod from the NodePort as part of the termination process. This results in timeouts and an inability to do zero downtime deployments when a node no longer has an active application residing on it when externalTrafficPolicy is set to Local.

Because we’re now sending traffic straight to the pod IP, it doesn’t matter that the pod’s endpoint is immediately removed from the Service’s list of endpoints; the ALB can still send traffic straight to the pod. Now, we can have the desired drain behavior, where in-flight requests can continue, and the pod is removed from the ALB target list before the web server is shut down.

The downside would be that, presumably, this skips k8s traffic routing. So, if you're using IPVS to do different load balancing strategies, I can only imagine this skips that too. Luckily, we are able to set the ALB to use Least Outstanding Requests instead of Round Robin, using alb.ingress.kubernetes.io/target-group-attributes: load_balancing.algorithm.type=least_outstanding_requests.
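For reference, the two annotations end up together on the Ingress metadata, roughly like this fragment:

metadata:
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip                 # register pod IPs instead of node instances
    alb.ingress.kubernetes.io/target-group-attributes: load_balancing.algorithm.type=least_outstanding_requests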

Thank you to @eddspencer for pointing out target-type on this issue: #96858 (comment)

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

FYI: Kubernetes v1.26 promotes the ProxyTerminatingEndpoints feature to Beta, enabled by default, which should address the issues raised here. It would be greatly appreciated if folks could try the new release and help verify that you're seeing the expected behavior.

aojea commented

/close

Fixed by the ProxyTerminatingEndpoints feature

https://gist.github.com/aojea/cd72e17b7238114a35cb9c82bf2324cb

@aojea: Closing this issue.

In response to this:

/close

Fixed by the ProxyTerminatingEndpoints feature

https://gist.github.com/aojea/cd72e17b7238114a35cb9c82bf2324cb

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.