vertical-pod-autoscaler: rate limit pod evictions
milesbxf opened this issue · 8 comments
We've been using the VPA for a couple of weeks, and it's generally working really well 👍 Unfortunately, on a couple of occasions it has worked a bit too effectively and created a ton of pod evictions in a short space of time. We run a fairly large cluster, and on a couple of occasions it has evicted about 2-3k pods in the space of 15-20 minutes.
This large pod churn puts a huge strain on the apiservers and service discovery mechanisms. We're scaling these up, but we also think there should be a way of restricting how many pods the vpa-updater can kill in a given period of time.
We could try using the --kube-api-qps flag for this, but the overall API QPS generally stays low (4-5 RPS), so that's not an ideal lever. I was thinking of introducing a new CLI flag on the updater, something like --max-pod-evictions-per-sec, which would explicitly rate-limit the number of pod evictions the updater can perform.
I'm very happy to do the work myself on this; I just thought I'd start a conversation on how this might work best 🙂
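For illustration, here's a minimal sketch of what that wiring could look like, using golang.org/x/time/rate as a token bucket; the flag name and default are just the proposal above, not an existing vpa-updater flag:

```go
package main

import (
	"flag"
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// Hypothetical flag mirroring the proposal above; not an existing vpa-updater flag.
	maxPodEvictionsPerSec := flag.Float64("max-pod-evictions-per-sec", 10,
		"maximum number of pod evictions the updater may issue per second")
	flag.Parse()

	// Token bucket that refills at the configured rate; a burst of 1 spreads
	// evictions out evenly instead of allowing spikes.
	limiter := rate.NewLimiter(rate.Limit(*maxPodEvictionsPerSec), 1)
	fmt.Printf("eviction limiter configured at %v evictions/sec\n", limiter.Limit())
}
```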
Thanks for reaching out! This is definitely an important topic.
I have a couple of clarifying questions.
Do you have pod disruption budgets defined for these pods?
Are those pods part of one deployment or several different deployments?
Do you know what is causing the updates to happen? i.e. what happens in the cluster that suddenly so many pods have their requests far from the recommended values?
Hey! Thanks for the quick response 🙂
> Do you have pod disruption budgets defined for these pods?
Yes, we do. Let me clarify a bit further - we use Envoy as a service mesh, and have a lightweight control plane which follows K8s Endpoint updates to determine where to route traffic. The pod churn hugely increased the rate of endpoint updates, which overloaded the API server (in one instance) and the control plane (in another instance), which caused Envoy to route requests to stale pod IPs. We've scaled up both the API server and control plane to deal with this. The PDBs seemed to work well, and we didn't see any loss of availability due to too many pods being down.
> Are those pods part of one deployment or several different deployments?
Different deployments (and matching VPA objects) - around 1000 or so
> Do you know what is causing the updates to happen? i.e. what happens in the cluster that suddenly so many pods have their requests far from the recommended values?
In the first instance, it was when we initially deployed and turned on the VPA (we should have done more of a staged rollout 😅). We had another instance last night where the vpa-updater got redeployed and it churned through a ton of pods - we're still looking into that; it might potentially be a bug (if so, I'll raise a separate issue).
Thanks for clarifying!
@kgolab @schylek for visibility
Looking forward to anything you can find on those updates after the vpa-updater redeployment - this does sound like a potential bug.
As for the initial deployment case, it never hurts to do a staged deployment, but it could still be handled better on the VPA side.
Side note: 1000 deployments is on the bigger side for VPA (as this is still beta); there haven't been any extensive tests of how it works at that scale (we've been successfully testing up to 200 deployments so far, though I have talked to multiple people running more than 200 successfully). Any insights on running VPA at that scale are more than welcome :)
So to summarize: this is not an issue with pod downtime itself, since PDBs protect the deployments from that; rather, the pod churn added by the VPA updater causes the API server and the Envoy control plane to misbehave, mostly due to multiple updates to k8s Endpoints.
@wojtek-t from Kubernetes scalability perspective, are there any mechanisms we could use to limit the churn on API server caused by VPA recreating the pods? And secondly, there is work for better endpoints scalability AFAIR. Is it relevant here?
As for the Envoy control plane, not sure what scalability guarantees/implications it has. Do you know if there is a way to look this up?
> @wojtek-t from Kubernetes scalability perspective, are there any mechanisms we could use to limit the churn on API server caused by VPA recreating the pods?
Other than qps-limits on client side, nothing else exists now.
You can try to build some wrapper around client to do something if you need, but I'm not aware of any building blocks here.
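To make the wrapper idea concrete, here's a rough sketch of what a rate-limited eviction wrapper could look like; the evictFunc signature is hypothetical and just stands in for whatever call actually issues the eviction:

```go
package eviction

import (
	"context"

	"golang.org/x/time/rate"
)

// evictFunc stands in for whatever call actually issues the eviction;
// the signature here is hypothetical.
type evictFunc func(ctx context.Context, namespace, name string) error

// rateLimitedEvictor wraps an eviction call with a client-side token bucket,
// so callers don't need to know about the limiter at all.
type rateLimitedEvictor struct {
	evict   evictFunc
	limiter *rate.Limiter
}

func newRateLimitedEvictor(evict evictFunc, evictionsPerSec float64, burst int) *rateLimitedEvictor {
	return &rateLimitedEvictor{
		evict:   evict,
		limiter: rate.NewLimiter(rate.Limit(evictionsPerSec), burst),
	}
}

// Evict blocks until the limiter grants a token, then delegates to the
// wrapped eviction call; it returns early if the context is cancelled.
func (e *rateLimitedEvictor) Evict(ctx context.Context, namespace, name string) error {
	if err := e.limiter.Wait(ctx); err != nil {
		return err
	}
	return e.evict(ctx, namespace, name)
}
```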
> And secondly, there is work for better endpoints scalability AFAIR. Is it relevant here?
It seems related. Basically, these two open KEPs look relevant:
kubernetes/enhancements#1086
kubernetes/enhancements#924
There are also a couple of more local improvements. The churn on Endpoints objects is something we're actively looking into.
> Looking forward to anything you can find on those updates after the vpa-updater redeployment - this does sound like a potential bug.
> As for the initial deployment case, it never hurts to do a staged deployment, but it could still be handled better on the VPA side.
Thanks so much for your helpful response!
> Side note: 1000 deployments is on the bigger side for VPA (as this is still beta); there haven't been any extensive tests of how it works at that scale (we've been successfully testing up to 200 deployments so far, though I have talked to multiple people running more than 200 successfully). Any insights on running VPA at that scale are more than welcome :)
Aha - that's good to know!
Other than this issue, we've had a very good experience with it, and it seems to scale quite nicely. Altogether it's doing around 80 requests/sec to the API server in normal operation (so it adds a bit of load there). The recommender is consistently using under 200 millicores and just over 1GiB of memory with a recommendation execution latency of about 7 seconds on average, and the updater is under 100 millicores/750MiB with an execution latency of 1-2 seconds 🎉
> So to summarize: this is not an issue with pod downtime itself, since PDBs protect the deployments from that; rather, the pod churn added by the VPA updater causes the API server and the Envoy control plane to misbehave, mostly due to multiple updates to k8s Endpoints.
Correct 👍
> @wojtek-t from Kubernetes scalability perspective, are there any mechanisms we could use to limit the churn on API server caused by VPA recreating the pods? And secondly, there is work for better endpoints scalability AFAIR. Is it relevant here?
> As for the Envoy control plane, not sure what scalability guarantees/implications it has. Do you know if there is a way to look this up?
We scaled up the API servers which definitely seems to have helped (endpoint update p99 is consistently below 15ms), and we're also scaling up our Envoy control plane, which should also move things in the right direction.
👋 @bskiba
I might have a go at doing this this week - broadly speaking, I'm thinking of a client-side rate-limiting implementation. Are there any preexisting K8s Go rate-limiting implementations that you know of that I could make use of?
The Kubernetes client itself has some rate-limiting, though I don't know how useful you will find it: https://github.com/kubernetes/client-go/blob/master/util/flowcontrol/throttle.
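For reference, the token-bucket limiter in that package can be used roughly like this; the numbers here are arbitrary, and whether it fits the updater is a separate question:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// Allow 5 operations per second with a burst capacity of 10.
	limiter := flowcontrol.NewTokenBucketRateLimiter(5, 10)
	defer limiter.Stop()

	for i := 0; i < 20; i++ {
		// Accept blocks until a token is available; TryAccept is the
		// non-blocking variant if you would rather skip than wait.
		limiter.Accept()
		fmt.Printf("would evict pod %d at %s\n", i, time.Now().Format(time.StampMilli))
	}
}
```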
Hey! 👋 I gave it a go and added a rate limiter to the VPA updater. It uses the golang rate package, which was already a dependency.
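For anyone following along, and assuming "golang rate" refers to golang.org/x/time/rate, gating the eviction loop looks roughly like this (the pod list and eviction call are stand-ins, not the updater's actual code):

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	ctx := context.Background()

	// e.g. at most 10 evictions per second, with no extra burst.
	evictionLimiter := rate.NewLimiter(rate.Limit(10), 1)

	// Stand-in for the pods the updater has selected for eviction.
	pods := []string{"pod-a", "pod-b", "pod-c"}

	for _, pod := range pods {
		// Wait blocks until the token bucket allows another eviction,
		// or returns an error if the context is cancelled.
		if err := evictionLimiter.Wait(ctx); err != nil {
			fmt.Println("stopping eviction loop:", err)
			return
		}
		fmt.Println("evicting", pod) // stand-in for the actual eviction call
	}
}
```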