Consider adding a 2nd cluster-autoscaler with a small footprint for the operator node group only
RobertLucian opened this issue · 0 comments
Description
The cluster-autoscaler can use a lot of memory when there are many nodes to keep track of and/or many nodes that need to be added. The cluster-autoscaler has been observed to use up to 1.4GiB of memory (our current limit is set to 1GiB).
If the autoscaler gets evicted because it exceeds its memory limit, there is no way to scale up the operator node group for the autoscaler's own subsequent pending pod, since there is no autoscaler left to do the job.

The suggestion is to run a second cluster autoscaler with a minimal resource footprint that is only responsible for scaling the operator node group, as sketched below. Because this autoscaler only watches a single node group (which can grow to at most 25 nodes) and because we would set its node-addition rate limit to a small value (e.g. 1-2 nodes/min), its resource utilization stays bounded. This autoscaler scales up the operator node group if the primary cluster autoscaler gets evicted and its replacement pod is left pending.
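A minimal sketch of what the secondary deployment could look like, assuming an EKS cluster where the operator node group's ASG is named `cortex-operator-asg` (a placeholder, as are the namespace, image tag, and resource numbers):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler-operator   # secondary autoscaler, operator node group only
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler-operator
  template:
    metadata:
      labels:
        app: cluster-autoscaler-operator
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0   # placeholder tag
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            # watch only the operator node group's ASG (placeholder name)
            - --nodes=1:25:cortex-operator-asg
            - --scale-down-enabled=true
            - --scan-interval=60s
          resources:
            requests:
              cpu: 20m
              memory: 100Mi
            limits:
              memory: 200Mi   # small footprint since it tracks at most 25 nodes
```

The node-addition rate limit mentioned above would also need to be configured on this deployment; the exact flag depends on the autoscaler version/fork in use, so it is not shown here.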
This cluster autoscaler deployment should also have a higher priority than anything else on the operator node group, so that it always gets scheduled and, in turn, every other Cortex pod receives a node:
https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
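This could be done by setting `priorityClassName` in the pod spec above, either to the built-in `system-cluster-critical` class (as described in the linked doc) or to a dedicated class like the hypothetical one below; the value 1000000 is an arbitrary placeholder that just needs to exceed the priority of the other workloads on the operator node group:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: cluster-autoscaler-operator-priority   # hypothetical class name
value: 1000000        # must be higher than any other workload on the operator node group
globalDefault: false
description: "Priority for the secondary cluster-autoscaler watching the operator node group"
---
# referenced from the secondary autoscaler's pod template:
# spec:
#   template:
#     spec:
#       priorityClassName: cluster-autoscaler-operator-priority
```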