gardener/machine-controller-manager

Reduce load on kube-apiserver / ETCD on pod eviction for terminating machines

MartinWeindel opened this issue · 4 comments

How to categorize this issue?

/area robustness
/kind bug
/priority 3

What happened:

While rolling the nodes of a large cluster (about 100 nodes), max surge was limited to 10%. Node termination had to wait for workload to finish, so the roll took more than 12 hours. During this time there were about 10 machines in state Terminating at any moment. The machine-controller-manager tried to evict pods on these nodes at a high frequency and produced a lot of load on the kube-apiservers and the ETCD.

[image: outgoing client traffic of the ETCD]

The image shows the outgoing client traffic of the ETCD. Between 15:20 and 15:30 we scaled down the machine-controller-manager, and the traffic dropped immediately.

In the logs of the machine-controller-manager we found lots of throttling messages. Here I show only the eviction call for a single pod (name redacted):

I0412 15:29:36.845890       1 request.go:591] Throttling request took 542.507996ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:29:48.512932       1 request.go:591] Throttling request took 290.637231ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:29:59.562728       1 request.go:591] Throttling request took 338.986248ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:30:09.012570       1 request.go:591] Throttling request took 419.676173ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:30:19.270585       1 request.go:591] Throttling request took 355.943286ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:30:40.369073       1 request.go:591] Throttling request took 251.448724ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:30:51.019605       1 request.go:591] Throttling request took 329.512125ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:00.169268       1 request.go:591] Throttling request took 596.340432ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:05.519530       1 request.go:591] Throttling request took 181.186876ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:14.868701       1 request.go:591] Throttling request took 545.271603ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:23.119681       1 request.go:591] Throttling request took 388.854376ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:31.669300       1 request.go:591] Throttling request took 246.905561ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:40.369208       1 request.go:591] Throttling request took 374.552037ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:44.791522       1 request.go:591] Throttling request took 93.770177ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction
I0412 15:31:59.091661       1 request.go:591] Throttling request took 495.377239ms, request: POST:https://kube-apiserver/api/v1/namespaces/ws-xxxxx/pods/xxxx-deployment-78864894dd-2xdfd/eviction

The eviction request is repeated every 5 to 10 seconds.
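
Each of these log lines corresponds to one POST against the pod's eviction subresource. For reference, a minimal client-go sketch of such a call could look like the following (shown with the policy/v1 API; this is a generic illustration, not the actual MCM drain code, and the function name is made up):

```go
package drain

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPod issues a single eviction request for the given pod, i.e. a
// POST to /api/v1/namespaces/<ns>/pods/<name>/eviction as seen in the logs.
func evictPod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: namespace,
			Name:      name,
		},
	}
	return client.PolicyV1().Evictions(namespace).Evict(ctx, eviction)
}
```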

What you expected to happen:

The machine controller manager should reduce the frequency of pod eviction calls if the termination of a machine takes a long time.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Internal reference: see live issue #1570

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

/cc @kon-angelo

One approach is to introduce a backoff in the rate of pod eviction requests based on the amount of throttling observed for the request. The backoff can be capped so that we don't go over 1min between two consecutive eviction requests for a pod. This way the size of the cluster is not a factor we would have to consider, which makes sense because the cluster size and the number of pods are not necessarily related.
Currently we lack data that would help us find the best value for the maximum interval between two requests, so keeping it at 1min is a reasonable start.
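
A minimal sketch of such a capped per-pod backoff (all names and the concrete intervals are illustrative; instead of measuring the throttling directly, this variant simply grows the interval on every attempt):

```go
package drain

import (
	"sync"
	"time"
)

// Illustrative values only: start at the current per-pod retry interval and
// never wait longer than 1min between two eviction attempts for the same pod.
const (
	initialEvictionInterval = 20 * time.Second
	maxEvictionInterval     = 1 * time.Minute
)

// evictionBackoff keeps a per-pod retry interval that grows on every attempt
// and is capped, so long-running drains stop hammering the kube-apiserver.
type evictionBackoff struct {
	mu        sync.Mutex
	intervals map[string]time.Duration // key: "<namespace>/<name>"
}

func newEvictionBackoff() *evictionBackoff {
	return &evictionBackoff{intervals: map[string]time.Duration{}}
}

// next returns the wait time before the next eviction attempt for the pod and
// doubles the stored interval, up to maxEvictionInterval.
func (b *evictionBackoff) next(podKey string) time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	wait, ok := b.intervals[podKey]
	if !ok {
		wait = initialEvictionInterval
	}
	doubled := wait * 2
	if doubled > maxEvictionInterval {
		doubled = maxEvictionInterval
	}
	b.intervals[podKey] = doubled
	return wait
}

// forget drops the state once the pod is evicted or gone.
func (b *evictionBackoff) forget(podKey string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.intervals, podKey)
}
```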

After discussion, we found out that the live issue was occurring because the PDBs for the pods on the draining node were misconfigured. Currently, in the misconfigured case we don't attemptEvict again but return an error that propagates up to drainNode(), and we then do a ShortRetry. This leads to drainNode() being called very frequently and therefore to high load.
Generally we keep doing attemptEvict until the drainTimeout is reached for a few pods on the node and we don't retry that often, but this was a corner case.

We would also want to deal with the case of multiple PDBs matching a single pod, as that is also a kind of misconfiguration.
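
A hedged sketch of how the "multiple PDBs for one pod" case could be detected with a client-go PDB lister (the function name and the lister wiring are assumptions, not existing MCM code; shown with the policy/v1 types):

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	policyv1listers "k8s.io/client-go/listers/policy/v1"
)

// pdbsMatchingPod lists the PodDisruptionBudgets in the pod's namespace whose
// selector matches the pod's labels. More than one entry in the result
// indicates the misconfiguration discussed above (several PDBs covering the
// same pod).
func pdbsMatchingPod(pod *corev1.Pod, lister policyv1listers.PodDisruptionBudgetLister) ([]*policyv1.PodDisruptionBudget, error) {
	pdbs, err := lister.PodDisruptionBudgets(pod.Namespace).List(labels.Everything())
	if err != nil {
		return nil, err
	}
	var matching []*policyv1.PodDisruptionBudget
	for _, pdb := range pdbs {
		selector, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
		if err != nil {
			// An unparsable selector is itself a misconfiguration; skip it here.
			continue
		}
		if selector.Matches(labels.Set(pod.Labels)) {
			matching = append(matching, pdb)
		}
	}
	return matching, nil
}
```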

Proposed solution:

  • in case of a misconfigured PDB or multiple PDBs for a single pod, return with a medium retry (i.e. 3min), as sketched below
  • in other cases, keep attempting eviction with the podEvictionRetryInterval (20sec) like we do right now
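
A rough sketch of the proposed retry selection (the constant names and values mirror the bullets above but may differ from the actual MCM constants; isPDBMisconfiguredErr is a hypothetical placeholder, the real implementation would classify the eviction error precisely):

```go
package drain

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// Retry periods from the proposal above.
const (
	podEvictionRetryInterval = 20 * time.Second // normal per-pod eviction retry
	mediumRetryPeriod        = 3 * time.Minute  // misconfigured PDB / multiple PDBs
)

// isPDBMisconfiguredErr is a hypothetical placeholder: treat API-server
// internal errors on eviction as the misconfigured-PDB case. The real check
// would inspect the error more precisely.
func isPDBMisconfiguredErr(err error) bool {
	return apierrors.IsInternalError(err)
}

// retryPeriodForEviction decides how long drainNode() should wait before the
// next attempt: a medium retry for the misconfiguration cases, the usual
// short per-pod retry otherwise. matchingPDBs could come from a lookup such
// as pdbsMatchingPod above.
func retryPeriodForEviction(evictionErr error, matchingPDBs int) time.Duration {
	if isPDBMisconfiguredErr(evictionErr) || matchingPDBs > 1 {
		return mediumRetryPeriod
	}
	return podEvictionRetryInterval
}
```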