gardener/machine-controller-manager

Node label/taint/annotation placing logic needs to be improved

Closed this issue · 6 comments

How to categorize this issue?

/area performance
/kind discussion
/kind enhancement
/priority 3

What would you like to be added:
MCM needs to place node annotations,labels,taints(ALT) configured more efficiently. Refer Canary issue # 2918

Why is this needed:
There could be reasons for delay in applying ALT to nodes by MCM. Some of the reasons could be:

  1. API server throttling
  2. MCM crashes and recovers and in between nodes were marked as Ready allowing pods to be scheduled onto them.
    This impacts correct scheduling of workloads onto nodes controlled either via node labels or taints.

CCM initialization follows a mechanism where kubelet places the following taints

"taints": [
                    {
                        "effect": "NoSchedule",
                        "key": "node.cloudprovider.kubernetes.io/uninitialized",
                        "value": "true"
                    }

on the node , and then CCM places the annotations and then removes this taint.

We could follow a similar approach for MCM
Taints to be placed on node, could be configured on the kubelet via --register-with-taints flag.

Upside of approach

  • more deterministic
  • quicker to implement

Downside of approach

  • delay in node in becoming completely ready with all ALT applied even for nodes which do not have any taints or node labels which influence scheduling of workloads onto node(s).

The second approach is using mutating webhook to apply ALT

Upside of approach

  • deterministic
  • less delay as during admission itself , the node will get the ALT

Downside of approach

  • mutating webhook acts only on CRUD requests , so only at that time ALT update would happen and not in between, while currently it happens on machine reconciliation which gets triggered on machine object update on ALT update.
    • one solution could be to mutate node objects yaml on machine object CRUD request related to ALT update, but I don't think this is allowed.
  • will take more time to implement

Post grooming discussion

We currently can't distinguish b/w important and un-important labels/annotations. Important labels/annotations are those which are required to be placed on the node immediately as they control scheduling of pods, while un-important are lets say for classification. So the taint logic for MCM to ensure all ALT are placed before pods are scheduled can't be used, as we can have problems if MCM is unavailable to remove the taint.

We identified another flaw in code which leads to late addition of ALT, and solving that would considerably improve the speed of ALT application. Once we swap these two calls it should be sufficient.

Regarding taints , all taints are important, so the plan should be:

  • pass the taint via kubelet --taint flag, for the taints we know during node creation
  • any change in the required taint will still be synced by MCM

But we need to read the documentation thoroughly before we implement the taint part.

We observe this issue from time to time that a pod is scheduled on a node which has some taints which are not tolerated by the pod.

Likewise we have observed this issue in AI Core, which leads during a Garden Linux upgrade (i.e. large number of nodes deleted & created) to non-GPU pods being scheduled to GPU nodes. The GPU-consuming pods no longer fit thus triggering scale-up of another GPU node. We therefore end up running expensive GPU nodes for non-GPU pods.