kubernetes/kubernetes

Ensure that pods are scheduled onto nodes that meet preferred conditions, while still satisfying the scheduler's filter plugins.

fanhaouu opened this issue · 6 comments

What would you like to be added?

/sig scheduling
/kind feature

Add a new plugin extension point for checking nodes, and modify the scheduling filter logic to prioritize nodes that satisfy the preferred check conditions by placing them at the beginning of the node array, so that the scheduler considers them first during each scheduling attempt.
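For illustration only, a rough sketch of what such an extension point could look like (the interface name and method signature are assumptions, not an existing scheduler framework API):

// Hypothetical "check preferred" extension point; all names here are
// illustrative assumptions, not part of the existing scheduler framework.
// Assumed imports: "context", v1 "k8s.io/api/core/v1".
type PreferredChecker interface {
    // Name returns the plugin name, mirroring the framework's Plugin interface.
    Name() string
    // CheckPreferred reports whether the node satisfies the pod's preferred
    // conditions; nodes that pass would be moved to the front of the node
    // array before the Filter phase runs.
    CheckPreferred(ctx context.Context, pod *v1.Pod, node *v1.Node) bool
}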

If the community feels this requirement is necessary, I will complete the corresponding KEP and code implementation work.

The current solution within our company is as follows, but I believe adding a 'check preferred' extension point would be better:
1. Allow users to add a specific annotation to pods with the key "xxx.k8s.io/preferred-plugin". The value of this annotation can be either "NodeAffinity" or "TaintToleration".

2. Determine which preferred feature to use during scheduling based on the annotation value (a dispatch sketch follows the two check functions below).

NodeAffinity:

// Assumed imports:
//   v1 "k8s.io/api/core/v1"
//   corev1nodeaffinity "k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
//   "k8s.io/klog/v2"
checkPreferred = func(node *v1.Node, pod *v1.Pod) bool {
    affinity := pod.Spec.Affinity
    if affinity != nil && affinity.NodeAffinity != nil && affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution != nil {
        terms, err := corev1nodeaffinity.NewPreferredSchedulingTerms(affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution)
        if err != nil {
            klog.ErrorS(err, "failed to parse pod's nodeaffinity", "pod", klog.KObj(pod))
            return false
        }
        if terms != nil && terms.Score(node) > 0 {
            return true
        }
    }
    return false
}

TaintToleration:

// Assumed imports:
//   v1 "k8s.io/api/core/v1"
//   v1helper "k8s.io/kubernetes/pkg/apis/core/v1/helper"
checkPreferred = func(node *v1.Node, pod *v1.Pod) bool {
    var filterTolerations []v1.Toleration
    for _, toleration := range pod.Spec.Tolerations {
        if toleration.Effect != v1.TaintEffectPreferNoSchedule {
            continue
        }
        filterTolerations = append(filterTolerations, toleration)
    }
    if len(node.Spec.Taints) != 0 && len(filterTolerations) != 0 {
        for _, taint := range node.Spec.Taints {
            // check only on taints that have effect PreferNoSchedule
            if taint.Effect != v1.TaintEffectPreferNoSchedule {
                continue
            }
 
            if v1helper.TolerationsTolerateTaint(filterTolerations, &taint) {
                return true
            }
        }
    }
    return false
}
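
Tying steps 1 and 2 together, a minimal sketch of how the annotation value could select one of the checks (checkNodeAffinityPreferred and checkTaintTolerationPreferred are placeholders for the two closures shown above):

// selectCheckPreferred picks the check function based on the pod annotation
// from step 1; it returns nil when no preferred check was requested.
// Assumed import: v1 "k8s.io/api/core/v1".
func selectCheckPreferred(pod *v1.Pod) func(node *v1.Node, pod *v1.Pod) bool {
    switch pod.Annotations["xxx.k8s.io/preferred-plugin"] {
    case "NodeAffinity":
        return checkNodeAffinityPreferred
    case "TaintToleration":
        return checkTaintTolerationPreferred
    default:
        return nil
    }
}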

3. Divide nodes into two groups, "passChecked" and "noPassChecked", based on whether they satisfy the preferred check.

4. To keep the scheduling probability equal for each node, randomly shuffle both the "passChecked" and "noPassChecked" groups.

5. Rebuild the node array by concatenating the two groups, with the "passChecked" nodes placed before the "noPassChecked" nodes (a rough sketch of steps 3 to 5 follows step 7).

6. Call the "findNodesThatPassFilters" method to search for feasible nodes in the new node array.

7. If the length of "passChecked" is 0, adjust the value of "nextStartNodeIndex"; otherwise, leave it unchanged.
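
A rough sketch of steps 3 to 5 (function and variable names are illustrative assumptions):

// reorderNodes partitions nodes by the preferred check (step 3), shuffles each
// group to keep per-node scheduling probabilities even (step 4), and places the
// passing group first (step 5). The caller then runs findNodesThatPassFilters
// on the returned slice (step 6) and only advances nextStartNodeIndex when
// numPassChecked is 0 (step 7).
// Assumed imports: "math/rand", v1 "k8s.io/api/core/v1".
func reorderNodes(nodes []*v1.Node, pod *v1.Pod, checkPreferred func(*v1.Node, *v1.Pod) bool) (reordered []*v1.Node, numPassChecked int) {
    var passChecked, noPassChecked []*v1.Node
    for _, node := range nodes {
        if checkPreferred != nil && checkPreferred(node, pod) {
            passChecked = append(passChecked, node)
        } else {
            noPassChecked = append(noPassChecked, node)
        }
    }
    rand.Shuffle(len(passChecked), func(i, j int) {
        passChecked[i], passChecked[j] = passChecked[j], passChecked[i]
    })
    rand.Shuffle(len(noPassChecked), func(i, j int) {
        noPassChecked[i], noPassChecked[j] = noPassChecked[j], noPassChecked[i]
    })
    return append(passChecked, noPassChecked...), len(passChecked)
}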

Why is this needed?

Currently, for performance reasons, the kube-scheduler follows this scheduling logic:
1. It starts filtering feasible nodes from nextStartNodeIndex and stops once a specific number of nodes that pass all Filter plugins have been found (by default, this number is 100); a simplified sketch follows step 3.

2. Then, it applies Score plugins to assign scores to these feasible nodes.

3. Finally, it selects the node with the highest score for scheduling.
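
A heavily simplified sketch of that behavior (not the actual kube-scheduler code; names and parameters are illustrative):

// findFeasibleNodes walks the node list starting at nextStartNodeIndex,
// wrapping around, and stops once numNodesToFind nodes have passed all
// Filter plugins. Only this feasible subset is then scored.
// Assumed import: v1 "k8s.io/api/core/v1".
func findFeasibleNodes(nodes []*v1.Node, pod *v1.Pod, passesFilters func(*v1.Node, *v1.Pod) bool, nextStartNodeIndex, numNodesToFind int) []*v1.Node {
    var feasible []*v1.Node
    for i := 0; i < len(nodes) && len(feasible) < numNodesToFind; i++ {
        node := nodes[(nextStartNodeIndex+i)%len(nodes)]
        if passesFilters(node, pod) {
            feasible = append(feasible, node)
        }
    }
    return feasible
}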

However, because each scheduling attempt only evaluates a subset of the nodes and multiple Score plugins contribute to the final score, pods often end up not being scheduled onto the nodes users expect.

If we can add a new extension point to check nodes, then we can prioritize scheduling pods onto the desired nodes.

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


I assume that your goal is to (try to) make sure a pod with a preferred affinity and taint toleration is scheduled to a node which matches the node affinity and also has the tolerated taint?
Any specific use case for this behavior?


When node resources are available, I want pods to be scheduled onto specific nodes as much as possible. However, the cluster has many Score plugins enabled, with weights predefined by SREs, which makes it hard for users to adjust them dynamically. Meanwhile, for performance reasons, the scheduler only traverses and evaluates a subset of nodes. This often leads to suboptimal scheduling results.