kubernetes-sigs/karpenter

Scheduling simulation seems to take previous antiAffinity "topologyKey" instead of new updated one

yogeek opened this issue · 9 comments

Description

Observed Behavior:

A "debug" nodepool is configured with a taint.

A deployment is deployed to this nodepool with :

  • 1 replica
  • a nodeSelector + toleration to go to a "debug" node
  • the default rolling update strategy (adding a new pod before deleting the old one)
  • an antiAffinity using the deprecated failure-domain.beta.kubernetes.io/hostname. Why? This is a real use case: while investigating problems with Karpenter node replacement after expiration, we found that some of our users still use this deprecated topologyKey in their anti-affinity config. As they cannot fix this for now (production constraints), we are trying to find a way to unblock node replacement:
    • First, we tried adding the deprecated label to all our nodes, but Karpenter did not add a new node to schedule the new pods. We guessed this may be because the label is deprecated. Is it?
    • Second, we tried replacing the deprecated topologyKey (failure-domain.beta.kubernetes.io/hostname) with the valid one (kubernetes.io/hostname) in one deployment, but here again Karpenter does not create a new node. Hence the current issue.
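For readers wondering why a single-replica deployment ends up with two pods at once: with the default rolling update settings (maxSurge 25%, maxUnavailable 25%), Kubernetes rounds the surge up and the unavailable count down. A minimal sketch of that arithmetic follows; the function name is made up for illustration.

```python
def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Return (extra pods allowed, pods allowed to be unavailable).

    Percentages are converted to pod counts with integer arithmetic:
    the surge is rounded up, the unavailable count is rounded down.
    """
    surge = (replicas * max_surge_pct + 99) // 100       # ceiling
    unavailable = (replicas * max_unavailable_pct) // 100  # floor
    return surge, unavailable


# With 1 replica: one extra pod may be created, and zero pods may be
# unavailable, so the new pod must be Running before the old one is removed.
print(rolling_update_bounds(1))  # (1, 0)
```

With 1 replica the old pod therefore stays on the node until the new pod is Ready, which is why the anti-affinity rule needs a second node during the rollout.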

Deployment :

spec:
  replicas: 1
  [...]
  template:
    spec:
      nodeSelector:
        node_group: debug
      tolerations:
      - effect: NoSchedule
        key: node_group
        operator: Equal
        value: debug
      # ----------------------- pod antiAffinity
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - foo
            namespaces:
            - foo
            topologyKey: failure-domain.beta.kubernetes.io/hostname

Karpenter creates a "debug" node and the pod is scheduled there.
Only 1 "debug" node exists for now.

We edit the deployment to update the topologyKey :

topologyKey: failure-domain.beta.kubernetes.io/hostname
replaced by
topologyKey: kubernetes.io/hostname

A rolling update is triggered :

  • a new pod is created and becomes "Pending" as it cannot be scheduled to the current debug node because of the pod antiAffinity (the current pod already being there)
  • karpenter does not create a new "debug" node to schedule this new pod and logs this:
incompatible with nodepool \"debug\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, 
unsatisfiable topology constraint for pod anti-affinity, key=failure-domain.beta.kubernetes.io/hostname 
(counts = map[ip-10-10-101-122.eu-central-1.compute.internal:1], 
podDomains = failure-domain.beta.kubernetes.io/hostname Exists, 
nodeDomains = failure-domain.beta.kubernetes.io/hostname Exists);

It is as if Karpenter still takes the old label into account in its scheduling simulation, even though we replaced it with a valid one.

The new pod stays blocked in "Pending" state and the rolling update cannot succeed...
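The required anti-affinity rule that keeps the new pod off the existing node can be modelled roughly as below. This is a simplified sketch of the general Kubernetes rule, not Karpenter's simulation code; the node names are hypothetical, and treating a node without the topology label as non-conflicting is an assumption of this sketch.

```python
def blocked_domains(nodes, matching_pods, topology_key):
    """Topology values already occupied by a pod matching the label selector."""
    domains = set()
    for pod in matching_pods:
        value = nodes[pod["node"]].get(topology_key)
        if value is not None:
            domains.add(value)
    return domains


def can_schedule(node_labels, blocked, topology_key):
    """True if placing the pod on this node does not violate the anti-affinity."""
    value = node_labels.get(topology_key)
    # Assumption: a node that lacks the label cannot conflict.
    return value is None or value not in blocked


# Hypothetical cluster: node-a runs the old pod, node-b would be a new node.
nodes = {
    "node-a": {"kubernetes.io/hostname": "node-a"},
    "node-b": {"kubernetes.io/hostname": "node-b"},
}
running = [{"name": "debug-old", "node": "node-a"}]

blocked = blocked_domains(nodes, running, "kubernetes.io/hostname")
print(can_schedule(nodes["node-a"], blocked, "kubernetes.io/hostname"))  # False
print(can_schedule(nodes["node-b"], blocked, "kubernetes.io/hostname"))  # True
```

In this model the only way to place the new pod is a node in a different topology domain, which is what a new "debug" node would provide.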

Expected Behavior:

I would understand if Karpenter did not want to create a node because of the deprecated label, but after we fixed this label, I would expect Karpenter to create a new "debug" node so that the new pod, given its antiAffinity, can be scheduled on it.

Reproduction Steps (Please include YAML):

debug-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: debug
  labels:
    app: debug
spec:
  replicas: 1
  selector:
    matchLabels:
      app: debug
  template:
    metadata:
      labels:
        app: debug
    spec:
      containers:
      - name: pause-container
        image: k8s.gcr.io/pause:3.4.1
        resources:
          limits:
            cpu: '100m'
            memory: 40Mi
          requests:
            cpu: '10m'
            memory: 10Mi
      nodeSelector:
        node_group: debug
      tolerations:
      - effect: NoSchedule
        key: node_group
        operator: Equal
        value: debug
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - debug
            namespaces:
            - debug
            topologyKey: failure-domain.beta.kubernetes.io/hostname # <<< This will be commented later
            # topologyKey: kubernetes.io/hostname                              # <<< This will be uncommented later

Initial status : no "debug" node is present

  • Create the debug deployment
kubectl apply -f debug-deploy.yaml

Karpenter creates a nodeClaim for the "debug" nodepool.
Wait for a debug node to be up and the debug pod to be running on it.

  • Try a rolling update
kubectl rollout restart deployment/debug

The new pod is Pending.
Karpenter does not create a new nodeClaim for a debug node and logs this:

incompatible with nodepool \"debug\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, 
unsatisfiable topology constraint for pod anti-affinity, key=failure-domain.beta.kubernetes.io/hostname 
(counts = map[ip-10-10-XXX-YYY.eu-central-1.compute.internal:1], 
podDomains = failure-domain.beta.kubernetes.io/hostname Exists, 
nodeDomains = failure-domain.beta.kubernetes.io/hostname Exists);

Undo the rollout.

kubectl rollout undo deployment/debug 

Edit the deployment and replace the topologyKey with the valid one, topologyKey: kubernetes.io/hostname
(comment out the deprecated one, uncomment the valid one).

This triggers a new rollout.
But the new pod stays Pending.
Karpenter does not create a new nodeClaim for a debug node and logs the same error as above.

NOTES :

  • if I create the deployment with the valid topologyKey from the beginning, everything works correctly.
  • if I pre-provision debug nodes, scheduling also works correctly

So it seems that Karpenter does not take the new topologyKey into account in its scheduling simulation when we edit it after creation...?

Versions:

  • Chart Version: v0.33.0
  • Kubernetes Version (kubectl version): 1.27.14


I couldn't find any information about failure-domain.beta.kubernetes.io/hostname, so your concern might be invalid. However, I found similar keys that have been deprecated. You might want to refer to:

https://kubernetes.io/docs/reference/labels-annotations-taints/#failure-domainbetakubernetesioregion

@Vacant2333 thanks for your help
You are right, my mistake: failure-domain.beta.kubernetes.io/hostname has never been a registered label (it was a bad guess based on the fact that failure-domain.beta.kubernetes.io/zone and failure-domain.beta.kubernetes.io/region were registered labels before their deprecation)

However, I do not understand why, when I fix it by replacing it with the valid one, kubernetes.io/hostname, Karpenter still complains and mentions the old one in the logs...

Seems like a bug to me but I am curious to know your thoughts on this


Hi, can you show me the logs in detail? I will try to find the reason.

Sure @Vacant2333, here are the logs from the karpenter pod just after I edited the deployment to update the topologyKey to kubernetes.io/hostname:

karpenter-55599bc687-p6zrz controller {"level":"ERROR","time":"2024-10-25T10:12:02.692Z","logger":"controller.provisioner","message":"Could not schedule pod, incompatible with nodepool \"ci\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, did not tolerate node_group=ci:NoSchedule; incompatible with nodepool \"debug\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, unsatisfiable topology constraint for pod anti-affinity, key=failure-domain.beta.kubernetes.io/hostname (counts = map[ip-10-10-102-117.eu-central-1.compute.internal:1], podDomains = failure-domain.beta.kubernetes.io/hostname Exists, nodeDomains = failure-domain.beta.kubernetes.io/hostname Exists); incompatible with nodepool \"monitoring\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, did not tolerate node_group=monitoring:NoSchedule; incompatible with nodepool \"stable\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, did not tolerate node_group=stable:NoSchedule; incompatible with nodepool \"standard\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, incompatible requirements, key node_group, node_group In [debug] not in node_group In [standard]","commit":"2dd7fdc","pod":"debug/debug-5b444b6f5-578wb"}

the relevant part being the one from my first message :

incompatible with nodepool \"debug\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, unsatisfiable topology constraint for pod anti-affinity, key=failure-domain.beta.kubernetes.io/hostname (counts = map[ip-10-10-102-117.eu-central-1.compute.internal:1], podDomains = failure-domain.beta.kubernetes.io/hostname Exists, nodeDomains = failure-domain.beta.kubernetes.io/hostname Exists); 

and the Events from the new "Pending" pod are :
[image: Events of the new "Pending" pod]

Why does Karpenter still mention the previous topologyKey instead of the new one...?
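To see at a glance which pod and which topologyKey Karpenter is complaining about, the JSON part of the log line can be parsed. A minimal sketch, assuming the line has the "message" and "pod" fields shown above and that any prefix added by the log viewer (pod name, container name) has been stripped; the sample message is shortened from the real one.

```python
import json
import re

NODEPOOL_RE = re.compile(r'incompatible with nodepool "([^"]+)"')
KEY_RE = re.compile(r'anti-affinity,\s*key=(\S+)')


def anti_affinity_keys_by_nodepool(log_line):
    """Return (pod, {nodepool: topologyKey}) from one JSON log entry."""
    entry = json.loads(log_line)
    message = entry["message"]
    matches = list(NODEPOOL_RE.finditer(message))
    keys = {}
    for i, match in enumerate(matches):
        # Each nodepool's reasons run until the next "incompatible with nodepool".
        end = matches[i + 1].start() if i + 1 < len(matches) else len(message)
        found = KEY_RE.search(message, match.end(), end)
        if found:
            keys[match.group(1)] = found.group(1)
    return entry.get("pod"), keys


# Shortened version of the log entry above (daemonset overhead and counts elided).
sample = json.dumps({
    "level": "ERROR",
    "message": (
        'Could not schedule pod, incompatible with nodepool "ci", '
        'did not tolerate node_group=ci:NoSchedule; '
        'incompatible with nodepool "debug", '
        'unsatisfiable topology constraint for pod anti-affinity, '
        'key=failure-domain.beta.kubernetes.io/hostname (counts = map[...])'
    ),
    "pod": "debug/debug-5b444b6f5-578wb",
})

print(anti_affinity_keys_by_nodepool(sample))
```

The "pod" field names the exact pod whose spec was used in the simulation, so that pod's own spec is the first place to check for the reported key.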

Did you find out what the reason is? It's strange, because Karpenter uses the latest pod information for each scheduling simulation. Has the old pod been completely deleted? I can't find the reason in my environment because I can't reproduce it.

@Vacant2333 no we still have the issue and did not find the reason.

We have the issue right now :

  • all the deployments with the wrong topologyKey were fixed on the cluster
  • we edited the nodepools to update the AMI ID (upgrade k8s from 1.27 to 1.28)
  • karpenter starts rolling the nodes
  • some nodes were rolled successfully
  • some nodes are blocked because of the same error:
not all pods would schedule, <NS>/<POD_ID> => 
incompatible with nodepool "standard", daemonset overhead={"cpu":"1121m","memory":"1824Mi","pods":"13"}, 
unsatisfiable topology constraint for pod anti-affinity, 
key=failure-domain.beta.kubernetes.io/hostname (counts = map[ip-10-10-XXX-YYY.eu-central-1.compute.internal:1], 
podDomains = failure-domain.beta.kubernetes.io/hostname Exists, 
nodeDomains = failure-domain.beta.kubernetes.io/hostname Exists);

and if I look at the corresponding deployment, the topologyKey is the right one (kubernetes.io/hostname). The only place where I can find the failure-domain.beta.kubernetes.io/hostname mentioned in the karpenter logs is the kubectl.kubernetes.io/last-applied-configuration annotation:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment",
      "metadata":{...}
      "spec":{...
         "affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"name","operator":"In","values":["configmanager"]}]},"namespaces":["<NS>"],"topologyKey":"failure-domain.beta.kubernetes.io/hostname"}]}},
      {...}
    mutated-by-kyverno-policy: mutate-deprecated-topologykey
  name: configmanager
  namespace: <NS>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: configmanager
  template:
    metadata:
      labels:
        app: configmanager
        name: configmanager
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: name
                operator: In
                values:
                - configmanager
            namespaces:
            - keycore-debug-1
            topologyKey: kubernetes.io/hostname
[...]
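The last-applied-configuration annotation is only a record that kubectl apply keeps to compute later patches; schedulers work from the live pod specs. A small sketch for comparing the two, assuming the deployment is available as a dict shaped like the output of kubectl get deployment -o json (the sample objects below are made up and heavily trimmed):

```python
import json

ANNOTATION = "kubectl.kubernetes.io/last-applied-configuration"


def anti_affinity_keys(deployment):
    """topologyKeys of the required pod anti-affinity terms in the pod template."""
    pod_spec = deployment["spec"]["template"]["spec"]
    terms = (
        pod_spec.get("affinity", {})
        .get("podAntiAffinity", {})
        .get("requiredDuringSchedulingIgnoredDuringExecution", [])
    )
    return [term["topologyKey"] for term in terms]


def compare_with_last_applied(deployment):
    """Return the live topologyKeys next to those recorded in the annotation."""
    raw = deployment["metadata"].get("annotations", {}).get(ANNOTATION)
    last_applied = anti_affinity_keys(json.loads(raw)) if raw else None
    return {"live": anti_affinity_keys(deployment), "last_applied": last_applied}


def _template(key):
    # Helper for the hypothetical sample data only.
    return {"spec": {"affinity": {"podAntiAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [{"topologyKey": key}]}}}}


old = {"metadata": {}, "spec": {"template": _template(
    "failure-domain.beta.kubernetes.io/hostname")}}
live = {
    "metadata": {"annotations": {ANNOTATION: json.dumps(old)}},
    "spec": {"template": _template("kubernetes.io/hostname")},
}

print(compare_with_last_applied(live))
```

A difference between the two lists only shows that the object was changed by something other than kubectl apply (here, a mutating policy); it does not by itself change what gets scheduled.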

Additional information: on the node that Karpenter does not manage to consolidate, if I run "kubectl drain" (without force), the node is drained correctly

UPDATE : in fact it was not due to the lastAppliedConfiguration at all...😅

After some more digging, we found out that the issue came from some deployments that were in a weird state:
one deployment had 2 replica pods, but they belonged to 2 different ReplicaSets!
The most recent ReplicaSet contained the topologyKey fix, so its pod did not block the node roll, but the second pod was attached to an old ReplicaSet that still contained the "wrong" topologyKey, and the roll was blocked because of this second pod.

We do not understand how this is possible...

After removing the "wrong" pods and the corresponding ReplicaSets, everything is OK now.
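For anyone hitting the same symptom, here is a rough sketch of how such leftover pods could be spotted: compare each running pod's anti-affinity topologyKeys with the deployment's current pod template. It assumes dicts shaped like kubectl get -o json output; all object names in the sample are hypothetical.

```python
def topology_keys(pod_spec):
    """topologyKeys of the required pod anti-affinity terms of a pod spec."""
    terms = (
        pod_spec.get("affinity", {})
        .get("podAntiAffinity", {})
        .get("requiredDuringSchedulingIgnoredDuringExecution", [])
    )
    return sorted(term["topologyKey"] for term in terms)


def stale_pods(deployment, pods):
    """Pods whose topologyKeys differ from the deployment's current template."""
    wanted = topology_keys(deployment["spec"]["template"]["spec"])
    stale = []
    for pod in pods:
        if topology_keys(pod["spec"]) != wanted:
            owners = [
                ref["name"]
                for ref in pod["metadata"].get("ownerReferences", [])
                if ref["kind"] == "ReplicaSet"
            ]
            stale.append((pod["metadata"]["name"], owners))
    return stale


def _spec(key):
    # Helper for the hypothetical sample data only.
    return {"affinity": {"podAntiAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [{"topologyKey": key}]}}}


def _pod(name, replicaset, key):
    return {
        "metadata": {"name": name,
                     "ownerReferences": [{"kind": "ReplicaSet", "name": replicaset}]},
        "spec": _spec(key),
    }


deployment = {"spec": {"template": {"spec": _spec("kubernetes.io/hostname")}}}
pods = [
    _pod("configmanager-new-1", "configmanager-new", "kubernetes.io/hostname"),
    _pod("configmanager-old-1", "configmanager-old",
         "failure-domain.beta.kubernetes.io/hostname"),
]

print(stale_pods(deployment, pods))
# [('configmanager-old-1', ['configmanager-old'])]
```

Any pod reported here still carries the old constraint regardless of what the deployment says, and its owning ReplicaSet is the one to inspect.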