kubernetes/kubernetes

v1.30: kube-scheduler crashes with: Observed a panic: "integer divide by zero"

mikkeloscar opened this issue · 17 comments

What happened?

On Kubernetes v1.30.0 (and v1.30.1), kube-scheduler can crash with the following panic if a pod is defined in a certain way:

W0514 09:09:41.391780       1 feature_gate.go:246] Setting GA feature gate MinDomainsInPodTopologySpread=true. It will be removed in a future release.
I0514 09:09:43.191448       1 serving.go:380] Generated self-signed cert in-memory
W0514 09:09:43.574824       1 authentication.go:446] failed to read in-cluster kubeconfig for delegated authentication: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0514 09:09:43.574862       1 authentication.go:339] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0514 09:09:43.574871       1 authentication.go:363] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0514 09:09:43.574885       1 authorization.go:225] failed to read in-cluster kubeconfig for delegated authorization: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0514 09:09:43.574992       1 authorization.go:193] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0514 09:09:43.586307       1 server.go:154] "Starting Kubernetes Scheduler" version="v1.30.0-zalando-master-117-dirty"
I0514 09:09:43.586327       1 server.go:156] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0514 09:09:43.587758       1 secure_serving.go:213] Serving securely on [::]:10259
I0514 09:09:43.587944       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0514 09:09:43.688733       1 leaderelection.go:250] attempting to acquire leader lease kube-system/kube-scheduler...
I0514 09:09:59.030033       1 leaderelection.go:260] successfully acquired lease kube-system/kube-scheduler
E0514 09:09:59.122043       1 runtime.go:79] Observed a panic: "integer divide by zero" (runtime error: integer divide by zero)
goroutine 330 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1b68bc0, 0x35e1c40})
	k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x7c
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x4000be8e00?})
	k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x78
panic({0x1b68bc0?, 0x35e1c40?})
	runtime/panic.go:770 +0x124
k8s.io/kubernetes/pkg/scheduler.(*Scheduler).findNodesThatFitPod(0x40000d6900, {0x22aae08, 0x40012b9630}, {0x22d2bb8, 0x4000320fc8}, 0x4000347dc0, 0x4000cb4488)
	k8s.io/kubernetes/pkg/scheduler/schedule_one.go:502 +0x88c
k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulePod(0x40000d6900, {0x22aae08, 0x40012b9630}, {0x22d2bb8, 0x4000320fc8}, 0x4000347dc0, 0x4000cb4488)
	k8s.io/kubernetes/pkg/scheduler/schedule_one.go:402 +0x25c
k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulingCycle(0x40000d6900, {0x22aae08, 0x40012b9630}, 0x4000347dc0, {0x22d2bb8, 0x4000320fc8}, 0x4000c92e10, {0x2?, 0x2?, 0x36352c0?}, ...)
	k8s.io/kubernetes/pkg/scheduler/schedule_one.go:149 +0xb8
k8s.io/kubernetes/pkg/scheduler.(*Scheduler).ScheduleOne(0x40000d6900, {0x22aae08, 0x40012b83c0})
	k8s.io/kubernetes/pkg/scheduler/schedule_one.go:111 +0x4c0
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
	k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x2c
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x400161dec8?)
	k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x40
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x400161df68, {0x2286fe0, 0x4001398b70}, 0x1, 0x40012bab40)
	k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0x90
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x4000669f68, 0x0, 0x0, 0x1, 0x40012bab40)
	k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x80
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x22aae08, 0x40012b83c0}, 0x4001385720, 0x0, 0x0, 0x1)
	k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x80
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
	k8s.io/apimachinery/pkg/util/wait/backoff.go:170
created by k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run in goroutine 352
	k8s.io/kubernetes/pkg/scheduler/scheduler.go:445 +0x104
panic: runtime error: integer divide by zero [recovered]
	panic: runtime error: integer divide by zero

goroutine 330 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x4000be8e00?})
	k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xe0
panic({0x1b68bc0?, 0x35e1c40?})
	runtime/panic.go:770 +0x124
k8s.io/kubernetes/pkg/scheduler.(*Scheduler).findNodesThatFitPod(0x40000d6900, {0x22aae08, 0x40012b9630}, {0x22d2bb8, 0x4000320fc8}, 0x4000347dc0, 0x4000cb4488)
	k8s.io/kubernetes/pkg/scheduler/schedule_one.go:502 +0x88c
k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulePod(0x40000d6900, {0x22aae08, 0x40012b9630}, {0x22d2bb8, 0x4000320fc8}, 0x4000347dc0, 0x4000cb4488)
	k8s.io/kubernetes/pkg/scheduler/schedule_one.go:402 +0x25c
k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulingCycle(0x40000d6900, {0x22aae08, 0x40012b9630}, 0x4000347dc0, {0x22d2bb8, 0x4000320fc8}, 0x4000c92e10, {0x2?, 0x2?, 0x36352c0?}, ...)
	k8s.io/kubernetes/pkg/scheduler/schedule_one.go:149 +0xb8
k8s.io/kubernetes/pkg/scheduler.(*Scheduler).ScheduleOne(0x40000d6900, {0x22aae08, 0x40012b83c0})
	k8s.io/kubernetes/pkg/scheduler/schedule_one.go:111 +0x4c0
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
	k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x2c
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x400161dec8?)
	k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x40
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x400161df68, {0x2286fe0, 0x4001398b70}, 0x1, 0x40012bab40)
	k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0x90
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x4000669f68, 0x0, 0x0, 0x1, 0x40012bab40)
	k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x80
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x22aae08, 0x40012b83c0}, 0x4001385720, 0x0, 0x0, 0x1)
	k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x80
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
	k8s.io/apimachinery/pkg/util/wait/backoff.go:170
created by k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run in goroutine 352
	k8s.io/kubernetes/pkg/scheduler/scheduler.go:445 +0x104

The crash happens in findNodesThatFitPod (schedule_one.go:502 in the trace above) because len(nodes) is 0 in certain cases.
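
For illustration, a minimal standalone Go sketch (not the scheduler code itself; the variable names only mirror the snippet quoted further down) showing why an empty feasible-node list produces exactly this runtime error:

package main

import "fmt"

func main() {
	// nodes stands in for the feasible nodes left after PreFilter; it is empty
	// here, mirroring a pod whose node affinity matches no existing node.
	var nodes []string

	nextStartNodeIndex := 0
	processedNodes := 0

	// Taking a modulo by len(nodes) == 0 is an integer division by zero and
	// panics with "runtime error: integer divide by zero".
	nextStartNodeIndex = (nextStartNodeIndex + processedNodes) % len(nodes)

	fmt.Println(nextStartNodeIndex) // never reached
}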

What did you expect to happen?

kube-scheduler should not crash; this does not happen on v1.29.4.

On v1.29.4, kube-scheduler only prints error logs like the following and does not crash:

E0517 13:19:26.611504       1 schedule_one.go:1003] "Error scheduling pod; retrying" err="nodeinfo not found for node name \"invalid-node\"" pod="default/break-kube-scheduler"

How can we reproduce it (as minimally and precisely as possible)?

Create a pod like this:

apiVersion: v1
kind: Pod
metadata:
  name: break-kube-scheduler
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - invalid-node # a node that doesn't exist
  containers:
  - name: main
    image: alpine
    command: ["cat"]
    stdin: true

The important part is that the affinity doesn't match a real/valid node.

Once this pod is processed by kube-scheduler, the scheduler crashes and keeps crashing until the pod is deleted.

Anything else we need to know?

This issue was triggered during the rotation of control-plane nodes. In our setup we update the control plane by scaling from 1 to 2 instances. Once both are ready, the old instance is terminated via the EC2 API. At this stage kube-controller-manager sometimes creates a replacement DaemonSet pod while the old node is being deleted, which results in a pod targeting a node that no longer exists via affinity, as illustrated in the example above. For some reason, while kube-scheduler is crashing, kube-controller-manager doesn't delete the extra/invalid DaemonSet pod. It's unclear whether this is a separate issue or has always happened in our setup, but only v1.30 makes kube-scheduler crash, which turns it into an actual problem.

Kubernetes version

$ kubectl version
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1

Cloud provider

We run a custom Kubernetes setup on AWS EC2 (not EKS).

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux ip-10-149-91-92 6.5.0-1020-aws #20~22.04.1-Ubuntu SMP Wed May  1 16:38:06 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux


Install tools

Custom tooling: https://github.com/zalando-incubator/kubernetes-on-aws

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)


/sig scheduling

cc @alculquicondor
/priority important-soon

@chengjoey @AxeZhan can you take a look?

I believe all the latest patch releases are affected too, because of #124559

/priority critical-urgent

This crash happens when PreFilter plugins filter out all nodes ...

@alculquicondor

sched.nextStartNodeIndex = (sched.nextStartNodeIndex + processedNodes) % len(nodes)

If PreFilter filtered out some nodes, nodes here is a subset of allNodes (and can even be empty).
Why are we using this subset in the first place? I think % len(allNodes) should work fine (it seems logical to me).

I think so. In general, we just want to try a different set of nodes.

For the case of DaemonSets, it doesn't really matter, as we will just test one node.
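
A minimal, self-contained sketch of the direction discussed above (illustrative names only, not the actual patch): advance the round-robin cursor modulo the full node count and guard the empty case:

package main

import "fmt"

// nextStartIndex advances a round-robin cursor over all nodes and guards the
// zero-node case, so a pod whose affinity matches nothing can no longer cause
// a divide-by-zero. Names are illustrative; the real change is in schedule_one.go.
func nextStartIndex(current, processed, totalNodes int) int {
	if totalNodes == 0 {
		return 0
	}
	return (current + processed) % totalNodes
}

func main() {
	fmt.Println(nextStartIndex(3, 2, 5)) // 0
	fmt.Println(nextStartIndex(3, 2, 0)) // 0 instead of a panic
}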

/assign

Successfully caught this with a unit test.
I'll file a small PR with the fix from #124930 (comment) and a newly added UT.
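
For illustration, a hedged sketch of the kind of unit test that would catch the zero-node case, exercising the illustrative helper from the sketch above rather than the scheduler's real test harness:

package main

import "testing"

// TestNextStartIndexZeroNodes covers the case that crashed the scheduler:
// the feasible-node list is empty because the pod's affinity matches nothing.
func TestNextStartIndexZeroNodes(t *testing.T) {
	if got := nextStartIndex(3, 2, 0); got != 0 {
		t.Fatalf("expected 0 for zero nodes, got %d", got)
	}
	if got := nextStartIndex(3, 2, 5); got != 0 {
		t.Fatalf("expected (3+2) %% 5 == 0, got %d", got)
	}
}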