Pod stuck in Pending due to resource overuse
selinnilesy opened this issue · 1 comment
Hi,
I am allocating only 1 GB of the 24 GB of GPU memory shown in my node's labels to the GPU operator. I also have another GPU device plugin (the default one) in my cluster, but I have set up the necessary affinity configuration to prevent both from running at the same time. The problem: my pod (the sleep pod shared in the documentation) stays stuck in Pending with a resource-overuse reason and never gets scheduled. The MPS server occupies even less than 1 GB on my GPU and appears to be running according to the output of nvidia-smi.
I have followed the steps in the documentation about user 1000 and made the necessary gpu-operator configuration changes (MIG strategy mixed, etc.).
Any help would be much appreciated.
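For reference, this is roughly what the pending pod requests. It is a minimal sketch of the sleep pod from the documentation; the resource name nvidia.com/gpu-1gb and the hostIPC/runAsUser settings reflect my understanding of how nos exposes 1 GB MPS slices, not an exact copy:

apiVersion: v1
kind: Pod
metadata:
  name: mps-partitioning-example
spec:
  hostIPC: true                    # MPS clients must share IPC with the MPS server
  securityContext:
    runAsUser: 1000                # same user the MPS server runs as
  containers:
    - name: sleepy
      image: busybox:latest
      command: ["sleep", "120"]
      resources:
        limits:
          nvidia.com/gpu-1gb: 1    # request one 1 GB MPS slice

For completeness, the output of kubectl get pods -A and kubectl describe node follows.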
NAMESPACE NAME READY STATUS RESTARTS AGE
calico-apiserver calico-apiserver-6dd8b8765c-7nm86 1/1 Running 0 23h
calico-apiserver calico-apiserver-6dd8b8765c-fp6bx 1/1 Running 0 23h
calico-system calico-kube-controllers-5c8ddb5dcf-tv4fw 1/1 Running 0 23h
calico-system calico-node-hzxml 1/1 Running 0 23h
calico-system calico-typha-d6688954-g547t 1/1 Running 0 23h
calico-system csi-node-driver-4qfps 2/2 Running 0 23h
default gpu-feature-discovery-nqrtb 1/1 Running 0 85m
default gpu-operator-787cd6f58-xn68k 1/1 Running 0 85m
default gpu-pod 0/1 Completed 0 3h35m
default mps-partitioning-example 0/1 Pending 0 3m16s
default nvidia-container-toolkit-daemonset-dj7xv 1/1 Running 0 85m
default nvidia-cuda-validator-4pmjv 0/1 Completed 0 56m
default nvidia-dcgm-exporter-pwfwb 1/1 Running 0 85m
default nvidia-device-plugin-daemonset-7p4b7 1/1 Running 0 85m
default nvidia-operator-validator-fr897 1/1 Running 0 85m
default release-name-node-feature-discovery-gc-5cbdb95596-9p5bn 1/1 Running 0 88m
default release-name-node-feature-discovery-master-788d855b45-fsz56 1/1 Running 0 88m
default release-name-node-feature-discovery-worker-dgcn5 1/1 Running 0 39m
kube-system coredns-5dd5756b68-tgdgf 1/1 Running 0 23h
kube-system coredns-5dd5756b68-wlxq2 1/1 Running 0 23h
kube-system etcd-selin-csl 1/1 Running 1553 23h
kube-system kube-apiserver-selin-csl 1/1 Running 30 23h
kube-system kube-controller-manager-selin-csl 1/1 Running 0 23h
kube-system kube-proxy-lslfg 1/1 Running 0 23h
kube-system kube-scheduler-selin-csl 1/1 Running 35 23h
nebuly-nvidia nvidia-device-plugin-1698187396-r7tpf 3/3 Running 0 32m
node-feature-discovery nfd-6q9tl 2/2 Running 0 14m
node-feature-discovery nfd-master-85f4bc48cf-dlw4q 1/1 Running 0 42m
node-feature-discovery nfd-worker-wln6p 1/1 Running 2 (42m ago) 42m
tigera-operator tigera-operator-94d7f7696-ff7kf 1/1 Running 0 23h
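To see the scheduler's exact complaint I describe the pending pod and check its events (outputs omitted here):

selin@selin-csl:~$ kubectl describe pod mps-partitioning-example
selin@selin-csl:~$ kubectl get events --field-selector involvedObject.name=mps-partitioning-example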
selin@selin-csl:~$ kubectl describe node selin-csl
Name: selin-csl
Roles: control-plane
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.FXSR=true
feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
feature.node.kubernetes.io/cpu-cpuid.IBPB=true
feature.node.kubernetes.io/cpu-cpuid.LAHF=true
feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR=true
feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
feature.node.kubernetes.io/cpu-cpuid.MPX=true
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
feature.node.kubernetes.io/cpu-cpuid.STIBP=true
feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
feature.node.kubernetes.io/cpu-cpuid.VMX=true
feature.node.kubernetes.io/cpu-cpuid.X87=true
feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
feature.node.kubernetes.io/cpu-cstate.enabled=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/cpu-model.family=6
feature.node.kubernetes.io/cpu-model.id=85
feature.node.kubernetes.io/cpu-model.vendor_id=Intel
feature.node.kubernetes.io/cpu-pstate.scaling_governor=powersave
feature.node.kubernetes.io/cpu-pstate.status=active
feature.node.kubernetes.io/cpu-pstate.turbo=true
feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
feature.node.kubernetes.io/cpu-rdt.RDTMBA=true
feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
feature.node.kubernetes.io/cpu-rdt.RDTMON=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
feature.node.kubernetes.io/kernel-version.full=6.2.0-34-generic
feature.node.kubernetes.io/kernel-version.major=6
feature.node.kubernetes.io/kernel-version.minor=2
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-0300_1002.present=true
feature.node.kubernetes.io/pci-0300_10de.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
kubernetes.io/arch=amd64
kubernetes.io/hostname=selin-csl
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node.kubernetes.io/exclude-from-external-load-balancers=
nos.nebuly.com/gpu-partitioning=mps
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.driver.minor=113
nvidia.com/cuda.driver.rev=01
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=2
nvidia.com/gfd.timestamp=1698184228
nvidia.com/gpu-driver-upgrade-state=upgrade-done
nvidia.com/gpu.compute.major=7
nvidia.com/gpu.compute.minor=5
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=pre-installed
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=turing
nvidia.com/gpu.machine=Precision-5820-Tower
nvidia.com/gpu.memory=24576
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-TITAN-RTX
nvidia.com/gpu.replicas=1
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=mixed
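(The describe output is truncated above; the part relevant to scheduling is the Capacity/Allocatable section, which I check separately:)

selin@selin-csl:~$ kubectl describe node selin-csl | grep -A 15 'Allocatable:'
selin@selin-csl:~$ kubectl get node selin-csl -o jsonpath='{.status.allocatable}'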
As an update, I obtained the logs of the nebuly-nos gpu-agent, which crashes with a CrashLoopBackOff error:
selin@selin-csl:~$ kubectl logs nos-1698250532-gpu-agent-gqfdz -n nebuly-nos
Defaulted container "nos-1698250532-gpu-agent" out of: nos-1698250532-gpu-agent, kube-rbac-proxy
{"level":"info","ts":1698250893.187797,"logger":"setup","msg":"loaded config","reportingInterval":10}
{"level":"info","ts":1698250893.5845299,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1698250893.6817641,"logger":"setup","msg":"Initializing NVML client"}
/gpuagent: symbol lookup error: /gpuagent: undefined symbol: nvmlErrorString
I assume the nebuly-nos NVML client is not linking properly against my driver libraries (the .so files under /usr/lib/x86_64-linux-gnu). Is there a way to fix this by pointing it at that path?
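I don't know whether nos exposes a setting for this, but the workaround I am considering is to patch the gpu-agent DaemonSet so the container mounts the host library directory and finds it via LD_LIBRARY_PATH. This is only a sketch: the DaemonSet name is inferred from the pod name, and whether the gpuagent binary picks up LD_LIBRARY_PATH (and whether that resolves the symbol lookup error) is an assumption on my part.

selin@selin-csl:~$ kubectl -n nebuly-nos patch daemonset nos-1698250532-gpu-agent --type=strategic -p '
spec:
  template:
    spec:
      volumes:
        - name: host-lib
          hostPath:
            path: /usr/lib/x86_64-linux-gnu   # where my NVIDIA .so files live
      containers:
        - name: nos-1698250532-gpu-agent
          env:
            - name: LD_LIBRARY_PATH           # assumption: the binary honours this
              value: /host-lib
          volumeMounts:
            - name: host-lib
              mountPath: /host-lib
              readOnly: true'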
Also, when trying the default plugin (with the affinity rules and the label on my node), I see 0 allocatable nvidia.com/gpu on the node. This is the log of the default plugin:
selin@selin-csl:~$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-x7tf6
I1025 20:56:31.907780 1 main.go:154] Starting FS watcher.
I1025 20:56:31.907824 1 main.go:161] Starting OS watcher.
I1025 20:56:31.908033 1 main.go:176] Starting Plugins.
I1025 20:56:31.908044 1 main.go:234] Loading configuration.
I1025 20:56:31.908130 1 main.go:242] Updating config with default resource matching patterns.
I1025 20:56:31.908260 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I1025 20:56:31.908267 1 main.go:256] Retreiving plugins.
W1025 20:56:31.908466 1 factory.go:31] No valid resources detected, creating a null CDI handler
I1025 20:56:31.908495 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I1025 20:56:31.908518 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E1025 20:56:31.908523 1 factory.go:115] Incompatible platform detected
E1025 20:56:31.908526 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1025 20:56:31.908528 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1025 20:56:31.908530 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1025 20:56:31.908532 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I1025 20:56:31.908536 1 main.go:287] No devices found. Waiting indefinitely.
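The "could not load NVML library: libnvidia-ml.so.1" line makes me think the device-plugin container is not being started with the NVIDIA runtime at all. Assuming containerd is the container runtime on this node, this is how I am checking (and, if needed, setting) the default runtime; the device-plugin pod then has to be recreated:

selin@selin-csl:~$ sudo grep -n 'default_runtime_name\|runtimes.nvidia' /etc/containerd/config.toml
selin@selin-csl:~$ sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
selin@selin-csl:~$ sudo systemctl restart containerd
selin@selin-csl:~$ kubectl -n kube-system delete pod nvidia-device-plugin-daemonset-x7tf6   # the DaemonSet recreates it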