nebuly-ai/nos

Pod stuck in Pending due to resource overuse

selinnilesy opened this issue · 1 comment

Hi,

I am allocating only 1 GB of the 24 GB of GPU memory shown in my node's labels to the GPU operator. I also have another GPU device plugin (the default one) in my cluster, but I have configured the necessary affinity rules to prevent both from running at the same time. Basically, my pod (the sleep pod shared in the documentation) gets stuck in Pending with resource overuse given as the reason, and it never gets scheduled. The MPS server occupies even less than 1 GB on my GPU and appears to be running in the output of nvidia-smi.
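For clarity, this is the kind of anti-affinity I mean (a sketch of the relevant part of my values for the default NVIDIA device plugin; the label key is the nos partitioning label visible on my node further below):

```yaml
# Keep the default NVIDIA device plugin off nodes that nos manages for
# MPS partitioning. The rest of my values file is omitted here.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nos.nebuly.com/gpu-partitioning
              operator: DoesNotExist
```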

I have followed the steps in the documentation about running as user 1000 and made the necessary gpu-operator configuration changes (MIG strategy mixed, etc.).
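For reference, the pod I am applying is essentially the sleep example from the docs (reproduced from memory, so treat it as a sketch; the nvidia.com/gpu-1gb resource name reflects the 1 GB slice I am requesting):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mps-partitioning-example
spec:
  hostIPC: true                  # MPS clients need IPC access to the MPS server
  securityContext:
    runAsUser: 1000              # must match the user the MPS server runs as
  containers:
    - name: sleepy
      image: busybox:latest
      command: ["sleep", "120"]
      resources:
        limits:
          nvidia.com/gpu-1gb: 1  # one 1 GB MPS slice of the GPU
```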

Any help would be much appreciated.

NAMESPACE                NAME                                                          READY   STATUS      RESTARTS      AGE
calico-apiserver         calico-apiserver-6dd8b8765c-7nm86                             1/1     Running     0             23h
calico-apiserver         calico-apiserver-6dd8b8765c-fp6bx                             1/1     Running     0             23h
calico-system            calico-kube-controllers-5c8ddb5dcf-tv4fw                      1/1     Running     0             23h
calico-system            calico-node-hzxml                                             1/1     Running     0             23h
calico-system            calico-typha-d6688954-g547t                                   1/1     Running     0             23h
calico-system            csi-node-driver-4qfps                                         2/2     Running     0             23h
default                  gpu-feature-discovery-nqrtb                                   1/1     Running     0             85m
default                  gpu-operator-787cd6f58-xn68k                                  1/1     Running     0             85m
default                  gpu-pod                                                       0/1     Completed   0             3h35m
default                  mps-partitioning-example                                      0/1     Pending     0             3m16s
default                  nvidia-container-toolkit-daemonset-dj7xv                      1/1     Running     0             85m
default                  nvidia-cuda-validator-4pmjv                                   0/1     Completed   0             56m
default                  nvidia-dcgm-exporter-pwfwb                                    1/1     Running     0             85m
default                  nvidia-device-plugin-daemonset-7p4b7                          1/1     Running     0             85m
default                  nvidia-operator-validator-fr897                               1/1     Running     0             85m
default                  release-name-node-feature-discovery-gc-5cbdb95596-9p5bn       1/1     Running     0             88m
default                  release-name-node-feature-discovery-master-788d855b45-fsz56   1/1     Running     0             88m
default                  release-name-node-feature-discovery-worker-dgcn5              1/1     Running     0             39m
kube-system              coredns-5dd5756b68-tgdgf                                      1/1     Running     0             23h
kube-system              coredns-5dd5756b68-wlxq2                                      1/1     Running     0             23h
kube-system              etcd-selin-csl                                                1/1     Running     1553          23h
kube-system              kube-apiserver-selin-csl                                      1/1     Running     30            23h
kube-system              kube-controller-manager-selin-csl                             1/1     Running     0             23h
kube-system              kube-proxy-lslfg                                              1/1     Running     0             23h
kube-system              kube-scheduler-selin-csl                                      1/1     Running     35            23h
nebuly-nvidia            nvidia-device-plugin-1698187396-r7tpf                         3/3     Running     0             32m
node-feature-discovery   nfd-6q9tl                                                     2/2     Running     0             14m
node-feature-discovery   nfd-master-85f4bc48cf-dlw4q                                   1/1     Running     0             42m
node-feature-discovery   nfd-worker-wln6p                                              1/1     Running     2 (42m ago)   42m
tigera-operator          tigera-operator-94d7f7696-ff7kf                               1/1     Running     0             23h
selin@selin-csl:~$ kubectl describe node selin-csl
Name:               selin-csl
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
                    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
                    feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSR=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
                    feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
                    feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                    feature.node.kubernetes.io/cpu-cpuid.LAHF=true
                    feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
                    feature.node.kubernetes.io/cpu-cpuid.VMX=true
                    feature.node.kubernetes.io/cpu-cpuid.X87=true
                    feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
                    feature.node.kubernetes.io/cpu-cstate.enabled=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-model.family=6
                    feature.node.kubernetes.io/cpu-model.id=85
                    feature.node.kubernetes.io/cpu-model.vendor_id=Intel
                    feature.node.kubernetes.io/cpu-pstate.scaling_governor=powersave
                    feature.node.kubernetes.io/cpu-pstate.status=active
                    feature.node.kubernetes.io/cpu-pstate.turbo=true
                    feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                    feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=6.2.0-34-generic
                    feature.node.kubernetes.io/kernel-version.major=6
                    feature.node.kubernetes.io/kernel-version.minor=2
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-0300_1002.present=true
                    feature.node.kubernetes.io/pci-0300_10de.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=selin-csl
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
                    nos.nebuly.com/gpu-partitioning=mps
                    nvidia.com/cuda.driver.major=535
                    nvidia.com/cuda.driver.minor=113
                    nvidia.com/cuda.driver.rev=01
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=2
                    nvidia.com/gfd.timestamp=1698184228
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=7
                    nvidia.com/gpu.compute.minor=5
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=pre-installed
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=turing
                    nvidia.com/gpu.machine=Precision-5820-Tower
                    nvidia.com/gpu.memory=24576
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-TITAN-RTX
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=mixed
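
This is how I am checking what the node actually advertises versus what the pod asks for (plain kubectl; nvidia.com/gpu-1gb is my assumption about how nos exposes the slice):

```bash
# What GPU resources does the node advertise as allocatable?
kubectl get node selin-csl -o jsonpath='{.status.allocatable}'

# Why does the scheduler refuse the pod? The Events section at the
# end of the output shows the "Insufficient ..." reason.
kubectl describe pod mps-partitioning-example
```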

As an update, I have obtained the logs of the nebuly-nos gpu-agent, which crashes with a CrashLoopBackOff error:

selin@selin-csl:~$ kubectl logs nos-1698250532-gpu-agent-gqfdz -n nebuly-nos
Defaulted container "nos-1698250532-gpu-agent" out of: nos-1698250532-gpu-agent, kube-rbac-proxy
{"level":"info","ts":1698250893.187797,"logger":"setup","msg":"loaded config","reportingInterval":10}
{"level":"info","ts":1698250893.5845299,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1698250893.6817641,"logger":"setup","msg":"Initializing NVML client"}
/gpuagent: symbol lookup error: /gpuagent: undefined symbol: nvmlErrorString

I am assuming the nebuly-nos NVML client does not link properly against my .so files, which are under /usr/lib/x86_64-linux-gnu. Is there a way to fix this by specifying my library path?
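What I have in mind is something like the following patch to the gpu-agent DaemonSet (purely a guess on my side, not a documented fix; the container name is taken from the pod listing above):

```yaml
# Hypothetical workaround: mount the host's driver libraries into the
# gpu-agent container and point the dynamic loader at them, so NVML is
# resolved from /usr/lib/x86_64-linux-gnu instead of whatever the image
# ships with.
spec:
  template:
    spec:
      containers:
        - name: nos-1698250532-gpu-agent
          env:
            - name: LD_LIBRARY_PATH
              value: /host/lib
          volumeMounts:
            - name: host-nvidia-libs
              mountPath: /host/lib
              readOnly: true
      volumes:
        - name: host-nvidia-libs
          hostPath:
            path: /usr/lib/x86_64-linux-gnu
```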

Also, when trying the default plugin (with affinity and a label on my node), I observe 0 allocatable nvidia.com/gpu resources on my node. This is the log of the default plugin:
selin@selin-csl:~$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-x7tf6
I1025 20:56:31.907780 1 main.go:154] Starting FS watcher.
I1025 20:56:31.907824 1 main.go:161] Starting OS watcher.
I1025 20:56:31.908033 1 main.go:176] Starting Plugins.
I1025 20:56:31.908044 1 main.go:234] Loading configuration.
I1025 20:56:31.908130 1 main.go:242] Updating config with default resource matching patterns.
I1025 20:56:31.908260 1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1025 20:56:31.908267 1 main.go:256] Retreiving plugins.
W1025 20:56:31.908466 1 factory.go:31] No valid resources detected, creating a null CDI handler
I1025 20:56:31.908495 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I1025 20:56:31.908518 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E1025 20:56:31.908523 1 factory.go:115] Incompatible platform detected
E1025 20:56:31.908526 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1025 20:56:31.908528 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1025 20:56:31.908530 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1025 20:56:31.908532 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I1025 20:56:31.908536 1 main.go:287] No devices found. Waiting indefinitely.
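
The "could not load NVML library: libnvidia-ml.so.1" line makes me think the NVIDIA container runtime is not actually being used for this pod. If it helps, this is how I would (re)configure containerd, following the NVIDIA Container Toolkit docs as I understand them:

```bash
# Point containerd at the NVIDIA runtime and make it the default
# (nvidia-ctk ships with the NVIDIA Container Toolkit).
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd

# Quick check that the config now references the nvidia runtime.
grep -i nvidia /etc/containerd/config.toml
```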