Problems enabling GPU - Workaround included
hansesm opened this issue · 2 comments
Summary
The default GPU (NVIDIA) addon does not find the correct drivers, so the containers crash with:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
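For context, a quick check with standard driver tooling (exact library paths depend on the distribution) confirms the user-space library is present on the host, so only the container toolkit fails to locate it:
# confirm the host driver's user-space library is installed and registered
ldconfig -p | grep libnvidia-ml
nvidia-smi --query-gpu=driver_version --format=csv,noheader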
What Should Happen Instead?
Everything should work after enabling the GPU addon:
microk8s enable nvidia
Reproduction Steps
microk8s enable nvidia
Infer repository core for addon nvidia
Addon core/dns is already enabled
Addon core/helm3 is already enabled
WARNING: --set-as-default-runtime is deprecated, please use --gpu-operator-toolkit-version instead
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Deploy NVIDIA GPU operator
Using auto GPU driver
W1222 14:39:49.104108 1716891 warnings.go:70] unknown field "spec.daemonsets.rollingUpdate"
W1222 14:39:49.104132 1716891 warnings.go:70] unknown field "spec.daemonsets.updateStrategy"
NAME: gpu-operator
LAST DEPLOYED: Fri Dec 22 14:39:47 2023
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
Deployed NVIDIA GPU operator
microk8s kubectl get pods --namespace gpu-operator-resources
NAME                                                          READY   STATUS                  RESTARTS      AGE
gpu-operator-node-feature-discovery-worker-ldvbf              1/1     Running                 0             4m42s
gpu-operator-559f7cd69b-7cqhm                                 1/1     Running                 0             4m42s
gpu-operator-node-feature-discovery-master-5bfbc54c8d-hppfr   1/1     Running                 0             4m42s
gpu-feature-discovery-zrp99                                   0/1     Init:CrashLoopBackOff   5 (91s ago)   4m21s
nvidia-operator-validator-hxfbf                               0/1     Init:CrashLoopBackOff   5 (89s ago)   4m22s
nvidia-device-plugin-daemonset-xmvvr                          0/1     Init:CrashLoopBackOff   5 (85s ago)   4m22s
nvidia-container-toolkit-daemonset-shdrn                      0/1     Init:CrashLoopBackOff   5 (80s ago)   4m22s
nvidia-dcgm-exporter-96gmz                                    0/1     Init:CrashLoopBackOff   5 (77s ago)   4m21s
microk8s kubectl describe pod nvidia-operator-validator-hxfbf -n gpu-operator-resources
Name: nvidia-operator-validator-hxfbf
Namespace: gpu-operator-resources
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
Service Account: nvidia-operator-validator
Node: gpu01/132.176.10.80
Start Time: Fri, 22 Dec 2023 14:40:09 +0100
Labels: app=nvidia-operator-validator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=6bd5fd4488
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: b921d851a8c76ad40b2f18e285c2b61d7f7300fd471f8ac751ca401bf9a32ded
cni.projectcalico.org/podIP: 10.1.69.188/32
cni.projectcalico.org/podIPs: 10.1.69.188/32
Status: Pending
IP: 10.1.69.188
IPs:
IP: 10.1.69.188
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: containerd://2d9d39b1bbf489f5fc99c451a463935d8f63d5faddefac4305f7c849710eb7a5
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:18c9ea88ae06d479e6657b8a4126a8ee3f4300a40c16ddc29fb7ab3763d46005
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
Exit Code: 128
Started: Thu, 01 Jan 1970 01:00:00 +0100
Finished: Fri, 22 Dec 2023 14:45:44 +0100
Ready: False
Restart Count: 6
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xhcc (ro)
toolkit-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xhcc (ro)
cuda-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator-resources (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
VALIDATOR_RUNTIME_CLASS: nvidia
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xhcc (ro)
plugin-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: true
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator-resources (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
VALIDATOR_RUNTIME_CLASS: nvidia
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xhcc (ro)
Containers:
nvidia-operator-validator:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo all validations are successful; sleep infinity
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xhcc (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
kube-api-access-8xhcc:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 2m57s (x26 over 8m3s) kubelet Back-off restarting failed container driver-validation in pod nvidia-operator-validator-hxfbf_gpu-operator-resources (97c4f528-a16c-476b-a696-3c70cf6ed271)
nvidia-smi
Thu Dec 21 16:55:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:01:00.0 Off | Off |
| 30% 28C P8 7W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 On | 00000000:25:00.0 Off | Off |
| 30% 29C P8 6W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 On | 00000000:41:00.0 Off | Off |
| 30% 28C P8 8W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 On | 00000000:61:00.0 Off | Off |
| 30% 28C P8 5W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A5000 On | 00000000:81:00.0 Off | Off |
| 30% 27C P8 9W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A5000 On | 00000000:C1:00.0 Off | Off |
| 30% 27C P8 7W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A5000 On | 00000000:C4:00.0 Off | Off |
| 30% 27C P8 2W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A5000 On | 00000000:E1:00.0 Off | Off |
| 30% 27C P8 6W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
ls -la /run/nvidia/driver
total 0
drwxr-xr-x 2 root root 40 Dez 21 17:26 .
drwxr-xr-x 4 root root 80 Dez 21 17:26 ..
cat /etc/docker/daemon.json
{
  "insecure-registries": ["localhost:32000"],
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
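Note that MicroK8s runs workloads through containerd rather than Docker, so this daemon.json is not what the failing pods use; the relevant runtime wiring is in the containerd template referenced by the toolkit (see CONTAINERD_CONFIG in the ClusterPolicy below). A quick way to inspect it, as a sketch:
# show the nvidia runtime entries (if any) in the MicroK8s containerd template
grep -B 2 -A 6 nvidia /var/snap/microk8s/current/args/containerd-template.toml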
cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
[nvidia-container-runtime]
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
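Given this config, the failure above is consistent: the toolkit resolves driver libraries relative to root = "/run/nvidia/driver" and calls ldconfig from the same tree, but that directory is empty on this host (see the ls output above). A minimal sanity check along those lines:
# the driver root the toolkit expects was never populated
ls /run/nvidia/driver
# the ldconfig binary it is configured to call does not exist either
ls /run/nvidia/driver/sbin/ldconfig.real 2>/dev/null || echo "missing"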
microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy asnycio usage and limits to the final report tarball
Copy inotify max_user_instances and max_user_watches to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting dqlite
Inspect dqlite
Building the report tarball
Report tarball is at /var/snap/microk8s/6089/inspection-report-20231221_170521.tar.gz
microk8s kubectl describe clusterpolicies --all-namespaces
Name: cluster-policy
Namespace:
Labels: app.kubernetes.io/component=gpu-operator
app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: gpu-operator
meta.helm.sh/release-namespace: gpu-operator-resources
API Version: nvidia.com/v1
Kind: ClusterPolicy
Metadata:
Creation Timestamp: 2023-12-21T16:16:44Z
Generation: 1
Resource Version: 105635519
UID: e20bbaad-bdaf-4c87-86dd-b2fcc3d8f88f
Spec:
Daemonsets:
Priority Class Name: system-node-critical
Tolerations:
Effect: NoSchedule
Key: nvidia.com/gpu
Operator: Exists
Dcgm:
Enabled: false
Host Port: 5555
Image: dcgm
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia/cloud-native
Version: 3.1.3-1-ubuntu20.04
Dcgm Exporter:
Enabled: true
Env:
Name: DCGM_EXPORTER_LISTEN
Value: :9400
Name: DCGM_EXPORTER_KUBERNETES
Value: true
Name: DCGM_EXPORTER_COLLECTORS
Value: /etc/dcgm-exporter/dcp-metrics-included.csv
Image: dcgm-exporter
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia/k8s
Service Monitor:
Additional Labels:
Enabled: false
Honor Labels: false
Interval: 15s
Version: 3.1.3-3.1.2-ubuntu20.04
Device Plugin:
Enabled: true
Env:
Name: PASS_DEVICE_SPECS
Value: true
Name: FAIL_ON_INIT_ERROR
Value: true
Name: DEVICE_LIST_STRATEGY
Value: envvar
Name: DEVICE_ID_STRATEGY
Value: uuid
Name: NVIDIA_VISIBLE_DEVICES
Value: all
Name: NVIDIA_DRIVER_CAPABILITIES
Value: all
Image: k8s-device-plugin
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia
Version: v0.13.0-ubi8
Driver:
Cert Config:
Name:
Enabled: false
Image: driver
Image Pull Policy: IfNotPresent
Kernel Module Config:
Name:
Licensing Config:
Config Map Name:
Nls Enabled: false
Manager:
Env:
Name: ENABLE_GPU_POD_EVICTION
Value: true
Name: ENABLE_AUTO_DRAIN
Value: true
Name: DRAIN_USE_FORCE
Value: false
Name: DRAIN_POD_SELECTOR_LABEL
Value:
Name: DRAIN_TIMEOUT_SECONDS
Value: 0s
Name: DRAIN_DELETE_EMPTYDIR_DATA
Value: false
Image: k8s-driver-manager
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia/cloud-native
Version: v0.5.1
Rdma:
Enabled: false
Use Host Mofed: false
Repo Config:
Config Map Name:
Repository: nvcr.io/nvidia
Version: 525.60.13
Virtual Topology:
Config:
Gfd:
Enabled: true
Env:
Name: GFD_SLEEP_INTERVAL
Value: 60s
Name: GFD_FAIL_ON_INIT_ERROR
Value: true
Image: gpu-feature-discovery
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia
Version: v0.7.0-ubi8
Mig:
Strategy: single
Mig Manager:
Config:
Name:
Enabled: true
Env:
Name: WITH_REBOOT
Value: false
Gpu Clients Config:
Name:
Image: k8s-mig-manager
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia/cloud-native
Version: v0.5.0-ubuntu20.04
Node Status Exporter:
Enabled: false
Image: gpu-operator-validator
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia/cloud-native
Version: v22.9.1
Operator:
Default Runtime: containerd
Init Container:
Image: cuda
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia
Version: 11.8.0-base-ubi8
Runtime Class: nvidia
Psp:
Enabled: false
Sandbox Device Plugin:
Enabled: true
Image: kubevirt-gpu-device-plugin
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia
Version: v1.2.1
Sandbox Workloads:
Default Workload: container
Enabled: false
Toolkit:
Enabled: true
Env:
Name: CONTAINERD_CONFIG
Value: /var/snap/microk8s/current/args/containerd-template.toml
Name: CONTAINERD_SOCKET
Value: /var/snap/microk8s/common/run/containerd.sock
Name: CONTAINERD_SET_AS_DEFAULT
Value: 0
Image: container-toolkit
Image Pull Policy: IfNotPresent
Install Dir: /usr/local/nvidia
Repository: nvcr.io/nvidia/k8s
Version: v1.11.0-ubuntu20.04
Validator:
Image: gpu-operator-validator
Image Pull Policy: IfNotPresent
Plugin:
Env:
Name: WITH_WORKLOAD
Value: true
Repository: nvcr.io/nvidia/cloud-native
Version: v22.9.1
Vfio Manager:
Driver Manager:
Env:
Name: ENABLE_AUTO_DRAIN
Value: false
Image: k8s-driver-manager
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia/cloud-native
Version: v0.5.1
Enabled: true
Image: cuda
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia
Version: 11.7.1-base-ubi8
Vgpu Device Manager:
Config:
Default: default
Name:
Enabled: true
Image: vgpu-device-manager
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia/cloud-native
Version: v0.2.0
Vgpu Manager:
Driver Manager:
Env:
Name: ENABLE_AUTO_DRAIN
Value: false
Image: k8s-driver-manager
Image Pull Policy: IfNotPresent
Repository: nvcr.io/nvidia/cloud-native
Version: v0.5.1
Enabled: false
Image: vgpu-manager
Image Pull Policy: IfNotPresent
Events: <none>
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
Can you suggest a fix?
Changed the following value in
/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
from
root = "/run/nvidia/driver"
to
root = "/"
Added the following runtime entry pointing at /usr/local/nvidia/toolkit/nvidia-container-runtime:
"runtimes": {
  "nvidia": {
    "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
    "runtimeArgs": []
  }
}
Added symlink:
ln -s /sbin /run/nvidia/driver/sbin
Restarted MicroK8s:
microk8s stop
microk8s start
Then all containers start up correctly!
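In short, the whole workaround as a single shell session (a sketch of the steps above; paths as on this system, back up the files before editing):
# point the toolkit at the host driver root instead of /run/nvidia/driver
sudo sed -i 's|root = "/run/nvidia/driver"|root = "/"|' \
  /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
# satisfy the ldconfig path from config.toml, which expects /run/nvidia/driver/sbin/ldconfig.real
sudo ln -s /sbin /run/nvidia/driver/sbin
# (plus the nvidia runtime path entry described above, where applicable)
# restart MicroK8s so containerd picks up the changes
microk8s stop
microk8s start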
Best regards!
EDIT:
Found the following issue describing the same problem:
NVIDIA/gpu-operator#511
Hi @hansesm @pappacena, thanks for the extended bug report and the documented steps. How are the GPU drivers installed/built on the systems in question?
The gpu-operator will attempt to install the driver at /run/nvidia/driver if no driver is loaded already. The steps above look like an installation where the gpu-operator installed the driver, but you then switched to using the drivers from the host instead. The linked issue seems to describe the same problem.
An easier approach, which ensures that the host driver is used (if available), would be to enable the addon like this, depending on your scenario:
# make sure that host drivers are used
microk8s enable nvidia --gpu-operator-driver=host
# make sure that the operator builds and installs the nvidia drivers
microk8s enable nvidia --gpu-operator-driver=operator
Hope this helps! Can you try this on a clean system and report back? Thanks!
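After re-enabling the addon (disable it first with microk8s disable nvidia if it is already installed), a quick way to confirm the validator comes up, using the namespace and labels from the output above:
microk8s kubectl -n gpu-operator-resources get pods
microk8s kubectl -n gpu-operator-resources logs -l app=nvidia-operator-validator
# expected output: "all validations are successful"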