GPU add-on failing on 1.25
Closed this issue · 12 comments
Summary
The GPU add-on is not working as expected in Kubernetes 1.25.
Process
- Install MicroK8s 1.25: sudo snap install microk8s --classic --channel=1.25/stable
- Enable other add-ons: microk8s enable rbac hostpath-storage metallb ingress dns dashboard helm
- Enable the GPU add-on: microk8s enable gpu
After some time, run microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator. It looks like nvidia-validator is not installed.
Screenshot
1.6654307708338156e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
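The error above points at the node.k8s.io/v1beta1 API, which Kubernetes 1.25 no longer serves (RuntimeClass is only available as node.k8s.io/v1 there), so an operator still requesting the v1beta1 kind fails to reconcile. A quick way to confirm which versions a cluster serves - a sketch assuming a working MicroK8s install:

```shell
# List the API versions the cluster serves for the node.k8s.io group.
# On 1.25+ this should show only node.k8s.io/v1, which is why objects
# created against node.k8s.io/v1beta1 cannot be matched.
microk8s kubectl api-versions | grep node.k8s.io
```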
Introspection Report
Inspecting system
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting dqlite
Inspect dqlite
Can you suggest a fix?
It looks like this bug was fixed by NVIDIA/gpu-operator@6771549. Is the gpu addon picking up the changes from that PR?
Are you interested in contributing a fix?
@ktsakalozos I assume you mean microk8s addons repo update core. Even after doing that I still do not see nvidia-validator:
$ microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator
No resources found in gpu-operator-resources namespace.
$ microk8s kubectl get pod -n gpu-operator-resources
NAME                                                          READY   STATUS    RESTARTS   AGE
gpu-operator-node-feature-discovery-worker-cnfk6              1/1     Running   0          40m
gpu-operator-node-feature-discovery-master-65c9bd48c4-8pdpl   1/1     Running   0          40m
gpu-operator-b8cf946f6-tgbl7                                  1/1     Running   0          40m
Here is the output from snap info microk8s:
$ snap info microk8s
name: microk8s
summary: Kubernetes for workstations and appliances
publisher: Canonical✓
store-url: https://snapcraft.io/microk8s
contact: https://github.com/ubuntu/microk8s
license: unset
description: |
MicroK8s is a small, fast, secure, single node Kubernetes that installs on
just about any Linux box. Use it for offline development, prototyping,
testing, or use it on a VM as a small, cheap, reliable k8s for CI/CD. It's
also a great k8s for appliances - develop your IoT apps for k8s and deploy
them to MicroK8s on your boxes.
commands:
...
services:
...
snap-id: EaXqgt1lyCaxKaQCU349mlodBkDCXRcg
tracking: 1.25/stable
refresh-date: 14 days ago, at 13:17 UTC
channels:
...
installed: v1.25.2 (4055) 174MB classic
I also attempted to disable and re-enable the gpu addon:
$ sudo microk8s disable gpu
Infer repository core for addon gpu
Addon core/gpu is already disabled
$ sudo microk8s enable gpu
Infer repository core for addon gpu
Enabling NVIDIA GPU
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
NVIDIA is enabled
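The "cannot re-use a name that is still in use" failure suggests a stale Helm release that the disable step left behind. A sketch for finding and removing it - the release name and namespace below are assumptions; use whatever helm ls actually reports:

```shell
# List all Helm releases across namespaces to spot a leftover gpu-operator.
microk8s helm ls -A

# If a stale release shows up, remove it before re-enabling the addon.
# (Release name and namespace here are illustrative assumptions.)
microk8s helm uninstall gpu-operator -n gpu-operator-resources
```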
$ microk8s status
microk8s is running
high-availability: no
datastore master nodes: 127.0.0.1:19001
datastore standby nodes: none
addons:
enabled:
cert-manager # (core) Cloud native certificate management
community # (core) The community addons repository
dashboard # (core) The Kubernetes dashboard
dns # (core) CoreDNS
ha-cluster # (core) Configure high availability on the current node
helm # (core) Helm - the package manager for Kubernetes
helm3 # (core) Helm 3 - the package manager for Kubernetes
ingress # (core) Ingress controller for external access
metrics-server # (core) K8s Metrics Server for API access to service metrics
disabled:
argocd # (community) Argo CD is a declarative continuous deployment for Kubernetes.
cilium # (community) SDN, fast with full network policy
dashboard-ingress # (community) Ingress definition for Kubernetes dashboard
fluentd # (community) Elasticsearch-Fluentd-Kibana logging and monitoring
inaccel # (community) Simplifying FPGA management in Kubernetes
istio # (community) Core Istio service mesh services
jaeger # (community) Kubernetes Jaeger operator with its simple config
kata # (community) Kata Containers is a secure runtime with lightweight VMS
keda # (community) Kubernetes-based Event Driven Autoscaling
knative # (community) Knative Serverless and Event Driven Applications
linkerd # (community) Linkerd is a service mesh for Kubernetes and other frameworks
multus # (community) Multus CNI enables attaching multiple network interfaces to pods
nfs # (community) NFS Server Provisioner
openebs # (community) OpenEBS is the open-source storage solution for Kubernetes
openfaas # (community) OpenFaaS serverless framework
osm-edge # (community) osm-edge is a lightweight SMI compatible service mesh for the edge-computing.
portainer # (community) Portainer UI for your Kubernetes cluster
starboard # (community) Kubernetes-native security toolkit
traefik # (community) traefik Ingress controller for external access
gpu # (core) Automatic enablement of Nvidia CUDA
host-access # (core) Allow Pods connecting to Host services smoothly
hostpath-storage # (core) Storage class; allocates storage from host directory
kube-ovn # (core) An advanced network fabric for Kubernetes
mayastor # (core) OpenEBS MayaStor
metallb # (core) Loadbalancer for your Kubernetes cluster
observability # (core) A lightweight observability stack for logs, traces and metrics
prometheus # (core) Prometheus operator for monitoring and logging
rbac # (core) Role-Based Access Control for authorisation
registry # (core) Private image registry exposed on localhost:32000
storage # (core) Alias to hostpath-storage add-on, deprecated
My searches for how to forcefully remove an addon have not been successful...
Let me know if I can provide any additional information.
A quick workaround would be to do:
sudo sed 's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' -i /var/snap/microk8s/common/addons/core/addons.yaml
Then do microk8s disable gpu && microk8s enable gpu
Output of microk8s addons repo update core - I don't think it helped fix the issue:
kartik@ninja01:~$ microk8s addons repo update core
Updating repository core
Traceback (most recent call last):
  File "/snap/microk8s/4055/scripts/wrappers/addons.py", line 346, in <module>
    addons(prog_name="microk8s addons")
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/snap/microk8s/4055/scripts/wrappers/addons.py", line 245, in update
    [GIT, "remote", "get-url", "origin"], cwd=repo_dir, stderr=subprocess.DEVNULL
  File "/snap/microk8s/4055/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/snap/microk8s/4055/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/snap/microk8s/4055/git.wrapper', 'remote', 'get-url', 'origin']' returned non-zero exit status 128.
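The traceback ends with git remote get-url origin exiting with status 128, which is what git returns when the target directory is not a repository (or has no origin remote) - i.e. the core addons directory was not a proper git checkout, so the update command had nothing to pull from. The failure mode is easy to reproduce on any machine; the temporary directory here is just a stand-in:

```shell
# 'git remote get-url origin' fails the same way in any directory
# that is not a git repository.
tmp=$(mktemp -d)
git -C "$tmp" remote get-url origin   # prints "fatal: not a git repository ..."
echo "exit status: $?"                # non-zero, as in the traceback
rm -rf "$tmp"
```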
Hi @kartikra
This sounds like a bug. Can you try microk8s addons repo add core /snap/microk8s/current/addons/core --force and see if that fixes it?
Hello - I just got to try it recently; I was not able to test sooner. It looks like microk8s addons repo add core /snap/microk8s/current/addons/core --force works. I am able to see the gpu validations now.
@neoaggelos your sed command errored out with: sed: -e expression #1, char 32: unknown option to 's'
sed --version
sed (GNU sed) 4.8
Packaged by Debian
...
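For reference, sed's "unknown option to 's'" error usually means the s-command's delimiter character appears unescaped inside the pattern or replacement - which is why the suggested command uses ',' as the delimiter rather than the usual '/', since the pattern itself contains slashes. A self-contained illustration; the file here is a hypothetical stand-in for addons.yaml:

```shell
# Create a throwaway file containing a slash-heavy resource name.
tmp=$(mktemp)
echo "wait_for: daemonset.apps/nvidia-device-plugin-daemonset" > "$tmp"

# Using ',' as the s-command delimiter lets the slashes in the pattern
# pass unescaped; with '/' as the delimiter they would terminate the
# pattern early and trigger "unknown option to 's'".
sed -i 's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' "$tmp"

cat "$tmp"   # -> wait_for: pod/gpu-operator
rm -f "$tmp"
```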
I also attempted microk8s addons repo add core /snap/microk8s/current/addons/core --force followed by:
microk8s disable gpu && microk8s enable gpu
Infer repository core for addon gpu
Addon core/gpu is already disabled
Infer repository core for addon gpu
Enabling NVIDIA GPU
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
Error: INSTALLATION FAILED: failed to download "nvidia/gpu-operator" at version "v22.9.0"
NVIDIA is enabled
There are still no gpu resources in the microk8s kubectl describe node host report.
Let me know if I can do anything to provide any more information.
Apologies, the command should instead be
sudo sed 's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' -i /var/snap/microk8s/common/addons/core/addons.yaml
I've updated the comment as well. Can you then do:
microk8s disable gpu
# check if any instance of gpu-operator is still on the list
microk8s helm ls -A
# if any, remove it
microk8s helm uninstall $name -n $namespace
# for good measure, reboot the system
sudo reboot
# enable GPU
microk8s enable gpu
The sed command returned with no error this time.
einstine909@host:~$ sudo sed 's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' -i /var/snap/microk8s/common/addons/core/addons.yaml
einstine909@host:~$ microk8s disable gpu
Infer repository core for addon gpu
Addon core/gpu is already disabled
einstine909@host:~$ microk8s helm ls -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
Followed by a reboot.
einstine909@host:~$ microk8s enable gpu
Infer repository core for addon gpu
Enabling NVIDIA GPU
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
Error: INSTALLATION FAILED: failed to download "nvidia/gpu-operator" at version "v22.9.0"
NVIDIA is enabled
einstine909@host:~$ microk8s kubectl get pod -n gpu-operator-resources
No resources found in gpu-operator-resources namespace.
Unfortunately, it looks like it was unsuccessful.
If this is of any help, here is my GPU (listed on the NVIDIA Operator Documentation as compatible):
einstine909@host:~$ lspci
...
05:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
...
Thank you for your help!
I wonder if you also need a help repo update before running "microk8s enable gpu". Due to already having the nvidia repo (from an older point in time), perhaps it needs to be updated.
I assume you mean microk8s helm repo update
einstine909@host:~$ microk8s enable gpu
Infer repository core for addon gpu
Enabling NVIDIA GPU
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
W1124 08:50:41.781887 1548481 warnings.go:70] unknown field "spec.dcgmExporter.enabled"
W1124 08:50:41.781940 1548481 warnings.go:70] unknown field "spec.dcgmExporter.serviceMonitor"
W1124 08:50:41.781954 1548481 warnings.go:70] unknown field "spec.devicePlugin.enabled"
W1124 08:50:41.781965 1548481 warnings.go:70] unknown field "spec.driver.rollingUpdate"
W1124 08:50:41.781978 1548481 warnings.go:70] unknown field "spec.gfd.enabled"
W1124 08:50:41.781990 1548481 warnings.go:70] unknown field "spec.toolkit.installDir"
NAME: gpu-operator
LAST DEPLOYED: Thu Nov 24 08:50:36 2022
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
NVIDIA is enabled
After a bit:
Name: host
Roles: <none>
Labels:
...
nvidia.com/cuda.driver.major=515
nvidia.com/cuda.driver.minor=65
nvidia.com/cuda.driver.rev=01
nvidia.com/cuda.runtime.major=11
nvidia.com/cuda.runtime.minor=7
nvidia.com/gfd.timestamp=1669305326
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=6
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.nvsm=
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=PowerEdge-R720
nvidia.com/gpu.memory=16376
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-RTX-A4000
nvidia.com/gpu.replicas=1
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=single
...
...
Capacity:
...
nvidia.com/gpu: 1
pods: 110
Allocatable:
...
nvidia.com/gpu: 1
...
...
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
...
nvidia.com/gpu 0 0
Events: <none>
It seems to have worked! Thank you for your help.
If the gpu add-on relies on a helm repo, it might be a good idea for the add-on to update that repo to a known version before using it to install the gpu operator.
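A sketch of what that could look like inside the addon's enable script. The repository URL below is the public NVIDIA Helm repository the gpu-operator chart is normally served from, and the pinned chart version is taken from the error messages earlier in this thread - both are assumptions for illustration, not the addon's actual code:

```shell
# Re-add the repo idempotently and refresh its index before installing,
# so a stale cached index cannot make a pinned chart version undownloadable.
microk8s helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update
microk8s helm repo update

# Verify the pinned version actually exists in the refreshed index
# before attempting the install (version shown is illustrative).
microk8s helm search repo nvidia/gpu-operator --versions | grep v22.9.0
```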
Yes, indeed - sorry for the typo; I can only remember so many commands without a terminal next to me.
Great that it worked! Indeed, updating the helm repository before installing would be a good idea to prevent these sorts of issues.