
gpu add on failing on 1.25

The GPU addon is not working as expected in Kubernetes 1.25.


  1. Install microk8s 1.25: sudo snap install microk8s --classic --channel=1.25/stable
  2. Enable other plugins: microk8s enable rbac hostpath-storage metallb ingress dns dashboard helm
  3. Enable gpu: microk8s enable gpu

After some time, run microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator. It looks like nvidia-validator is not installed.


1.6654307708338156e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "no matches for kind "RuntimeClass" in version """}
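
To see which RuntimeClass API versions the cluster actually serves, one option is to filter the output of kubectl api-versions for the node.k8s.io group, which is the group RuntimeClass belongs to. This is only a sketch and not part of the original report: the function name is hypothetical, and the log line above does not show which version the operator requested, so whether a missing version is the cause remains an assumption.

```shell
# Hypothetical helper: read `kubectl api-versions` output on stdin and keep
# only the node.k8s.io entries (the API group that contains RuntimeClass).
node_api_versions() {
  grep '^node\.k8s\.io/'
}

# Usage on a live cluster (not run here):
#   microk8s kubectl api-versions | node_api_versions
```

If the version the operator asks for is absent from that list, the "no matches for kind" error would be the expected result.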


Can you suggest a fix?

Looks like this bug was fixed as per this PR NVIDIA/gpu-operator@6771549. Is the gpu addon picking up the changes from this PR?

Are you interested in contributing with a fix?

Hi @kartikra I believe we have the fix for this already released on the latest 1.25 snap revision. Could you do a microk8s addons tepo update core and see if the problem is fixed? Could you also share the revision of the snap you have installed (with snap info microk8s)? Thank you


@ktsakalozos I assume you mean microk8s addons repo update core. Even after doing that I still do not see nvidia-validator:

$ microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator
No resources found in gpu-operator-resources namespace.

$ microk8s kubectl get pod -n gpu-operator-resources
NAME                                                          READY   STATUS    RESTARTS   AGE
gpu-operator-node-feature-discovery-worker-cnfk6              1/1     Running   0          40m
gpu-operator-node-feature-discovery-master-65c9bd48c4-8pdpl   1/1     Running   0          40m
gpu-operator-b8cf946f6-tgbl7                                  1/1     Running   0          40m

Here is the output from snap info microk8s:

$ snap info microk8s
name:      microk8s
summary:   Kubernetes for workstations and appliances
publisher: Canonical✓
license:   unset
description: |
  MicroK8s is a small, fast, secure, single node Kubernetes that installs on
  just about any Linux box. Use it for offline development, prototyping,
  testing, or use it on a VM as a small, cheap, reliable k8s for CI/CD. It's
  also a great k8s for appliances - develop your IoT apps for k8s and deploy
  them to MicroK8s on your boxes.
snap-id:      EaXqgt1lyCaxKaQCU349mlodBkDCXRcg
tracking:     1.25/stable
refresh-date: 14 days ago, at 13:17 UTC
installed:               v1.25.2                    (4055) 174MB classic

I also attempted to disable and re-enable the gpu addon:

$ sudo microk8s disable gpu
Infer repository core for addon gpu
Addon core/gpu is already disabled

$ sudo microk8s enable gpu
Infer repository core for addon gpu
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
NVIDIA is enabled

$ microk8s status
microk8s is running
high-availability: no
  datastore master nodes:
  datastore standby nodes: none

My search for how to forcefully remove an addon has not been successful.

Let me know if I can provide any additional information.

A quick workaround would be to do:

sudo sed  's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' -i /var/snap/microk8s/common/addons/core/addons.yaml

Then do microk8s disable gpu && microk8s enable gpu

Here is the output of microk8s addons repo update core. I don't think it helped fix the issue:

kartik@ninja01:~$ microk8s addons repo update core
Updating repository core
Traceback (most recent call last):
  File "/snap/microk8s/4055/scripts/wrappers/", line 346, in <module>
    addons(prog_name="microk8s addons")
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/", line 697, in main
    rv = self.invoke(ctx)
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/", line 535, in invoke
    return callback(*args, **kwargs)
  File "/snap/microk8s/4055/scripts/wrappers/", line 245, in update
    [GIT, "remote", "get-url", "origin"], cwd=repo_dir, stderr=subprocess.DEVNULL
  File "/snap/microk8s/4055/usr/lib/python3.6/", line 356, in check_output
  File "/snap/microk8s/4055/usr/lib/python3.6/", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/snap/microk8s/4055/git.wrapper', 'remote', 'get-url', 'origin']' returned non-zero exit status 128.
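
The traceback ends with git remote get-url origin returning exit status 128, which is what git reports when the command cannot be satisfied, for example when the directory is not a git checkout or has no remote named origin. A minimal sketch for checking this by hand follows. The helper name is hypothetical, it assumes a git binary on PATH rather than the snap's git.wrapper, and the directory in the usage line is an assumption taken from the sed command earlier in the thread.

```shell
# Hypothetical helper: succeed only if the directory given as $1 is a git
# checkout that has a remote named "origin".
has_origin_remote() {
  git -C "$1" remote get-url origin >/dev/null 2>&1
}

# Usage (not run here; the path is an assumption):
#   has_origin_remote /var/snap/microk8s/common/addons/core || echo "no origin remote"
```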

Hi @kartikra

This sounds like a bug. Can you try to see if microk8s addons repo add core /snap/microk8s/current/addons/core --force does it?

Hello, I only got to try it recently and was not able to test sooner. It looks like microk8s addons repo add core /snap/microk8s/current/addons/core --force works. I am able to see the gpu validations now.

@neoaggelos your sed command kicked out: sed: -e expression #1, char 32: unknown option to 's'

sed --version
sed (GNU sed) 4.8
Packaged by Debian

I also attempted microk8s addons repo add core /snap/microk8s/current/addons/core --force followed by:

microk8s disable gpu && microk8s enable gpu
Infer repository core for addon gpu
Addon core/gpu is already disabled
Infer repository core for addon gpu
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
Error: INSTALLATION FAILED: failed to download "nvidia/gpu-operator" at version "v22.9.0"
NVIDIA is enabled

There are still no gpu resources in the microk8s kubectl describe node host report.

Let me know if I can do anything to provide any more information.

Apologies, the command should instead be

sudo sed  's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' -i /var/snap/microk8s/common/addons/core/addons.yaml

I've updated the comment as well. Can you then do:

microk8s disable gpu

# check if any instance of gpu-operator is still on the list
microk8s helm ls -A
# if any, remove it
microk8s helm uninstall $name -n $namespace

# for good measure, reboot the system
sudo reboot

# enable GPU
microk8s enable gpu
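
The "check, and if any, remove" step above can be sketched as a small filter over the helm ls -A output. This is an illustration only: the function name is hypothetical, it assumes NAME and NAMESPACE are the first two columns of that output, and it prints the uninstall commands for review instead of running them.

```shell
# Hypothetical helper: read `helm ls -A` output on stdin and print an
# uninstall command for every release whose name contains "gpu-operator".
# Nothing is removed; the commands are only printed.
print_gpu_operator_uninstalls() {
  awk 'NR > 1 && $1 ~ /gpu-operator/ {
    printf "microk8s helm uninstall %s -n %s\n", $1, $2
  }'
}

# Usage on a live cluster (not run here):
#   microk8s helm ls -A | print_gpu_operator_uninstalls
```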

The sed command returned with no error this time.

einstine909@host:~$ sudo sed  's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' -i /var/snap/microk8s/common/addons/core/addons.yaml
einstine909@host:~$ microk8s disable gpu
Infer repository core for addon gpu
Addon core/gpu is already disabled
einstine909@host:~$ microk8s helm ls -A

Followed by a reboot.

einstine909@host:~$ microk8s enable gpu
Infer repository core for addon gpu
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
Error: INSTALLATION FAILED: failed to download "nvidia/gpu-operator" at version "v22.9.0"
NVIDIA is enabled

einstine909@host:~$ microk8s kubectl get pod -n gpu-operator-resources
No resources found in gpu-operator-resources namespace.

Unfortunately, it looks like it was unsuccessful.

If this is of any help, here is my GPU (listed on the NVIDIA Operator Documentation as compatible):

einstine909@host:~$ lspci 
05:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)

Thank you for your help!

I wonder if you also need a

help repo update

before running "microk8s enable gpu"

Due to already having the nvidia repo (from an older point in time), perhaps it needs to be updated.

I assume you mean microk8s helm repo update

einstine909@host:~$ microk8s enable gpu
Infer repository core for addon gpu
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
W1124 08:50:41.781887 1548481 warnings.go:70] unknown field "spec.dcgmExporter.enabled"
W1124 08:50:41.781940 1548481 warnings.go:70] unknown field "spec.dcgmExporter.serviceMonitor"
W1124 08:50:41.781954 1548481 warnings.go:70] unknown field "spec.devicePlugin.enabled"
W1124 08:50:41.781965 1548481 warnings.go:70] unknown field "spec.driver.rollingUpdate"
W1124 08:50:41.781978 1548481 warnings.go:70] unknown field "spec.gfd.enabled"
W1124 08:50:41.781990 1548481 warnings.go:70] unknown field "spec.toolkit.installDir"
NAME: gpu-operator
LAST DEPLOYED: Thu Nov 24 08:50:36 2022
NAMESPACE: gpu-operator-resources
STATUS: deployed
NVIDIA is enabled

After a bit:

Name:               host
Roles:              <none>
  ...     1
  pods:               110
  ...     1
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
...     0           0
Events:              <none>

It seems to have worked! Thank you for your help.
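
For anyone hitting the same "failed to download" error, one way to check whether a stale local index is the cause is to ask helm which chart versions it currently knows about. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it assumes the chart version is the second column of helm search repo output.

```shell
# Hypothetical helper: read `helm search repo CHART --versions` output on
# stdin and succeed only if the wanted chart version ($1) is listed.
index_has_version() {
  awk -v want="$1" 'NR > 1 && $2 == want { found = 1 } END { exit !found }'
}

# Usage on a live cluster (not run here):
#   microk8s helm search repo nvidia/gpu-operator --versions | index_has_version v22.9.0 \
#     || echo "v22.9.0 is not in the local index; try: microk8s helm repo update"
```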

If the gpu add-on relies on a helm repo, it might be a good idea for the add-on to update that repo to a known version before using it to install the gpu operator.

Yes, indeed, sorry for the typo. I can only remember so many commands without a terminal next to me.

Great that it worked. Indeed, updating the helm repository before installing would be a good idea to prevent this sort of issue.
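
Until the addon does this itself, the same idea can be applied from the user's side. The following is a minimal sketch, not the addon's actual enable logic: the function name and the RUN variable are hypothetical, and no chart version is pinned here.

```shell
# Sketch only. RUN is a hypothetical knob: with RUN=echo each command is
# printed instead of executed, which gives a dry run.
refresh_then_enable_gpu() {
  ${RUN:-} microk8s helm repo update &&
  ${RUN:-} microk8s enable gpu
}

# Usage (not run here):
#   RUN=echo; refresh_then_enable_gpu    # dry run, prints the two commands
#   RUN=; refresh_then_enable_gpu        # runs them for real
```

Refreshing first means the enable step resolves the chart against a current index rather than one cached when the nvidia repo was originally added.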