canonical/microk8s-core-addons

gpu add on failing on 1.25

Closed this issue · 12 comments

Summary

The GPU add-on is not working as expected on Kubernetes 1.25.

Process


  1. Install microk8s 1.25 sudo snap install microk8s --classic --channel=1.25/stable
  2. Enable other plugins microk8s enable rbac hostpath-storage metallb ingress dns dashboard helm
  3. Enable gpu microk8s enable gpu

After some time, run microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator. It looks like nvidia-validator is not installed.

Screenshot

1.6654307708338156e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}


Introspection Report

Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite

Can you suggest a fix?

It looks like this bug was fixed by NVIDIA/gpu-operator@6771549. Is the gpu addon picking up the changes from this PR?
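For context on the error above: RuntimeClass was served from node.k8s.io/v1beta1 until that version was removed in Kubernetes 1.25; the GA node.k8s.io/v1 API (available since 1.20) has to be used instead. A RuntimeClass in the GA form looks roughly like this (the handler value is what the GPU operator typically configures for containerd, but treat the exact fields as illustrative):

```yaml
# RuntimeClass via the GA API; the v1beta1 version was removed in Kubernetes 1.25
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # containerd runtime handler name (illustrative)
```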

Are you interested in contributing with a fix?

Hi @kartikra I believe we have the fix for this already released on the latest 1.25 snap revision. Could you do a microk8s addons tepo update core and see if the problem is fixed? Could you also share the revision of the snap you have installed (with snap info microk8s)? Thank you


@ktsakalozos I assume you mean microk8s addons repo update core. Even after doing that I still do not see nvidia-validator:

$ microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator
No resources found in gpu-operator-resources namespace.

$ microk8s kubectl get pod -n gpu-operator-resources
NAME                                                          READY   STATUS    RESTARTS   AGE
gpu-operator-node-feature-discovery-worker-cnfk6              1/1     Running   0          40m
gpu-operator-node-feature-discovery-master-65c9bd48c4-8pdpl   1/1     Running   0          40m
gpu-operator-b8cf946f6-tgbl7                                  1/1     Running   0          40m

Here is the output from snap info microk8s:

$ snap info microk8s
name:      microk8s
summary:   Kubernetes for workstations and appliances
publisher: Canonical✓
store-url: https://snapcraft.io/microk8s
contact:   https://github.com/ubuntu/microk8s
license:   unset
description: |
  MicroK8s is a small, fast, secure, single node Kubernetes that installs on
  just about any Linux box. Use it for offline development, prototyping,
  testing, or use it on a VM as a small, cheap, reliable k8s for CI/CD. It's
  also a great k8s for appliances - develop your IoT apps for k8s and deploy
  them to MicroK8s on your boxes.
commands:
...
services:
...
snap-id:      EaXqgt1lyCaxKaQCU349mlodBkDCXRcg
tracking:     1.25/stable
refresh-date: 14 days ago, at 13:17 UTC
channels:
...                                
installed:               v1.25.2                    (4055) 174MB classic

I also attempted to disable and re-enable the gpu addon:

$ sudo microk8s disable gpu
Infer repository core for addon gpu
Addon core/gpu is already disabled

$ sudo microk8s enable gpu
Infer repository core for addon gpu
Enabling NVIDIA GPU
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
NVIDIA is enabled

$ microk8s status
microk8s is running
high-availability: no
  datastore master nodes: 127.0.0.1:19001
  datastore standby nodes: none
addons:
  enabled:
    cert-manager         # (core) Cloud native certificate management
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    ingress              # (core) Ingress controller for external access
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
  disabled:
    argocd               # (community) Argo CD is a declarative continuous deployment for Kubernetes.
    cilium               # (community) SDN, fast with full network policy
    dashboard-ingress    # (community) Ingress definition for Kubernetes dashboard
    fluentd              # (community) Elasticsearch-Fluentd-Kibana logging and monitoring
    inaccel              # (community) Simplifying FPGA management in Kubernetes
    istio                # (community) Core Istio service mesh services
    jaeger               # (community) Kubernetes Jaeger operator with its simple config
    kata                 # (community) Kata Containers is a secure runtime with lightweight VMS
    keda                 # (community) Kubernetes-based Event Driven Autoscaling
    knative              # (community) Knative Serverless and Event Driven Applications
    linkerd              # (community) Linkerd is a service mesh for Kubernetes and other frameworks
    multus               # (community) Multus CNI enables attaching multiple network interfaces to pods
    nfs                  # (community) NFS Server Provisioner
    openebs              # (community) OpenEBS is the open-source storage solution for Kubernetes
    openfaas             # (community) OpenFaaS serverless framework
    osm-edge             # (community) osm-edge is a lightweight SMI compatible service mesh for the edge-computing.
    portainer            # (community) Portainer UI for your Kubernetes cluster
    starboard            # (community) Kubernetes-native security toolkit
    traefik              # (community) traefik Ingress controller for external access
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    host-access          # (core) Allow Pods connecting to Host services smoothly
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    prometheus           # (core) Prometheus operator for monitoring and logging
    rbac                 # (core) Role-Based Access Control for authorisation
    registry             # (core) Private image registry exposed on localhost:32000
    storage              # (core) Alias to hostpath-storage add-on, deprecated

My searches for how to forcefully remove an addon have not been successful...

Let me know if I can provide any additional information.

A quick workaround would be to do:

sudo sed  's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' -i /var/snap/microk8s/common/addons/core/addons.yaml

Then do microk8s disable gpu && microk8s enable gpu
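To illustrate why the workaround uses commas as the s command delimiter (both the pattern and the replacement contain slashes), here is a self-contained sketch against a scratch file; the check_status field name is made up for the demo and is not the real addons.yaml structure:

```shell
# Scratch file standing in for addons.yaml (the field name is hypothetical)
cat > /tmp/addons-demo.yaml <<'EOF'
check_status: daemonset.apps/nvidia-device-plugin-daemonset
EOF

# Commas as the s,,, delimiter avoid escaping the slashes in both patterns
sed 's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' -i /tmp/addons-demo.yaml

cat /tmp/addons-demo.yaml    # now reads: check_status: pod/gpu-operator
```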

Here is the output of microk8s addons repo update core. I don't think it helped fix the issue:

kartik@ninja01:~$ microk8s addons repo update core
Updating repository core
Traceback (most recent call last):
  File "/snap/microk8s/4055/scripts/wrappers/addons.py", line 346, in <module>
    addons(prog_name="microk8s addons")
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/snap/microk8s/4055/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/snap/microk8s/4055/scripts/wrappers/addons.py", line 245, in update
    [GIT, "remote", "get-url", "origin"], cwd=repo_dir, stderr=subprocess.DEVNULL
  File "/snap/microk8s/4055/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/snap/microk8s/4055/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/snap/microk8s/4055/git.wrapper', 'remote', 'get-url', 'origin']' returned non-zero exit status 128.
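As an aside on the traceback above: exit status 128 is git's generic fatal-error status, which here suggests the addons directory that git was pointed at is not a git checkout with an origin remote. The status is easy to reproduce outside microk8s:

```shell
# An empty temporary directory - not a git repository, no 'origin' remote
tmp=$(mktemp -d)

# 'git remote get-url origin' fails fatally here, just like in the traceback;
# git signals fatal errors with exit status 128
git -C "$tmp" remote get-url origin
echo "exit status: $?"
```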

Hi @kartikra

This sounds like a bug. Can you try to see if microk8s addons repo add core /snap/microk8s/current/addons/core --force does it?

Hello - I only got to try it recently; I was not able to test sooner. It looks like microk8s addons repo add core /snap/microk8s/current/addons/core --force works. I am able to see the GPU validations now.

@neoaggelos your sed command kicked out: sed: -e expression #1, char 32: unknown option to 's'

sed --version
sed (GNU sed) 4.8
Packaged by Debian
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Jay Fenlason, Tom Lord, Ken Pizzini,
Paolo Bonzini, Jim Meyering, and Assaf Gordon.

This sed program was built with SELinux support.
SELinux is disabled on this system.

GNU sed home page: <https://www.gnu.org/software/sed/>.
General help using GNU software: <https://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.

I also attempted microk8s addons repo add core /snap/microk8s/current/addons/core --force followed by

microk8s disable gpu && microk8s enable gpu
Infer repository core for addon gpu
Addon core/gpu is already disabled
Infer repository core for addon gpu
Enabling NVIDIA GPU
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
Error: INSTALLATION FAILED: failed to download "nvidia/gpu-operator" at version "v22.9.0"
NVIDIA is enabled

There are still no gpu resources in the microk8s kubectl describe node host report.

Let me know if I can do anything to provide any more information.

Apologies, the command should instead be

sudo sed  's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' -i /var/snap/microk8s/common/addons/core/addons.yaml

I've updated the comment as well. Can you then do:

microk8s disable gpu

# check if any instance of gpu-operator is still on the list
microk8s helm ls -A
# if any, remove it
microk8s helm uninstall $name -n $namespace

# for good measure, reboot the system
sudo reboot

# enable GPU
microk8s enable gpu
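In case it helps with the "remove it" step, the release name and namespace columns can be pulled out of the helm ls table like this (a sketch against sample output; the real columns and spacing may differ):

```shell
# Illustrative 'helm ls -A' table output; real output may differ
sample='NAME          NAMESPACE               REVISION  STATUS    CHART
gpu-operator  gpu-operator-resources  1         deployed  gpu-operator-v22.9.0'

# Skip the header row and print "name namespace" pairs, ready to feed into
# 'microk8s helm uninstall $name -n $namespace'
echo "$sample" | awk 'NR > 1 { print $1, $2 }'
```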

The sed command returned with no error this time.

einstine909@host:~$ sudo sed  's,daemonset.apps/nvidia-device-plugin-daemonset,pod/gpu-operator,' -i /var/snap/microk8s/common/addons/core/addons.yaml
einstine909@host:~$ microk8s disable gpu
Infer repository core for addon gpu
Addon core/gpu is already disabled
einstine909@host:~$ microk8s helm ls -A
NAME	NAMESPACE	REVISION	UPDATED	STATUS	CHART	APP VERSION

Followed by a reboot.

einstine909@host:~$ microk8s enable gpu
Infer repository core for addon gpu
Enabling NVIDIA GPU
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
Error: INSTALLATION FAILED: failed to download "nvidia/gpu-operator" at version "v22.9.0"
NVIDIA is enabled

einstine909@host:~$ microk8s kubectl get pod -n gpu-operator-resources
No resources found in gpu-operator-resources namespace.

Unfortunately, it looks like it was unsuccessful.

If this is of any help, here is my GPU (listed on the NVIDIA Operator Documentation as compatible):

einstine909@host:~$ lspci 
...
05:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
...

Thank you for your help!

I wonder if you also need a

help repo update

before running "microk8s enable gpu"

Due to already having the nvidia repo (from an older point in time), perhaps it needs to be updated.

I assume you mean microk8s helm repo update

einstine909@host:~$ microk8s enable gpu
Infer repository core for addon gpu
Enabling NVIDIA GPU
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
"nvidia" already exists with the same configuration, skipping
W1124 08:50:41.781887 1548481 warnings.go:70] unknown field "spec.dcgmExporter.enabled"
W1124 08:50:41.781940 1548481 warnings.go:70] unknown field "spec.dcgmExporter.serviceMonitor"
W1124 08:50:41.781954 1548481 warnings.go:70] unknown field "spec.devicePlugin.enabled"
W1124 08:50:41.781965 1548481 warnings.go:70] unknown field "spec.driver.rollingUpdate"
W1124 08:50:41.781978 1548481 warnings.go:70] unknown field "spec.gfd.enabled"
W1124 08:50:41.781990 1548481 warnings.go:70] unknown field "spec.toolkit.installDir"
NAME: gpu-operator
LAST DEPLOYED: Thu Nov 24 08:50:36 2022
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
NVIDIA is enabled

After a bit:

Name:               host
Roles:              <none>
Labels: 
                    ...
                    nvidia.com/cuda.driver.major=515
                    nvidia.com/cuda.driver.minor=65
                    nvidia.com/cuda.driver.rev=01
                    nvidia.com/cuda.runtime.major=11
                    nvidia.com/cuda.runtime.minor=7
                    nvidia.com/gfd.timestamp=1669305326
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=6
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=PowerEdge-R720
                    nvidia.com/gpu.memory=16376
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-RTX-A4000
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
                    ...
...
Capacity:
  ...
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  ...
  nvidia.com/gpu:     1
  ...
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
...
  nvidia.com/gpu     0           0
Events:              <none>

It seems to have worked! Thank you for your help.

If the gpu add-on relies on a helm repo, it might be a good idea for the add-on to update that repo to a known version before attempting to use it to install the GPU operator.

Yes, indeed, sorry for the typo, I can only remember so many commands without a terminal next to me.

Great that it worked. Indeed, updating the helm repository before installing would be a good idea to prevent these sorts of issues.