GPU device plugin deployment issue (non default namespace)

Question

pawel-gacek opened this issue 3 months ago · 4 comments

Describe the bug
The GPU device plugin does not work properly when it is installed in a namespace other than default. When it is deployed with kustomize, the ServiceAccount subject of the ClusterRoleBinding resource always has its namespace set to "default", regardless of the namespace configured for the GPU device plugin deployment:
https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/deployments/gpu_plugin/overlays/fractional_resources/gpu-manager-rolebinding.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager-rolebinding
subjects:
- kind: ServiceAccount
  name: gpu-manager-sa
  namespace: default  # <-- here
roleRef:
  kind: ClusterRole
  name: gpu-manager-role
  apiGroup: rbac.authorization.k8s.io

To Reproduce
Install the GPU device plugin in a non-default namespace using kustomize.

Expected behavior
For the ClusterRoleBinding resource (gpu-manager-rolebinding), the ServiceAccount subject's namespace is set to the namespace the plugin is deployed in.

System (please complete the following information):

OS version: Ubuntu 22.04
Kernel version: Linux 5.15
Device plugins version: v0.30.0
Hardware info:
Xeon 8360Y,
System Information
Manufacturer: Intel Corporation
Product Name: M50CYP2SBSTD
Version: M50CYP2UR208

Thank you
Pawel

Answer 1 · 2024-09-25T08:18:00.000Z

Hi @pawel-gacek, yep, you are correct. This is a limitation of the deployment. We can't change the namespace name within the yaml file. The namespace is handled properly in our operator-based deployment, though.
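
For kustomize-based deployments, one possible workaround is to patch the binding's subject from a local overlay. The following is a minimal sketch, not a confirmed fix from the project: the namespace name my-gpu-namespace is a hypothetical placeholder, the remote resource URL form and the ref are assumptions based on the overlay path and version mentioned in this issue, and it has not been tested against v0.30.0.

```yaml
# kustomization.yaml (local overlay; names marked "hypothetical" are assumptions)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Hypothetical target namespace for the GPU device plugin.
namespace: my-gpu-namespace

resources:
  # Assumed remote base; adjust the ref to the release you deploy.
  - https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin/overlays/fractional_resources?ref=v0.30.0

patches:
  # JSON 6902 patch that rewrites the hard-coded "default" subject namespace.
  - target:
      group: rbac.authorization.k8s.io
      version: v1
      kind: ClusterRoleBinding
      name: gpu-manager-rolebinding
    patch: |-
      - op: replace
        path: /subjects/0/namespace
        value: my-gpu-namespace
```

Rendering the overlay with kubectl kustomize before applying it should show whether the gpu-manager-sa subject now carries the intended namespace.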

Answer 2 · 2024-09-25T09:07:49.000Z

Hi @tkatila, got it, thanks. I raised this because it can cause issues in plugin operation even though the deployment itself works fine. It would be good to document this limitation somewhere, as I believe kustomization-based deployments are still in use. In our case, we simply did not notice that the GPU plugin was not working properly until we saw a GPU resource allocation failure for one of our workloads.

Answer 3 · 2024-09-25T09:59:49.000Z

Sure. I'll add a note about it to the advanced deployments docs.

Off-topic: fractional resources is a sort of niche use case, how are you using it?

Answer 4 · 2024-09-25T11:12:33.000Z

We use the GPU Aware Scheduler extender, which requires fractional resources to be enabled in the GPU device plugin.