Allocation mode Immediate does not work
asm582 opened this issue · 10 comments
We are trying to use allocation mode Immediate, but it does not work.
We see claims created:
Name:         gpu.example.com
Namespace:    gpu-test1
Labels:       <none>
Annotations:  <none>
API Version:  resource.k8s.io/v1alpha2
Kind:         ResourceClaimTemplate
Metadata:
  Creation Timestamp:  2023-11-16T16:34:00Z
  Resource Version:    4614
  UID:                 0d57888b-bdce-4633-8326-dc81898f6f43
Spec:
  Metadata:
    Creation Timestamp:  <nil>
  Spec:
    Allocation Mode:      Immediate
    Resource Class Name:  gpu.example.com
but claims are not generated on the node:
Name:         dra-example-driver-cluster-worker
Namespace:    dra-example-driver
Labels:       <none>
Annotations:  <none>
API Version:  nas.gpu.resource.example.com/v1alpha1
Kind:         NodeAllocationState
Metadata:
  Creation Timestamp:  2023-11-16T15:54:37Z
  Generation:          79
  Owner References:
    API Version:     v1
    Kind:            Node
    Name:            dra-example-driver-cluster-worker
    UID:             0d210c9c-da1b-4fad-afca-e7369d6a5851
  Resource Version:  15633
  UID:               1fd486d6-e754-47dc-bb4c-0392f61b3c05
Spec:
  Allocatable Devices:
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-e7b42cb1-4fd8-91b2-bc77-352a0c1f5747
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-f11773a1-5bfb-e48b-3d98-1beb5baaf08e
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-0159f35e-99ee-b2b5-74f1-9d18df3f22ac
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-657bd2e7-f5c2-a7f2-fbaa-0d1cdc32f81b
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-18db0e85-99e9-c746-8531-ffeb86328b39
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-93d37703-997c-c46f-a531-755e3e0dc2ac
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-ee3e4b55-fcda-44b8-0605-64b7a9967744
    Gpu:
      Product Name:  LATEST-GPU-MODEL
      Uuid:          GPU-9ede7e32-5825-a11b-fa3d-bab6d47e0243
Status:  Ready
Events:  <none>
The resource class is wrong: gpu.example.com
it should be gpu.nvidia.com
The resource class is wrong:
gpu.example.com
it should be gpu.nvidia.com
Thanks @klueska. As seen above, I am running an example driver with simulated GPUs. Are you saying immediate mode only works with real GPUs?
Are you referring to https://github.com/kubernetes-sigs/dra-example-driver? If so, we should migrate this issue there instead. This repository is for the NVIDIA GPU-specific DRA driver implementation.
Support is not yet merged for it in the example driver. See kubernetes-sigs/dra-example-driver#4
In any case, I got confused because (as Evan said) you opened the issue against this repo rather than the example driver repo, so I assumed you were using the NVIDIA DRA driver rather than the example one.
Sorry for the confusion, the reason I raised the issue here is that I saw this logline:
If we think immediate mode works, I can certainly move the issue to the appropriate repository. Thanks.
Hello, we tried this on real nodes and got the status below when exercising claims in Immediate mode:
[root@nvd-srv-02 k8s-dra-driver]# kubectl describe resourceclaim gpu.nvidia.com -n gpu-test1
Name:         gpu.nvidia.com
Namespace:    gpu-test1
Labels:       <none>
Annotations:  <none>
API Version:  resource.k8s.io/v1alpha2
Kind:         ResourceClaim
Metadata:
  Creation Timestamp:  2023-11-29T17:54:02Z
  Finalizers:
    gpu.resource.nvidia.com/deletion-protection
  Resource Version:  7898
  UID:               066b4c8f-a174-45eb-a1b7-9b4ad78a0f17
Spec:
  Allocation Mode:      Immediate
  Resource Class Name:  gpu.nvidia.com
Status:
Events:
  Type     Reason  Age                 From                                     Message
  ----     ------  ----                ----                                     -------
  Warning  Failed  21s (x14 over 62s)  resource driver gpu.resource.nvidia.com  allocate: TODO: immediate allocations not yet supported
Could you please share what we are missing?
You aren't missing anything:
allocate: TODO: immediate allocations not yet supported
We haven't added support for immediate mode yet
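For reference, the claim that triggers this message would correspond to a manifest roughly like the one below. This is only a sketch reconstructed from the `kubectl describe` output above, assuming the `resource.k8s.io/v1alpha2` field names `allocationMode` and `resourceClassName`; it is not the reporter's actual file.

```yaml
# Sketch of the ResourceClaim described above (names taken from the describe output).
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: gpu.nvidia.com
  namespace: gpu-test1
spec:
  # Immediate is the mode the driver rejects with
  # "TODO: immediate allocations not yet supported".
  # The other value defined by the v1alpha2 API is WaitForFirstConsumer,
  # which is also the API default when the field is omitted.
  allocationMode: Immediate
  resourceClassName: gpu.nvidia.com
```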
Thanks. Do we know when immediate mode will be supported in NVIDIA's DRA driver implementation?
Ping! Can we request a roadmap of the features planned for NVIDIA's DRA implementation? For our use case, we see Immediate allocation mode as an important feature.
There is no concrete roadmap at the moment. Rapid development on this driver has been paused due to the issues that have come up with getting DRA promoted to beta upstream. All efforts have been shifted to ensuring this happens in as timely a manner as possible. We will, of course, continue to develop this driver, but it is more important to ensure that DRA happens at all than to keep adding features here.