AliyunContainerService/gpushare-device-plugin

The plugin can detect the number of GPUs, but it cannot report the GPU memory, so shared GPU scheduling does not work.


Capacity:
aliyun.com/gpu-count: 8
aliyun.com/gpu-mem: 0
GPU model: Tesla V100

The device plugin logs are as follows:
[root@localhost ~]# kubectl logs -f -n kube-system gpushare-device-plugin-ds-qjltc
I1012 05:08:46.374978 1 main.go:18] Start gpushare device plugin
I1012 05:08:46.375045 1 gpumanager.go:28] Loading NVML
I1012 05:08:46.379478 1 gpumanager.go:37] Fetching devices.
I1012 05:08:46.379497 1 gpumanager.go:43] Starting FS watcher.
I1012 05:08:46.379930 1 gpumanager.go:51] Starting OS watcher.
I1012 05:08:46.389438 1 nvidia.go:64] Deivce GPU-60805828-8ab0-6124-67c4-9baff56d087b's Path is /dev/nvidia0
I1012 05:08:46.389549 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.389564 1 nvidia.go:40] set gpu memory: 32510
I1012 05:08:46.389577 1 nvidia.go:76] # Add first device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--0
I1012 05:08:46.453844 1 nvidia.go:79] # Add last device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--32509
I1012 05:08:46.461774 1 nvidia.go:64] Deivce GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01's Path is /dev/nvidia1
I1012 05:08:46.461816 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.461827 1 nvidia.go:76] # Add first device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--0
I1012 05:08:46.559867 1 nvidia.go:79] # Add last device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--32509
I1012 05:08:46.567541 1 nvidia.go:64] Deivce GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a's Path is /dev/nvidia2
I1012 05:08:46.567574 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.567583 1 nvidia.go:76] # Add first device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--0
I1012 05:08:46.658328 1 nvidia.go:79] # Add last device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--32509
I1012 05:08:46.666367 1 nvidia.go:64] Deivce GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5's Path is /dev/nvidia3
I1012 05:08:46.666393 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.666399 1 nvidia.go:76] # Add first device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--0
I1012 05:08:46.676851 1 nvidia.go:79] # Add last device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--32509
I1012 05:08:46.683786 1 nvidia.go:64] Deivce GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991's Path is /dev/nvidia4
I1012 05:08:46.683802 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.683809 1 nvidia.go:76] # Add first device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--0
I1012 05:08:46.948055 1 nvidia.go:79] # Add last device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--32509
I1012 05:08:46.956435 1 nvidia.go:64] Deivce GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf's Path is /dev/nvidia5
I1012 05:08:46.956486 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.956504 1 nvidia.go:76] # Add first device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--0
I1012 05:08:46.972438 1 nvidia.go:79] # Add last device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--32509
I1012 05:08:46.980775 1 nvidia.go:64] Deivce GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415's Path is /dev/nvidia6
I1012 05:08:46.980797 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.980805 1 nvidia.go:76] # Add first device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--0
I1012 05:08:46.990545 1 nvidia.go:79] # Add last device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--32509
I1012 05:08:46.997877 1 nvidia.go:64] Deivce GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2's Path is /dev/nvidia7
I1012 05:08:46.997891 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.997895 1 nvidia.go:76] # Add first device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--0
I1012 05:08:47.249585 1 nvidia.go:79] # Add last device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--32509
I1012 05:08:47.249606 1 server.go:43] Device Map: map[GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415:6 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2:7 GPU-60805828-8ab0-6124-67c4-9baff56d087b:0 GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01:1 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a:2 GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5:3 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991:4 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf:5]
I1012 05:08:47.249644 1 server.go:44] Device List: [GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2 GPU-60805828-8ab0-6124-67c4-9baff56d087b GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a]
I1012 05:08:47.265532 1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I1012 05:08:47.266863 1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I1012 05:08:47.267431 1 server.go:230] Registered device plugin with Kubelet
Has anyone run into this before? Kubernetes 1.16.3, nvidia-runtime 1.1-dev.

Are you reporting the GPU memory resource in MiB? By default the plugin reports memory in GiB. Your node has roughly 32 GB of GPU memory, so reporting in MiB produces 32509 device IDs registered with the kubelet, which breaks the gRPC communication between the device plugin and the kubelet. Try reporting in GiB instead.
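A minimal sketch of the effect described above, assuming the plugin fabricates one device ID per unit of reported memory (consistent with the UUID--0 through UUID--32509 pattern in the logs); the function below is illustrative only, not the plugin's actual source. If your daemonset passes a memory-unit argument to the plugin binary, verify it is set to GiB rather than MiB; the exact flag name may differ between versions, so check your deployment YAML.

package main

import "fmt"

// fakeDeviceIDs is a simplified illustration of how one fake device ID is
// generated per unit of reported GPU memory, mirroring the "UUID--index"
// pattern in the logs above. It is not the plugin's actual code.
func fakeDeviceIDs(uuid string, memoryUnits uint) []string {
	ids := make([]string, 0, memoryUnits)
	for i := uint(0); i < memoryUnits; i++ {
		ids = append(ids, fmt.Sprintf("%s--%d", uuid, i))
	}
	return ids
}

func main() {
	// 32510 MiB per card (as logged) vs. the same card reported in GiB.
	inMiB := fakeDeviceIDs("GPU-60805828-8ab0-6124-67c4-9baff56d087b", 32510)
	inGiB := fakeDeviceIDs("GPU-60805828-8ab0-6124-67c4-9baff56d087b", 32)
	// Tens of thousands of IDs per GPU inflate the device list sent to the
	// kubelet over gRPC; a few dozen per GPU stay well within its limits.
	fmt.Printf("MiB reporting: %d device IDs, GiB reporting: %d device IDs\n", len(inMiB), len(inGiB))
}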

Hi, I also hit this issue. The capacity of aliyun.com/gpu-mem is 46, but the allocatable aliyun.com/gpu-mem is 0. When I run kubectl inspect gpushare, no node is displayed. How can I solve this problem?
I also ran journalctl -xefu kubelet on the node; the output is:

11月 30 17:12:17 slave6 kubelet[1483]: E1130 17:12:17.560765 1483 endpoint.go:62] Can't create new endpoint with path /var/lib/kubelet/device-plugins/gpushare.sock err failed to dial device plugin: context deadline exceeded
11月 30 17:12:17 slave6 kubelet[1483]: E1130 17:12:17.560805 1483 manager.go:485] Failed to dial device plugin with request &RegisterRequest{Version:v1beta1,Endpoint:gpushare.sock,ResourceName:gpushare/gpu-mem,Options:nil,}: failed to dial device plugin: context deadline exceeded
11月 30 17:12:17 slave6 kubelet[1483]: I1130 17:12:17.607164 1483 manager.go:411] Got registration request from device plugin with resource name "gpushare/gpu-mem"
11月 30 17:12:17 slave6 kubelet[1483]: I1130 17:12:17.607507 1483 endpoint.go:179] parsed scheme: ""
11月 30 17:12:17 slave6 kubelet[1483]: I1130 17:12:17.607553 1483 endpoint.go:179] scheme "" not registered, fallback to default scheme
11月 30 17:12:17 slave6 kubelet[1483]: I1130 17:12:17.607594 1483 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/gpushare.sock <nil> 0 <nil>}] <nil> <nil>}
11月 30 17:12:17 slave6 kubelet[1483]: I1130 17:12:17.607614 1483 clientconn.go:933] ClientConn switching balancer to "pick_first"
11月 30 17:12:17 slave6 kubelet[1483]: W1130 17:12:17.607878 1483 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/device-plugins/gpushare.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/gpushare.sock: connect: no such file or directory". Reconnecting...
11月 30 17:12:17 slave6 kubelet[1483]: W1130 17:12:17.653596 1483 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/device-plugins/gpushare.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/gpushare.sock: connect: no such file or directory". Reconnecting...