Unable to schedule pod with: Insufficient aliyun.com/gpu-mem
k0nstantinv opened this issue · 1 comment
k0nstantinv commented
Hi! I've installed all the software from the docs https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md
I've configured all the docker/k8s components, but scheduler still can't assign pod to node with:
Warning FailedScheduling 4m25s (x23 over 20m) default-scheduler 0/72 nodes are available: 72 Insufficient aliyun.com/gpu-mem.
Everything seems to be running correctly on my nodes:
gpushare-device-plugin-ds-5wpdx 1/1 Running 0 5m50s 10.48.171.12 node-gpu13 <none> <none>
gpushare-device-plugin-ds-5xdfm 1/1 Running 0 5m50s 10.48.171.35 node-gpu03 <none> <none>
gpushare-device-plugin-ds-7hw6d 1/1 Running 0 5m50s 10.48.171.17 node-gpu04 <none> <none>
gpushare-device-plugin-ds-7zwd9 1/1 Running 0 5m50s 10.48.167.16 node-gpu09 <none> <none>
gpushare-device-plugin-ds-9zdvn 1/1 Running 0 5m50s 10.48.171.13 node-gpu12 <none> <none>
gpushare-device-plugin-ds-fztlx 1/1 Running 0 5m50s 10.48.171.18 node-gpu02 <none> <none>
gpushare-device-plugin-ds-g975b 1/1 Running 0 5m49s 10.48.163.19 node-gpu14 <none> <none>
gpushare-device-plugin-ds-grfnf 1/1 Running 0 5m50s 10.48.171.14 node-gpu11 <none> <none>
gpushare-device-plugin-ds-jjjzj 1/1 Running 0 5m50s 10.48.163.20 node-gpu08 <none> <none>
gpushare-device-plugin-ds-k4kbl 1/1 Running 0 5m50s 10.48.167.17 node-gpu10 <none> <none>
gpushare-device-plugin-ds-m29s9 1/1 Running 0 5m50s 10.48.163.22 node-gpu07 <none> <none>
gpushare-device-plugin-ds-p65cq 1/1 Running 0 5m50s 10.48.163.23 node-gpu06 <none> <none>
gpushare-device-plugin-ds-rf5x5 1/1 Running 0 5m50s 10.48.167.18 node-gpu01 <none> <none>
gpushare-device-plugin-ds-xxqxh 1/1 Running 0 5m50s 10.48.163.24 node-gpu05 <none> <none>
gpushare-schd-extender-68dfcdb465-m2m6z 1/1 Running 0 37m 10.48.204.105 master01 <none> <none>
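For reference, the device plugin logs can be checked with something like the following (assuming the kube-system namespace used by the install manifests; substitute any of the pod names above):

kubectl -n kube-system logs gpushare-device-plugin-ds-rf5x5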
master01:~# kubectl inspect gpushare
NAME IPADDRESS GPU0(Allocated/Total) GPU Memory(GiB)
node-gpu03 10.48.171.35 0/31 0/31
node-gpu09 10.48.167.16 0/31 0/31
node-gpu11 10.48.171.14 0/31 0/31
node-gpu14 10.48.163.19 0/31 0/31
node-gpu01 10.48.167.18 0/31 0/31
node-gpu04 10.48.171.17 0/31 0/31
node-gpu07 10.48.163.22 0/31 0/31
node-gpu05 10.48.163.24 0/31 0/31
node-gpu08 10.48.163.20 0/31 0/31
node-gpu10 10.48.167.17 0/31 0/31
node-gpu13 10.48.171.12 0/31 0/31
node-gpu02 10.48.171.18 0/31 0/31
node-gpu06 10.48.163.23 0/31 0/31
node-gpu12 10.48.171.13 0/31 0/31
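To cross-check what the API server itself reports for the extended resource (independent of the inspect plugin), something like this should work; it mirrors the usual nvidia.com/gpu custom-columns trick, with the dots in the resource name escaped:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU_MEM:.status.allocatable.aliyun\.com/gpu-mem"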
Scheduler output:
Aug 04 17:57:28 master01 kube-scheduler[17483]: I0804 17:57:28.955978 17483 factory.go:341] Creating scheduler from configuration: {{ } [] [] [{http://127.0.0.1:32766/gpushare-scheduler filter 0 bind false <nil> 0s true [{aliyun.com/gpu-mem false}] false}] 0 false}
...
Aug 04 18:38:44 master01 kube-scheduler[53986]: I0804 18:38:44.654499 53986 factory.go:382] Creating extender with config {URLPrefix:http://127.0.0.1:32766/gpushare-scheduler FilterVerb:filter PreemptVerb: PrioritizeVerb: Weight:0 BindVerb:bind EnableHTTPS:false TLSConfig:<nil> HTTPTimeout:0s NodeCacheCapable:true ManagedResources:[{Name:aliyun.com/gpu-mem IgnoredByScheduler:false}] Ignorable:false}
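For what it's worth, here is a rough way to double-check that the running scheduler picked up the extender policy and that the extender is reachable from the scheduler host (the flag inspection and the endpoint below are assumptions based on the install docs, not something confirmed beyond the logs above):

# look for the --policy-config-file flag the scheduler was started with
ps -ef | grep '[k]ube-scheduler'
# any HTTP status code here (even 404/405) proves the extender is listening on the NodePort
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:32766/gpushare-scheduler/filter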
Typical outputs from one of my GPU nodes:
- kubectl describe:
Hostname: node-gpu01
Capacity:
aliyun.com/gpu-count: 1
aliyun.com/gpu-mem: 31
...
Allocatable:
aliyun.com/gpu-count: 1
aliyun.com/gpu-mem: 31
- kubelet:
node-gpu01 kubelet[69306]: I0804 17:53:28.207639 69306 setters.go:283] Update capacity for aliyun.com/gpu-mem to 31
- docker nvidia-smi:
node-gpu01:~# docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Unable to find image 'nvidia/cuda:10.0-base' locally
10.0-base: Pulling from nvidia/cuda
7ddbc47eeb70: Pull complete
c1bbdc448b72: Pull complete
8c3b70e39044: Pull complete
45d437916d57: Pull complete
d8f1569ddae6: Pull complete
de5a2c57c41d: Pull complete
ea6f04a00543: Pull complete
Digest: sha256:e6e1001f286d084f8a3aea991afbcfe92cd389ad1f4883491d43631f152f175e
Status: Downloaded newer image for nvidia/cuda:10.0-base
Tue Aug 4 14:08:26 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 32C P0 25W / 250W | 12MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
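As far as I understand the install docs, nvidia also has to be docker's default runtime on the GPU nodes (not just available via --gpus). A quick way to verify (the daemon.json path is the stock default and may differ per distro):

docker info 2>/dev/null | grep -i 'default runtime'
cat /etc/docker/daemon.json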
So, here is the Pod gpu-player, using the exact same image as in the demo video, which can't be scheduled due to Insufficient aliyun.com/gpu-mem:
kubectl -n gpu-test describe pod gpu-player-f576f5dd4-njhrs
Name: gpu-player-f576f5dd4-njhrs
Namespace: gpu-test
Priority: 100
PriorityClassName: default-priority
Node: <none>
Labels: app=gpu-player
pod-template-hash=f576f5dd4
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/gpu-player-f576f5dd4
Containers:
gpu-player:
Image: cheyang/gpu-player
Port: <none>
Host Port: <none>
Limits:
aliyun.com/gpu-mem: 512
Requests:
aliyun.com/gpu-mem: 512
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-mjdsm (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-mjdsm:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-mjdsm
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
pool=automated-moderation:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m15s (x895 over 17h) default-scheduler 0/72 nodes are available: 72 Insufficient aliyun.com/gpu-mem.
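For completeness, the Deployment behind this pod is essentially the one from the demo. Reconstructed roughly from the describe output above (priorityClassName, tolerations, and the pod-template-hash label omitted; not the exact manifest, just a sketch of it):

cat <<'EOF' | kubectl -n gpu-test apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-player
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-player
  template:
    metadata:
      labels:
        app: gpu-player
    spec:
      containers:
      - name: gpu-player
        image: cheyang/gpu-player
        resources:
          limits:
            aliyun.com/gpu-mem: 512   # the same 512 units that show up in Requests/Limits above
EOF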
It looks like my k8s scheduler doesn't know about the custom aliyun.com/gpu-mem resource. What's wrong?
I didn't find any errors in the logs, but I'm ready to post any additional logs or version info if necessary.
k0nstantinv commented