AliyunContainerService/gpushare-device-plugin

No assume timestamp for pod tf-jupyter-... so it's not GPUSharedAssumed assumed pod.

jear opened this issue · 2 comments

jear commented

Hi, I was able to get this working on a kubespray k8s 1.13.5 cluster with a single worker node and a single GPU.
But I hit a bug on a k8s 1.15.3 single-node cluster with two GPUs.

Can you help?

k describe pod tf-jupyter-67b475bf4d-4v2nf
...
Warning Failed (x2 over ) kubelet, node-2gpu Error: failed to start container "tensorflow": Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-8129MiB-to-run\\n\""": unknown

[root@node-2gpu ~]# docker logs -f 575171f1ff33

I1112 23:06:01.678676 1 allocate.go:46] ----Allocating GPU for gpu mem is started----
I1112 23:06:01.678717 1 allocate.go:57] RequestPodGPUs: 8129
I1112 23:06:01.678733 1 allocate.go:61] checking...
I1112 23:06:01.705009 1 podmanager.go:112] all pod list [{{ } {tf-jupyter-67b475bf4d-4v2nf tf-jupyter-67b475bf4d- jhub /api/v1/namespaces/jhub/pods/tf-jupyter-67b475bf4d-4v2nf a66921cd-bded-460b-bf4d-beb35c17229a 16993630 0 2019-11-12 17:22:48 +0000 UTC map[app:tf-jupyter pod-template-hash:67b475bf4d] map[] [{apps/v1 ReplicaSet tf-jupyter-67b475bf4d 74a14098-b83d-419f-a8eb-d9bb6fe0ea93 0xc4204a65a7 0xc4204a65a8}] nil [] } {[{bin {&HostPathVolumeSource{Path:/usr/bin,Type:,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}} {lib {&HostPathVolumeSource{Path:/usr/lib,Type:,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}} {default-token-kjd8r {nil nil nil nil nil &SecretVolumeSource{SecretName:default-token-kjd8r,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}] [] [{tensorflow tensorflow/tensorflow:1.12.0-gpu [] [] [{ 0 8888 TCP }] [] [] {map[aliyun.com/gpu-mem:{{8129 0} {} 8129 DecimalSI}] map[aliyun.com/gpu-mem:{{8129 0} {} 8129 DecimalSI}]} [{bin false /usr/local/nvidia/bin } {lib false /usr/local/nvidia/lib } {default-token-kjd8r true /var/run/secrets/kubernetes.io/serviceaccount }] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}] Always 0xc4204a6850 ClusterFirst map[accelerator:nvidia-tesla-m6] default default node-2gpu false false false &PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],} [] nil default-scheduler [{node.kubernetes.io/not-ready Exists NoExecute 0xc4204a6960} {node.kubernetes.io/unreachable Exists NoExecute 0xc4204a6980}] [] 0xc4204a6990 nil []} {Pending [{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-11-12 17:22:48 +0000 UTC }] [] [] BestEffort}}]
I1112 23:06:01.705505 1 podmanager.go:123] list pod tf-jupyter-67b475bf4d-4v2nf in ns jhub in node node-2gpu and status is Pending
I1112 23:06:01.705555 1 podutils.go:81] No assume timestamp for pod tf-jupyter-67b475bf4d-4v2nf in namespace jhub, so it's not GPUSharedAssumed assumed pod.
W1112 23:06:01.705573 1 allocate.go:152] invalid allocation requst: request GPU memory 8129 can't be satisfied.
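
For anyone reading the log above: the pod dump at podmanager.go:112 shows an empty annotations map (`map[]` right after the labels), and the podutils.go:81 line says the pod has no assume timestamp. That suggests the plugin only treats a pending pod as a candidate if the scheduler extender has already annotated it. Below is a minimal Python sketch of that kind of check. The annotation key `ALIYUN_COM_GPU_MEM_ASSUME_TIME` and the helper name `is_gpu_share_assumed` are assumptions for illustration and are not confirmed anywhere in this thread; the actual plugin is written in Go.

```python
# Assumed annotation key (not confirmed in this thread).
ASSUME_TIME_ANNOTATION = "ALIYUN_COM_GPU_MEM_ASSUME_TIME"


def is_gpu_share_assumed(pod: dict) -> bool:
    """Return True if the pod dict carries a positive integer assume timestamp.

    `pod` is expected to look like the JSON from `kubectl get pod -o json`.
    Hypothetical helper; it only mirrors what the log message implies.
    """
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    value = annotations.get(ASSUME_TIME_ANNOTATION)
    if not value:
        return False
    try:
        return int(value) > 0
    except ValueError:
        return False


# The pod from the log: scheduled to node-2gpu, but with no annotations at all.
pending_pod = {
    "metadata": {
        "name": "tf-jupyter-67b475bf4d-4v2nf",
        "namespace": "jhub",
        "annotations": {},
    }
}

# What a pod handled by the scheduler extender might look like (made-up timestamp).
assumed_pod = {
    "metadata": {
        "name": "example",
        "namespace": "jhub",
        "annotations": {ASSUME_TIME_ANNOTATION: "1573599961678676000"},
    }
}

print(is_gpu_share_assumed(pending_pod))
print(is_gpu_share_assumed(assumed_pod))
```

If the real pod shows no such annotation in `kubectl get pod -o yaml`, that would point at the scheduler extender not being called for it, rather than at the device plugin itself.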

jear commented

Fixed the configuration and restarted the scheduler.

Hi, how did you fix this issue? I hit the same error when I set a GPU limit on my pod.
The strange thing is that when I set NVIDIA_VISIBLE_DEVICES=0 the error goes away, but the limit still does not work.
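
A note on why that override may hide the error without enforcing anything: the device id in the kubelet error, `no-gpu-has-8129MiB-to-run`, does not look like a real GPU id. It reads like a placeholder passed to the NVIDIA runtime after the allocation was rejected, so replacing it by hand with `0` would let the container start on GPU 0 without any gpu-mem accounting. This is my reading of the error string, not something confirmed in this thread. A small Python sketch for spotting such a placeholder (the function name and the pattern are hypothetical, derived only from the one string in the error above):

```python
import re
from typing import Optional

# Pattern inferred from the single error string in this issue; an assumption.
PLACEHOLDER_RE = re.compile(r"^no-gpu-has-(\d+)MiB-to-run$")


def requested_mib_from_placeholder(device_id: str) -> Optional[int]:
    """Return the requested MiB if device_id is a failure placeholder, else None."""
    match = PLACEHOLDER_RE.match(device_id)
    return int(match.group(1)) if match else None


print(requested_mib_from_placeholder("no-gpu-has-8129MiB-to-run"))
print(requested_mib_from_placeholder("0"))
```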