[BUG] Dashboard cannot detect GPUs
CaRRotOne opened this issue · 5 comments
CaRRotOne commented
What happened:
The dashboard cannot detect the GPUs. The GPUs are NVIDIA cards, and the corresponding NVIDIA plugin is already installed in K8s.
What you expected to happen:
Fix the GPU display problem in the dashboard.
How to reproduce it:
Anything else we need to know?:
Environment:
SimonCqk commented
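@CaRRotOne OK, we'll take a look.
Could you provide the cluster's node information, whether the NVIDIA plugin is working properly, and whether the node's reported allocatable resources match expectations?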
CaRRotOne commented
@SimonCqk
The cluster is a K8s cluster set up with Rancher; details are below. The node information shows a GPU Capacity of 4 but an Allocatable GPU count of 0.
kubectl cluster-info
Kubernetes master is running at https://ml.rancher.pudu.cn:9443/k8s/clusters/c-tzdlr
CoreDNS is running at https://ml.rancher.pudu.cn:9443/k8s/clusters/c-tzdlr/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
dell-poweredge-t640 Ready controlplane,etcd,worker 2d23h v1.17.17
The NVIDIA plugin is running normally:
kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-sfckg 1/1 Running 0 41m
Node information:
Name: dell-poweredge-t640
Roles: controlplane,etcd,worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=dell-poweredge-t640
kubernetes.io/os=linux
node-role.kubernetes.io/controlplane=true
node-role.kubernetes.io/etcd=true
node-role.kubernetes.io/worker=true
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"f6:e5:f7:7e:d8:d6"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.161.90
node.alpha.kubernetes.io/ttl: 0
rke.cattle.io/external-ip: 192.168.161.90
rke.cattle.io/internal-ip: 192.168.161.90
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 24 Sep 2021 05:35:25 -0400
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: dell-poweredge-t640
AcquireTime: <unset>
RenewTime: Mon, 27 Sep 2021 08:20:40 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Mon, 27 Sep 2021 08:13:04 -0400 Mon, 27 Sep 2021 08:13:04 -0400 FlannelIsUp Flannel is running on this node
MemoryPressure False Mon, 27 Sep 2021 08:19:02 -0400 Fri, 24 Sep 2021 05:35:25 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 27 Sep 2021 08:19:02 -0400 Fri, 24 Sep 2021 05:35:25 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 27 Sep 2021 08:19:02 -0400 Fri, 24 Sep 2021 05:35:25 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 27 Sep 2021 08:19:02 -0400 Mon, 27 Sep 2021 08:13:01 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.161.90
Hostname: dell-poweredge-t640
Capacity:
cpu: 40
ephemeral-storage: 459924552Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131501508Ki
nvidia.com/gpu: 4
pods: 110
Allocatable:
cpu: 40
ephemeral-storage: 423866466422
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131399108Ki
nvidia.com/gpu: 0
pods: 110
System Info:
Machine ID: a8eb6cac33e701ae867269db5ce80e7f
System UUID: 4c4c4544-0058-3010-8038-b3c04f4a4633
Boot ID: aa03ac05-95f8-4d85-9c14-48d761375c2d
Kernel Version: 5.4.0-42-generic
OS Image: Ubuntu 18.04.5 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.7
Kubelet Version: v1.17.17
Kube-Proxy Version: v1.17.17
PodCIDR: 10.42.0.0/24
PodCIDRs: 10.42.0.0/24
Non-terminated Pods: (24 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
cattle-prometheus exporter-kube-state-cluster-monitoring-5dd6d5c9fd-qzfq4 100m (0%) 100m (0%) 130Mi (0%) 200Mi (0%) 34h
cattle-prometheus exporter-node-cluster-monitoring-r9f79 100m (0%) 200m (0%) 30Mi (0%) 200Mi (0%) 34h
cattle-prometheus grafana-cluster-monitoring-75c5cd5995-m77pz 150m (0%) 300m (0%) 150Mi (0%) 300Mi (0%) 34h
cattle-prometheus prometheus-cluster-monitoring-0 1100m (2%) 1800m (4%) 950Mi (0%) 1350Mi (1%) 34h
cattle-prometheus prometheus-operator-monitoring-operator-f9b9567b-hklgl 100m (0%) 200m (0%) 100Mi (0%) 500Mi (0%) 34h
cattle-system cattle-cluster-agent-6cc5cdcc54-5sq4j 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d22h
cattle-system cattle-node-agent-867qh 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d22h
cattle-system kube-api-auth-t24lb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d2h
ingress-nginx nginx-ingress-controller-ftjzp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d2h
istio-system istio-citadel-66864ff6b8-smnbx 10m (0%) 0 (0%) 0 (0%) 0 (0%) 34h
istio-system istio-galley-5bd9bf8b9c-wc8gg 10m (0%) 0 (0%) 0 (0%) 0 (0%) 34h
istio-system istio-pilot-674bdcbbf9-v2zcl 600m (1%) 3 (7%) 2176Mi (1%) 5Gi (3%) 34h
istio-system istio-policy-6d9f4577db-s96ht 1100m (2%) 6800m (17%) 1152Mi (0%) 5Gi (3%) 34h
istio-system istio-sidecar-injector-9bcfb645-vp22d 10m (0%) 0 (0%) 0 (0%) 0 (0%) 34h
istio-system istio-telemetry-664b6dfd44-df5sq 1100m (2%) 6800m (17%) 1152Mi (0%) 5Gi (3%) 34h
istio-system istio-tracing-cc6c8c677-crd6g 100m (0%) 500m (1%) 100Mi (0%) 1Gi (0%) 34h
istio-system kiali-79c4c46468-pb5dv 10m (0%) 0 (0%) 0 (0%) 0 (0%) 34h
kube-system coredns-6b84d75d99-5dvkt 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 3d2h
kube-system coredns-autoscaler-5c4b6999d9-qt25w 20m (0%) 0 (0%) 10Mi (0%) 0 (0%) 3d2h
kube-system kube-flannel-slmh5 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 3d2h
kube-system metrics-server-7579449c57-t9jmf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d2h
kube-system nvidia-device-plugin-daemonset-sfckg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4h15m
kubedl-system kubedl-7f4c55dfc9-8n2pc 1024m (2%) 2048m (5%) 1Gi (0%) 2Gi (1%) 150m
kubedl-system kubedl-dashboard-787b49c8d7-7lbmg 1 (2%) 0 (0%) 500Mi (0%) 0 (0%) 150m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 6734m (16%) 21848m (54%)
memory 7594Mi (5%) 21202Mi (16%)
ephemeral-storage 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeAllocatableEnforced 7m49s kubelet, dell-poweredge-t640 Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 7m48s (x3 over 7m49s) kubelet, dell-poweredge-t640 Node dell-poweredge-t640 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 7m48s (x3 over 7m49s) kubelet, dell-poweredge-t640 Node dell-poweredge-t640 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 7m48s (x3 over 7m49s) kubelet, dell-poweredge-t640 Node dell-poweredge-t640 status is now: NodeHasSufficientPID
Normal NodeNotReady 7m48s kubelet, dell-poweredge-t640 Node dell-poweredge-t640 status is now: NodeNotReady
Normal NodeReady 7m47s kubelet, dell-poweredge-t640 Node dell-poweredge-t640 status is now: NodeReady
Normal Starting 7m46s kube-proxy, dell-poweredge-t640 Starting kube-proxy.
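The mismatch in the node description above (nvidia.com/gpu is 4 under Capacity but 0 under Allocatable) can also be checked with a script. The following is a minimal sketch, assuming the text comes from `kubectl describe node <name>`; the helper names `parse_resources` and `gpu_counts` are hypothetical, and the parser only understands the simple "Header:" / "key: value" layout shown in this thread.

```python
SECTIONS = ("Capacity", "Allocatable")


def parse_resources(describe_text):
    """Return {"Capacity": {...}, "Allocatable": {...}} from describe-node text."""
    found = {name: {} for name in SECTIONS}
    current = None  # tracked section we are currently inside, if any
    for raw in describe_text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # neither a "Header:" nor a "key: value" line
        key, value = key.strip(), value.strip()
        if not value:
            # A bare "Header:" line opens a new section, tracked or not.
            current = key if key in found else None
        elif current is not None:
            found[current][key] = value
    return found


def gpu_counts(describe_text, resource="nvidia.com/gpu"):
    """Return (capacity, allocatable) for the resource; missing entries count as 0."""
    resources = parse_resources(describe_text)
    return tuple(int(resources[name].get(resource, "0")) for name in SECTIONS)


# Excerpt of the node description posted in this thread.
SAMPLE = """
Capacity:
  cpu: 40
  nvidia.com/gpu: 4
  pods: 110
Allocatable:
  cpu: 40
  nvidia.com/gpu: 0
  pods: 110
System Info:
  Kernel Version: 5.4.0-42-generic
"""

if __name__ == "__main__":
    capacity, allocatable = gpu_counts(SAMPLE)
    print(f"capacity={capacity} allocatable={allocatable}")
    if allocatable < capacity:
        print("GPUs are registered on the node but none are currently allocatable")
```

Parsing the human-readable output is fragile; reading `status.capacity` and `status.allocatable` from `kubectl get node <name> -o json` would likely be more robust.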
CaRRotOne commented
Resolved. Docker's default-runtime needs to be set to nvidia.
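For readers hitting the same symptom, here is a minimal sketch of that change. It assumes the Docker daemon configuration lives at /etc/docker/daemon.json and that the NVIDIA runtime binary is at /usr/bin/nvidia-container-runtime; neither path is stated in this thread, and the function name is hypothetical. The function only builds the updated configuration and does not write any file.

```python
import json


def with_nvidia_default_runtime(daemon_json_text,
                                runtime_path="/usr/bin/nvidia-container-runtime"):
    """Return the daemon config as a dict with nvidia set as the default runtime.

    Existing keys are preserved. runtime_path is an assumed install location.
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    runtimes = config.setdefault("runtimes", {})
    # Register the nvidia runtime only if it is not already configured.
    runtimes.setdefault("nvidia", {"path": runtime_path, "runtimeArgs": []})
    config["default-runtime"] = "nvidia"
    return config


if __name__ == "__main__":
    # Start from an empty config here; pass the real file contents instead.
    print(json.dumps(with_nvidia_default_runtime("{}"), indent=4))
```

After the resulting JSON is saved to the daemon configuration file, Docker has to be restarted for it to take effect, and the device plugin pod may need to be recreated before Allocatable reports the GPUs.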
SimonCqk commented
Thanks for following up. I'll close this issue then :)