GoogleCloudPlatform/container-engine-accelerators

device plugin to emit metrics?

lsjostro opened this issue · 4 comments

Just a question: is the device plugin responsible for emitting metrics for, say, allocation, or is this handled by kubelet? cAdvisor previously had accelerator metrics, which are now deprecated in k8s 1.11.

As far as I can see, kubelet currently only has the kubelet_device_plugin_alloc_latency_microseconds metric.

Would you be willing to accept a PR for exposing a metrics endpoint?

cAdvisor will continue to be the source of accelerator metrics until a replacement is ready. Currently, it exposes three metrics: duty_cycle, memory_used and memory_total.

Not sure exactly what you mean by allocation metrics, but currently heapster (which is also deprecated) exposes how many GPUs a pod has requested. The other place such metrics are exposed is kube-state-metrics.

@dashpole is working on a proposal that will give device plugins enough information to properly expose pod and container level metrics.

What exact metrics do you want to expose?

Alright, cool. I would like to have metrics on how many GPUs are allocated and in use within a node pool, for instance. You previously got that from kube-state-metrics (kube_node_status_allocatable_nvidia_gpu_cards) when the acceleration feature flag was enabled in kubelet. Now that it is deprecated in 1.11, I just wonder how I can get that back?

I just checked out kube-state-metrics and it supports extended resources now.

So the metrics you want are kube_node_status_capacity and kube_node_status_allocatable for nodes, and kube_pod_container_resource_requests and kube_pod_container_resource_limits for containers. All of them have a resource field, which should contain nvidia_com_gpu (I don't know why they didn't use nvidia.com/gpu, but anyway).
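
For illustration, here is a rough Python sketch of how those metrics could be summed per node from a scrape of kube-state-metrics. It is only a sketch under assumptions: the sample text, the node and pod names, the values, and the exact label set (node, resource, unit) are made up for the example, and the helper sum_by_node is hypothetical, so check what your kube-state-metrics version actually emits.

```python
import re
from collections import defaultdict

# Hypothetical scrape output: node names, pod names, values and the exact
# label set are assumptions for this example, not taken from a real cluster.
SAMPLE = '''\
# TYPE kube_node_status_allocatable gauge
kube_node_status_allocatable{node="gpu-node-1",resource="nvidia_com_gpu",unit="integer"} 4
kube_node_status_allocatable{node="gpu-node-2",resource="nvidia_com_gpu",unit="integer"} 4
kube_node_status_allocatable{node="gpu-node-1",resource="cpu",unit="core"} 16
kube_pod_container_resource_requests{namespace="default",pod="train-a",container="main",node="gpu-node-1",resource="nvidia_com_gpu",unit="integer"} 2
kube_pod_container_resource_requests{namespace="default",pod="train-b",container="main",node="gpu-node-2",resource="nvidia_com_gpu",unit="integer"} 1
'''

# Matches lines of the form: metric_name{label="value",...} number
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sum_by_node(text, metric, resource="nvidia_com_gpu"):
    """Sum the samples of `metric` per node, keeping only the given resource."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        # Comment lines (starting with '#') and other metrics are skipped.
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        if labels.get("resource") == resource:
            totals[labels.get("node", "")] += float(match.group(3))
    return dict(totals)


allocatable = sum_by_node(SAMPLE, "kube_node_status_allocatable")
requested = sum_by_node(SAMPLE, "kube_pod_container_resource_requests")

for node in sorted(allocatable):
    print(node, "allocatable:", allocatable[node],
          "requested:", requested.get(node, 0.0))
```

Comparing the two dictionaries per node gives a rough view of how many GPUs are allocatable versus requested. Grouping by node pool would need an extra node-to-pool mapping, which is not shown here.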

Thanks! I’ll look into that! Cheers!