GKE: access DCGM metrics from HPA
JulesBelveze opened this issue · 3 comments
Hi guys,
I have a GKE cluster and I am attempting to perform HPA based on GPU consumption.
I have successfully installed the DCGM exporter and I can observe the DCGM metrics from Prometheus, Grafana and Stackdriver.
However, I am trying to use the DCGM_FI_DEV_GPU_UTIL metric for horizontal autoscaling. I can see it available:
>>> kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_GPU_UTIL
"name": "*/external.googleapis.com|prometheus|DCGM_FI_DEV_GPU_UTIL",
However, the following yaml file:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-scaler
  namespace: dcgm-exporter
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ner
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: external.googleapis.com|prometheus|DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: 30
leads me to the following error:
Warning FailedComputeMetricsReplicas 1s (x4 over 49s) horizontal-pod-autoscaler failed to compute desired number of replicas based on listed metrics for Deployment/dcgm-exporter/ner: invalid metrics (1 invalid out of 1), first error is: failed to get external.googleapis.com|prometheus|DCGM_FI_DEV_GPU_UTIL external metric: unable to get external metric dcgm-exporter/external.googleapis.com|prometheus|DCGM_FI_DEV_GPU_UTIL/nil: unable to fetch metrics from external metrics API: the server could not find the descriptor for metric external.googleapis.com/prometheus/dcgm_fi_dev_gpu_util: googleapi: Error 404: Could not find descriptor for metric 'external.googleapis.com/prometheus/dcgm_fi_dev_gpu_util'., notFound
I have checked the namespace and tried to access the metric as an Object, but without success...
Any idea what could have gone wrong?
Cheers!
Hi,
Is the dcgm-exporter itself configured to collect that metric? It should be listed in the config csv file.
Also, I'd recommend switching to the DCGM_FI_PROF_* group of metrics to monitor utilization. The metric you are using is deprecated.
Hey @nikkon-dev thanks for your answer and the suggestion, I will switch to this group of metrics.
However, DCGM_FI_DEV_GPU_UTIL is indeed listed in the config csv file, and I can actually observe it on my Grafana dashboard. I feel like the issue is only with accessing it from the HPA.
I actually managed to access the DCGM metrics from the HPA by modifying my dcgm-exporter Service and ServiceMonitor (as suggested here) and this shows up:
>>> kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_GPU_UTIL
"name": "namespaces/DCGM_FI_DEV_GPU_UTIL",
"name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL",
"name": "pods/DCGM_FI_DEV_GPU_UTIL",
"name": "services/DCGM_FI_DEV_GPU_UTIL",
Then the metric can be accessed from the HPA as an Object.
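For anyone landing here later, a minimal sketch of what an Object-based HPA could look like, reusing the names from the manifest above. The Service name `dcgm-exporter` in `describedObject` is an assumption (use whatever your Service is actually called, in the same namespace as the HPA), and the target of 30 is simply carried over from the original attempt, so treat this as a starting point rather than a verified configuration:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-scaler
  namespace: dcgm-exporter
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ner
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      metric:
        # Name as listed under services/ in the custom metrics API output
        name: DCGM_FI_DEV_GPU_UTIL
      describedObject:
        apiVersion: v1
        kind: Service
        # Assumed Service name; replace with your own
        name: dcgm-exporter
      target:
        type: Value
        value: 30
```

Whether the scaling status looks sane can then be checked with `kubectl describe hpa my-scaler -n dcgm-exporter`.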