NVIDIA/gpu-monitoring-tools

GKE: access DCGM metrics from HPA

JulesBelveze opened this issue · 3 comments

Hi guys,

I have a GKE cluster and I am trying to set up horizontal pod autoscaling (HPA) based on GPU utilization.
I have successfully installed the DCGM exporter, and I can observe the DCGM metrics in Prometheus, Grafana and Stackdriver.
I now want to use the DCGM_FI_DEV_GPU_UTIL metric for horizontal autoscaling, and I can see that it is available:

>>> kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_GPU_UTIL
"name": "*/external.googleapis.com|prometheus|DCGM_FI_DEV_GPU_UTIL",

However, the following YAML file:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-scaler
  namespace: dcgm-exporter
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ner
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: external.googleapis.com|prometheus|DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: 30

leads me to the following error:

  Warning  FailedComputeMetricsReplicas  1s (x4 over 49s)  horizontal-pod-autoscaler  failed to compute desired number of replicas based on listed metrics for Deployment/dcgm-exporter/ner: invalid metrics (1 invalid out of 1), first error is: failed to get external.googleapis.com|prometheus|DCGM_FI_DEV_GPU_UTIL external metric: unable to get external metric dcgm-exporter/external.googleapis.com|prometheus|DCGM_FI_DEV_GPU_UTIL/nil: unable to fetch metrics from external metrics API: the server could not find the descriptor for metric external.googleapis.com/prometheus/dcgm_fi_dev_gpu_util: googleapi: Error 404: Could not find descriptor for metric 'external.googleapis.com/prometheus/dcgm_fi_dev_gpu_util'., notFound

I have checked the namespace and tried to access the metric as an Object, but without success.
Any idea what could have gone wrong?

Cheers!

Hi,
Is the dcgm-exporter itself configured to collect that metric? It should be listed in the config csv file.
Also, I'd recommend switching to the DCGM_FI_PROF_* group of metrics to monitor utilization. The metric you are using is deprecated.
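For reference, enabling one of those metrics just means making sure its line is present (and not commented out) in the CSV file the exporter is started with. A rough sketch of what the relevant lines look like, assuming the default field-name / Prometheus-type / help-string layout; the exact file path and help text depend on your install, and the descriptions below are paraphrased:

# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_GPU_UTIL,          gauge, GPU utilization (in %).
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
DCGM_FI_PROF_SM_ACTIVE,        gauge, Ratio of cycles at least one SM is busy.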

Hey @nikkon-dev, thanks for your answer and the suggestion; I will switch to that group of metrics.

However, DCGM_FI_DEV_GPU_UTIL is indeed listed in the config CSV file, and I can observe it on my Grafana dashboard. The problem seems to be only with accessing it from the HPA.

I actually managed to access the DCGM metrics from the HPA by modifying my dcgm-exporter Service and ServiceMonitor (as suggested here; see the sketch after the output below), and this shows up:

>>> kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_GPU_UTIL
      "name": "namespaces/DCGM_FI_DEV_GPU_UTIL",
      "name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL",
      "name": "pods/DCGM_FI_DEV_GPU_UTIL",
      "name": "services/DCGM_FI_DEV_GPU_UTIL",

The metric can then be accessed from the HPA as an Object.
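For completeness, the metrics section of the HPA now looks roughly like this. A sketch only, assuming the dcgm-exporter Service above as the described object and an arbitrary target value of 30:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-scaler
  namespace: dcgm-exporter
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ner
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Object
      object:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        describedObject:          # the object the metric is attached to
          apiVersion: v1
          kind: Service
          name: dcgm-exporter
        target:
          type: Value
          value: "30"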