NVIDIA/gpu-monitoring-tools

Sudden error message about /run/prometheus/dcgm.prom

kubernetian opened this issue · 0 comments

Hi All

We are using the DCGM exporter for GPU monitoring. While we were implementing it, everything worked perfectly. But for the past week or two (at least, that is how long I have been aware of the problem) we have been getting an error message that I think has something to do with the structure of the /run/prometheus/dcgm.prom file.

I was wondering if any of you already encountered / solved this issue?

time="2020-11-09T07:46:25Z" level=error msg="Error parsing \"/run/prometheus/dcgm.prom\": text format parsing error in line 488: second TYPE line for metric name \"dcgm_sm_clock\", or TYPE reported after samples" source="textfile.go:212"
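
To narrow down what the parser is complaining about, the file can be scanned for the two conditions named in the error: a repeated TYPE line for one metric, or a TYPE line that appears after samples of that metric. The following is only a rough sketch of such a check, not the logic node-exporter itself uses. The function name `find_type_conflicts` is made up for illustration, and the sample-name matching is simplified (it ignores histogram and summary suffixes such as `_bucket`).

```python
import re

# Matches "# TYPE <metric_name> <type>" lines in the Prometheus text format.
TYPE_LINE = re.compile(r"^#\s+TYPE\s+(\S+)\s+(\S+)")
# Matches the metric name at the start of a sample line.
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def find_type_conflicts(text):
    """Return (line_number, metric, reason) tuples for TYPE lines that
    repeat or that appear after samples of the same metric.

    Hypothetical helper for illustration; simplified compared to a real parser.
    """
    typed = set()    # metrics that already have a TYPE line
    sampled = set()  # metrics that already have at least one sample
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        match = TYPE_LINE.match(line)
        if match:
            name = match.group(1)
            if name in typed:
                problems.append((lineno, name, "second TYPE line"))
            elif name in sampled:
                problems.append((lineno, name, "TYPE after samples"))
            typed.add(name)
            continue
        if line.startswith("#"):
            continue  # HELP lines and other comments
        match = SAMPLE_LINE.match(line)
        if match:
            sampled.add(match.group(1))
    return problems
```

Calling it on the contents of /run/prometheus/dcgm.prom (for example `find_type_conflicts(open("/run/prometheus/dcgm.prom").read())` from inside the container) should list the offending line numbers, which can then be compared with line 488 from the error.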

I can't really give more input because I have barely any information on what changed in the setup. I know that we moved from a manual deployment to a Flux deployment, but we're using the exact same YAML file.

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: kube-metrics
spec:
  template:
    metadata:
      labels:
        app: dcgm-exporter
      name: dcgm-exporter
    spec:
      nodeSelector:
        hardware-type: NVIDIAGPU
      containers:
      - image: quay.io/prometheus/node-exporter:v0.16.0
        name: node-exporter
        args:
        - "--web.listen-address=0.0.0.0:9101"
        - "--path.procfs=/host/proc"
        - "--path.sysfs=/host/sys"
        - "--collector.textfile.directory=/run/prometheus"
        - "--no-collector.arp"
        - "--no-collector.bcache"
        - "--no-collector.bonding"
        - "--no-collector.conntrack"
        - "--no-collector.cpu"
        - "--no-collector.diskstats"
        - "--no-collector.edac"
        - "--no-collector.entropy"
        - "--no-collector.filefd"
        - "--no-collector.filesystem"
        - "--no-collector.hwmon"
        - "--no-collector.infiniband"
        - "--no-collector.ipvs"
        - "--no-collector.loadavg"
        - "--no-collector.mdadm"
        - "--no-collector.meminfo"
        - "--no-collector.netdev"
        - "--no-collector.netstat"
        - "--no-collector.nfs"
        - "--no-collector.nfsd"
        - "--no-collector.sockstat"
        - "--no-collector.stat"
        - "--no-collector.time"
        - "--no-collector.timex"
        - "--no-collector.uname"
        - "--no-collector.vmstat"
        - "--no-collector.wifi"
        - "--no-collector.xfs"
        - "--no-collector.zfs"
        ports:
        - name: metrics
          containerPort: 9101
          hostPort: 9101
        resources:
          requests:
            memory: 30Mi
            cpu: 100m
          limits:
            memory: 50Mi
            cpu: 200m
        volumeMounts:
        - name: proc
          readOnly: true
          mountPath: /host/proc
        - name: sys
          readOnly: true
          mountPath: /host/sys
        - name: collector-textfiles
          readOnly: true
          mountPath: /run/prometheus
      - image: nvidia/dcgm-exporter:1.4.6
        name: nvidia-dcgm-exporter
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: collector-textfiles
          mountPath: /run/prometheus

      hostNetwork: true
      hostPID: true

      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      - name: collector-textfiles
        emptyDir: {}

Thanks in advance!
Joris