NVIDIA/gpu-monitoring-tools

dcgm-exporter cannot be installed successfully on 2080Ti

ReyRen opened this issue · 8 comments

root@master:~# helm install --generate-name  gpu-helm-charts/dcgm-exporter --set arguments=null
NAME: dcgm-exporter-1617960354
LAST DEPLOYED: Fri Apr  9 17:25:57 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
  export POD_NAME=$(kubectl get pods -n default -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1617960354" -o jsonpath="{.items[0].metadata.name}")
  kubectl -n default port-forward $POD_NAME 8080:9400 &
  echo "Visit http://127.0.0.1:8080/metrics to use your application"
root@master:~# kubectl logs -f  dcgm-exporter-1617960354-5jxgh
time="2021-04-09T09:26:00Z" level=info msg="Starting dcgm-exporter"
time="2021-04-09T09:26:00Z" level=info msg="DCGM successfully initialized!"
time="2021-04-09T09:26:00Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2021-04-09T09:26:00Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-04-09T09:26:00Z" level=info msg="Pipeline starting"
time="2021-04-09T09:26:00Z" level=info msg="Starting webserver"

It doesn't work even with the --set arguments=null option.

default       dcgm-exporter-1617960354-5jxgh                                    0/1     CrashLoopBackOff    5          2m48s
default       dcgm-exporter-1617960354-74ddv                                    0/1     CrashLoopBackOff    5          2m48s
default       dcgm-exporter-1617960354-7cwq7                                    0/1     CrashLoopBackOff    5          2m48s
default       dcgm-exporter-1617960354-cl525                                    0/1     CrashLoopBackOff    5          2m48s
default       dcgm-exporter-1617960354-jlx66                                    0/1     CrashLoopBackOff    5          2m48s

The helm version I used is v.2.3.1

hello? :)

dbeer commented

ReyRen - you say the exporter didn't start, but I see the message "Starting webserver". Is your issue that it isn't collecting DCP metrics?

@dbeer thanks for the reply.
Yes, the issue is that dcgm-exporter cannot be installed with Helm, and the reason for the CrashLoopBackOff is

root@master:~# kubectl logs -f dcgm-exporter-1618278113-kvdd7
time="2021-04-13T01:41:58Z" level=info msg="Starting dcgm-exporter"
time="2021-04-13T01:41:58Z" level=info msg="DCGM successfully initialized!"
time="2021-04-13T01:41:58Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-04-13T01:41:58Z" level=info msg="Starting webserver"
time="2021-04-13T01:41:58Z" level=info msg="Pipeline starting"

as far as I can tell.
So, how can I work around this?
Thanks again.

I set the following in values.yaml and was able to get past the CrashLoopBackOff error.

extraEnv:
  - name: "DCGM_EXPORTER_INTERVAL"
    value: "5000"

dbeer commented

ReyRen - can you post the error you're seeing where Helm says the exporter can't be installed? I don't see it in your previous posts.

Hi,

DCP metrics (DCGM_FI_PROF_*) are not supported on 2080Ti cards. You need to provide a CSV configuration file without such metrics (they are present in the default CSV config file).

WBR,
Nik
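
For anyone following along, here is a minimal sketch of stripping the DCP fields out of a CSV config, as suggested above. The file names and the sample contents are assumptions made up for illustration; the real default CSV ships with dcgm-exporter and its contents differ between versions.

```shell
# Hypothetical sample standing in for the default CSV config file.
# The real file ships with dcgm-exporter; these rows are illustrative only.
cat > default-counters.csv <<'EOF'
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active PCIe tx data.
EOF

# Drop every line that starts with a DCP (profiling) field name.
# Comment lines and all non-DCP fields are kept unchanged.
grep -v '^DCGM_FI_PROF_' default-counters.csv > counters-no-dcp.csv

# Show the filtered config.
cat counters-no-dcp.csv
```

How the filtered file is handed to the exporter (container argument, mounted ConfigMap, or chart value) depends on the dcgm-exporter and chart version in use, so check the README for your version.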

@nikkon-dev
Which GPUs support the DCP metrics?

Best regards.
Kaka

@Kaka1127,

The DCP metrics are supported on datacenter-grade GPUs (formerly the Tesla brand).

WBR,
Nik