dcgm-exporter cannot be installed successfully on 2080Ti
ReyRen opened this issue · 8 comments
root@master:~# helm install --generate-name gpu-helm-charts/dcgm-exporter --set arguments=null
NAME: dcgm-exporter-1617960354
LAST DEPLOYED: Fri Apr 9 17:25:57 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
export POD_NAME=$(kubectl get pods -n default -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1617960354" -o jsonpath="{.items[0].metadata.name}")
kubectl -n default port-forward $POD_NAME 8080:9400 &
echo "Visit http://127.0.0.1:8080/metrics to use your application"
root@master:~# kubectl logs -f dcgm-exporter-1617960354-5jxgh
time="2021-04-09T09:26:00Z" level=info msg="Starting dcgm-exporter"
time="2021-04-09T09:26:00Z" level=info msg="DCGM successfully initialized!"
time="2021-04-09T09:26:00Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2021-04-09T09:26:00Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-04-09T09:26:00Z" level=info msg="Pipeline starting"
time="2021-04-09T09:26:00Z" level=info msg="Starting webserver"
It doesn't work even with the --set arguments=null option.
default dcgm-exporter-1617960354-5jxgh 0/1 CrashLoopBackOff 5 2m48s
default dcgm-exporter-1617960354-74ddv 0/1 CrashLoopBackOff 5 2m48s
default dcgm-exporter-1617960354-7cwq7 0/1 CrashLoopBackOff 5 2m48s
default dcgm-exporter-1617960354-cl525 0/1 CrashLoopBackOff 5 2m48s
default dcgm-exporter-1617960354-jlx66 0/1 CrashLoopBackOff 5 2m48s
The Helm version I used is v2.3.1
hello? :)
ReyRen - you say the exporter didn't start, but I see the message "Starting webserver". Is your issue that it isn't collecting DCP metrics?
@dbeer thanks for the reply.
Yes, the issue is "dcgm-exporter cannot be installed" when installing with Helm, and the cause of the CrashLoopBackOff is
root@master:~# kubectl logs -f dcgm-exporter-1618278113-kvdd7
time="2021-04-13T01:41:58Z" level=info msg="Starting dcgm-exporter"
time="2021-04-13T01:41:58Z" level=info msg="DCGM successfully initialized!"
time="2021-04-13T01:41:58Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-04-13T01:41:58Z" level=info msg="Starting webserver"
time="2021-04-13T01:41:58Z" level=info msg="Pipeline starting"
as far as I noticed.
So, how can I work around this?
Thanks again
I set the following in values.yaml and was able to overcome the CrashLoopBackOff error.
extraEnv:
  - name: "DCGM_EXPORTER_INTERVAL"
    value: "5000"
ReyRen - can you post the error you're seeing where Helm says the exporter can't be installed? I don't see it in your previous posts.
Hi,
DCP metrics (DCGM_FI_PROF_*) are not supported on 2080Ti cards. You need to provide a CSV configuration file without such metrics (they are present in the default CSV config file).
WBR,
Nik
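For anyone who wants to produce such a CSV, here is a minimal sketch of the filtering step. It assumes the metrics file has one metric per line with the DCGM field name as the first comma-separated column and `#` for comments. The function name `strip_dcp_metrics` and the sample rows are made up for illustration and are not part of dcgm-exporter.

```python
DCP_PREFIX = "DCGM_FI_PROF_"  # prefix shared by the DCP (profiling) metrics


def strip_dcp_metrics(csv_text: str) -> str:
    """Return csv_text without rows whose first column is a DCGM_FI_PROF_* field.

    Comment lines and blank lines are kept unchanged.
    """
    kept = []
    for line in csv_text.splitlines():
        first_column = line.split(",", 1)[0].strip()
        if first_column.startswith(DCP_PREFIX):
            continue  # drop DCP metrics, which the 2080Ti does not support
        kept.append(line)
    return "".join(line + "\n" for line in kept)


# Illustrative sample only; a real metrics file has many more rows.
sample = (
    "# Format: DCGM field, Prometheus metric type, help message\n"
    "DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).\n"
    "DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.\n"
    "DCGM_FI_PROF_PCIE_TX_BYTES, counter, Bytes transmitted over PCIe.\n"
)

print(strip_dcp_metrics(sample), end="")
```

The filtered text can then be saved as the custom CSV configuration file. How that file gets mounted into the pod and passed to the exporter depends on the chart version, so check the chart's values.yaml for that part.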
@nikkon-dev
Which GPUs support the DCP metrics?
Best regards.
Kaka