Error occurred in CUDA_CALL: 35 after daemonset created.
TomDrake-BabbleLabs opened this issue · 5 comments
After creating the DaemonSet according to https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/ without error, I manually pulled a Docker container containing our GPU-centric application and tried running it to verify its ability to interact with the GPU. It failed with 'Error occurred in CUDA_CALL: 35'.
I'm not certain how to ensure that the AMD/NVIDIA driver and libraries are installed. Please advise.
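(For reference, the guide linked above exposes GPUs through the nvidia.com/gpu resource; a minimal sketch of a throwaway test pod that requests one GPU, with a placeholder image and names chosen only for illustration, would look like this:)

# Sketch: apply a one-off pod that requests a single GPU (image name is a placeholder)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                        # hypothetical name, for illustration only
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app
    image: example.com/gpu-app:latest   # placeholder for the GPU-centric application image
    resources:
      limits:
        nvidia.com/gpu: 1               # ask the scheduler for one GPU
EOF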
I found the following running on the cluster nodes:
docker ps | grep -i nvid
9313f94c7a91 c6bf69abba08 "/usr/bin/nvidia-g..." 17 hours ago Up 17 hours
k8s_nvidia-gpu-device-plugin_nvidia-gpu-device-plugin-j9wcp_kube-system_e8085fbd-96e1-11e9-9117-42010a8a0074_1
a6ff9582a414 k8s.gcr.io/pause:3.1 "/pause" 17 hours ago Up 17 hours
k8s_POD_nvidia-gpu-device-plugin-j9wcp_kube-system_e8085fbd-96e1-11e9-9117-42010a8a0074_1
122ef629bc2d 2b58359142b0 "/pause" 17 hours ago Up 17 hours
k8s_pause_nvidia-driver-installer-7k7r8_kube-system_e814bf5a-96e1-11e9-9117-42010a8a0074_1
e80b741c1d72 a8fd6d7f4414 "nvidia-device-plugin" 17 hours ago Up 17 hours
k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-f4mxg_kube-system_e8010024-96e1-11e9-9117-42010a8a0074_1
53e3b403ef15 k8s.gcr.io/pause:3.1 "/pause" 17 hours ago Up 17 hours
k8s_POD_nvidia-device-plugin-daemonset-f4mxg_kube-system_e8010024-96e1-11e9-9117-42010a8a0074_1
ce99e7e6536e k8s.gcr.io/pause:3.1 "/pause" 17 hours ago Up 17 hours
k8s_POD_nvidia-driver-installer-7k7r8_kube-system_e814bf5a-96e1-11e9-9117-42010a8a0074_1
@TomDrake-BabbleLabs You can check the presence of the GPU on the node using kubectl describe nodes. Within the output, under Allocatable/Capacity, there should be a field for nvidia.com/gpu.
You can also SSH into your node and check for the presence of the devices under /dev.
The NVIDIA libraries should be under /usr/local/nvidia from within your container if you are using the GPU device plugin, which it looks like you are. You can also check the output of ldconfig -p for specific libraries. A few example commands are sketched below.
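Putting those checks together, something like the following should show what the node and the container can see (the grep patterns are only illustrative):

kubectl describe nodes | grep nvidia.com/gpu   # Capacity/Allocatable entries on the node
ls /dev/nvidia*                                # on the node: device files created by the driver
ls /usr/local/nvidia                           # inside the container: libraries mounted by the device plugin
ldconfig -p | grep -iE 'nvidia|cuda'           # inside the container: libraries visible to the loader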
I've had this exact same issue and have been fiddling with it for some time. What I found is that there's a DaemonSet that installs the latest driver, for COS at least. For me, that was the solution: install the DaemonSet at https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml .
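Installing that DaemonSet is a single command, assuming kubectl is pointed at the affected cluster:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml

Once the driver-installer pods finish on each node, GPU pods should be able to find the CUDA libraries.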
Mateus, thank you for your reply. My pod is now able to access the GPU.
Glad it helped!
The answer provided by minterciso solved the problem.