Error occurred in CUDA_CALL: 35 after daemonset created.
TomDrake-BabbleLabs opened this issue · 5 comments
After creating the DaemonSet according to https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/ without error, I manually pulled a Docker container containing our GPU-centric application and tried running it to verify its ability to interact with the GPU. It failed with 'Error occurred in CUDA_CALL: 35'.
I'm not certain how to ensure that the AMD/NVIDIA driver and libraries are installed. Please advise.
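(For reference, the guide linked above exposes GPUs through the nvidia.com/gpu resource; a minimal sketch of a throwaway test pod that requests one GPU, with a placeholder image and names chosen only for illustration, would look like this:)

# Sketch: apply a one-off pod that requests a single GPU (image name is a placeholder)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                        # hypothetical name, for illustration only
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app
    image: example.com/gpu-app:latest   # placeholder for the GPU-centric application image
    resources:
      limits:
        nvidia.com/gpu: 1               # ask the scheduler for one GPU
EOF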
I found the following running on the cluster nodes:
docker ps | grep -i nvid
9313f94c7a91 c6bf69abba08 "/usr/bin/nvidia-g..." 17 hours ago Up 17 hours
k8s_nvidia-gpu-device-plugin_nvidia-gpu-device-plugin-j9wcp_kube-system_e8085fbd-96e1-11e9-9117-42010a8a0074_1
a6ff9582a414 k8s.gcr.io/pause:3.1 "/pause" 17 hours ago Up 17 hours
k8s_POD_nvidia-gpu-device-plugin-j9wcp_kube-system_e8085fbd-96e1-11e9-9117-42010a8a0074_1
122ef629bc2d 2b58359142b0 "/pause" 17 hours ago Up 17 hours
k8s_pause_nvidia-driver-installer-7k7r8_kube-system_e814bf5a-96e1-11e9-9117-42010a8a0074_1
e80b741c1d72 a8fd6d7f4414 "nvidia-device-plugin" 17 hours ago Up 17 hours
k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-f4mxg_kube-system_e8010024-96e1-11e9-9117-42010a8a0074_1
53e3b403ef15 k8s.gcr.io/pause:3.1 "/pause" 17 hours ago Up 17 hours
k8s_POD_nvidia-device-plugin-daemonset-f4mxg_kube-system_e8010024-96e1-11e9-9117-42010a8a0074_1
ce99e7e6536e k8s.gcr.io/pause:3.1 "/pause" 17 hours ago Up 17 hours
k8s_POD_nvidia-driver-installer-7k7r8_kube-system_e814bf5a-96e1-11e9-9117-42010a8a0074_1
@TomDrake-BabbleLabs You can check the presence of the GPU on the node using kubectl describe nodes. Within the output, under Allocatable/Capacity, there should be a field for nvidia.com/gpu.
You can also SSH into your node and check for the presence of the devices under /dev.
The NVIDIA libraries should be under /usr/local/nvidia from within your container if you are using the GPU device plugin, which it looks like you are. You can also check the output of ldconfig -p for specific libraries. A few example commands are sketched below.
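Putting those checks together, something like the following should show what the node and the container can see (the grep patterns are only illustrative):

kubectl describe nodes | grep nvidia.com/gpu   # Capacity/Allocatable entries on the node
ls /dev/nvidia*                                # on the node: device files created by the driver
ls /usr/local/nvidia                           # inside the container: libraries mounted by the device plugin
ldconfig -p | grep -iE 'nvidia|cuda'           # inside the container: libraries visible to the loader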
I've had this exact same issue and have been fiddling with it for some time. What I found is that there's a DaemonSet that installs the latest driver, for COS at least. For me, that was the solution: install the DaemonSet at https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml .
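Installing that DaemonSet is a single command, assuming kubectl is pointed at the affected cluster:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml

Once the driver-installer pods finish on each node, GPU pods should be able to find the CUDA libraries.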
Mateus, thank you for your reply. My pod is now able to access the GPU.
Glad it helped!
The answer provided by minterciso solved the problem.