NVIDIA/cloud-native-stack

GPU Driver Container Won't Start

BHSDuncan opened this issue · 4 comments

Essentially I'm seeing what's described in NVIDIA/gpu-operator#564 when I start up my machine running a cluster with an older version of CNS installed (currently something like 9.x).

...and because I'm using one of the playbooks from this repo, I'm not sure how to resolve this issue.

I'm also unsure why the issue is happening now. I've been running this on the same machine since last fall, and the issue linked above pre-dates that setup.

Will updating to the latest CNS version solve this issue? Or will it still be a problem, given that the install.sh and Dockerfile(s) look pretty much the same? (I'll probably try this on a test box anyway, but I wanted to ask here as well.)

Thank you.

@BHSDuncan I would recommend trying CNS 10.4 or CNS 11.1 with the cns_nvidia_driver: yes flag set in the cns_values_10.4.yaml or cns_values_11.1.yaml file, then triggering the installation. This installs the native TRD driver on the host, which works with the latest kernel.
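For reference, a rough sketch of what that change looks like; the playbooks directory layout and the setup.sh entrypoint are assumptions based on the usual structure of this repo, so adjust the paths for your checkout:

```
# Assumed layout: the CNS values files and setup.sh live under playbooks/
cd cloud-native-stack/playbooks

# Switch from the driver container to the native (TRD) driver on the host
# by setting the cns_nvidia_driver flag in the values file for your release.
sed -i 's/^cns_nvidia_driver:.*/cns_nvidia_driver: yes/' cns_values_11.1.yaml

# Trigger the installation
bash setup.sh install
```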

If you want the driver deployed as part of the GPU Operator, then I would recommend waiting to hear from the GPU Operator team.

But that will install a driver on the host itself, right? I'd prefer to avoid installing anything on the machine and keep the driver in the cluster. For that, you're saying I'll need to wait for the GPU Operator team? If so, they've made it known they're working on a fix. Once the fix is in place, will the CNS playbooks need updating?

Yeah, if you look at this comment: NVIDIA/gpu-operator#564 (comment)

The current Operator release fixes the issue with the latest kernel. We will validate it with CNS, and if any changes are required we will make them in CNS as well and let you know.

@BHSDuncan CNS has been updated with the new Operator version. Please try CNS version 11.3 and let us know.
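If it helps, this is the kind of quick check to run after moving to 11.3 to confirm the driver container comes up; the namespace and label below are what the GPU Operator typically uses and are assumptions that may differ in your deployment:

```
# After upgrading to CNS 11.3, confirm the GPU Operator driver pods start.
# Namespace and label values are assumptions; adjust for your deployment.
kubectl get pods -n nvidia-gpu-operator

# Tail the driver daemonset logs if a pod is stuck in Init or CrashLoopBackOff
kubectl logs -n nvidia-gpu-operator -l app=nvidia-driver-daemonset --tail=50
```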