Azure/azhpc-images

CentOS-HPC 7.9 CUDA broken for NC24 instances


Hi, we have a question about the best approach to get a working CUDA installation for the NC24 instances with the CentOS-HPC 7.9 image.

The NC24 (v1) instances come with 4 Tesla K80 GPUs (sm_37).

Contrary to NVIDIA's own documentation, which states that the 470 driver branch still supports sm_37, the GPUs are not recognized on the CentOS-HPC 7.9 image with the 470.82.01 driver (image version 20220112):

$ nvidia-smi
No devices were found

A colleague of ours successfully installed CUDA 11.2 and the 460-series driver on the CentOS-HPC 7.6 image (which still works with the K80s). However, we noticed that the CentOS-HPC 7.6 image no longer receives updates, so we would like to avoid moving our entire cluster to it just for GPU support on the NC24 instances.

We tried downgrading the CUDA version on the CentOS-HPC 7.9 image to 11.2.2 with driver 460.32.03, but it was a little tricky since there are a number of NVIDIA components and some yum excludes involved (relevant parts of the image setup: [1,2]).
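For reference, this is roughly the shape of what we tried (a sketch only: the runfile name/URL is assumed from NVIDIA's usual pattern for 11.2.2 and would need verifying, and the image's yum excludes for the NVIDIA packages may need adjusting first):

# Remove the pre-installed 470 driver (nvidia-uninstall ships with runfile-installed drivers)
$ sudo nvidia-uninstall

# Install CUDA 11.2.2 with its bundled 460.32.03 driver
$ wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
$ sudo sh cuda_11.2.2_460.32.03_linux.run --silent --driver --toolkit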

We really need the NC24 v1 instances, and were wondering what you would suggest as the best way forward.

For example, one thing that would make our lives easier would be a CentOS-HPC 7.9 image released without CUDA, on which we could install a version that works with the K80s.

Of course it would be even better if the CentOS-HPC 7.9 image supported them out of the box. I'm not a CUDA expert and I don't understand this apparent discrepancy with NVIDIA's documentation; I also read here that the K80 should be supported up to 470.103.01.
Perhaps something else about the current CentOS-HPC 7.9 image prevents the GPUs from being seen by nvidia-smi?
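In case it helps narrow things down, a few generic checks (nothing image-specific assumed) that distinguish a missing or failed driver from the GPUs not being visible at all:

$ lspci | grep -i nvidia              # the K80s should appear on the PCI bus
$ lsmod | grep nvidia                 # is the nvidia kernel module loaded?
$ dmesg | grep -i nvidia              # driver initialization errors (e.g. unsupported GPU)
$ cat /proc/driver/nvidia/version     # version of the loaded driver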

[1] https://github.com/Azure/azhpc-images/blob/7a9c492621e081f9c3fa36b3f35c0a6ffffced52/common/install_nvidiagpudriver.sh
[2] https://github.com/Azure/azhpc-images/blob/7a9c492621e081f9c3fa36b3f35c0a6ffffced52/centos/centos-7.x/common/install_nvidiagpudriver.sh

cc @matt-chan

One issue we encountered when trying to uninstall CUDA was this:

# /usr/local/cuda/bin/cuda-uninstaller
ERROR: Uninstall manifests not found at expected location: /var/log/nvidia/.uninstallManifests/
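Since the uninstall manifests are missing, one possible workaround (a sketch, assuming the components were installed as RPMs; review the package list before removing anything) is to remove the packages directly:

# See which NVIDIA/CUDA packages the image actually installed
$ rpm -qa | grep -iE 'nvidia|cuda'

# Remove them via yum (wildcards are allowed; exact names vary by image version)
$ sudo yum remove 'cuda*' '*nvidia*'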

Got it working on the 7.9 image. It came down to Slurm being a bit heavy-handed.

If you don't specify the GPU count when scheduling your job, the only way to get access to the GPUs is as root.

If you request the GPUs in the job as well, then Slurm will give you access to them. Just make sure to use the --gpus option.
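A minimal illustration (assuming a standard GRES/GPU setup; the "No devices" output is the behaviour described above):

# Without a GPU request, Slurm's device cgroups hide the GPUs from the job
$ srun --ntasks=1 nvidia-smi
No devices were found

# Requesting the GPUs explicitly grants access to all four K80s
$ srun --ntasks=1 --gpus=4 nvidia-smi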

Thanks a lot @matt-chan for looking into this, and sorry for the false alarm (SelectType=select/cons_tres was new to me)!
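For anyone else who lands here, a minimal sketch of the Slurm configuration pieces involved (node names and counts are placeholders, not an actual cluster config):

# slurm.conf (relevant lines only)
SelectType=select/cons_tres
GresTypes=gpu
NodeName=nc24-[1-2] Gres=gpu:4 CPUs=24 State=UNKNOWN

# gres.conf on the NC24 nodes (four K80s)
Name=gpu File=/dev/nvidia[0-3]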