openhpc/ohpc

Latest slurm-ohpc-22.05.10-2.1.ohpc.2.6.2.x86_64 lacks GPU NVML library

tuxwielder opened this issue · 5 comments

Looks like the gpu_nvml support has been lost in the recent build:

Working system:
[ ~]# rpm -ql slurm-ohpc-22.05.2-14.103.ohpc.2.6.x86_64 | grep gpu
/usr/lib64/slurm/acct_gather_energy_gpu.so
/usr/lib64/slurm/gpu_generic.so
/usr/lib64/slurm/gpu_nvml.so
/usr/lib64/slurm/gres_gpu.so

Failing system:
[ ~]# rpm -ql slurm-ohpc-22.05.10-2.1.ohpc.2.6.2.x86_64 | grep gpu
/usr/lib64/slurm/acct_gather_energy_gpu.so
/usr/lib64/slurm/gpu_generic.so
/usr/lib64/slurm/gres_gpu.so

Obviously this breaks clusters that depend on NVML detection of their GPU GRES.
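The comparison above can be turned into a quick check. This is a minimal sketch, not an official OpenHPC tool: `has_gpu_nvml` is a hypothetical helper name, and it only looks for a path ending in `/gpu_nvml.so` in a file list read from stdin, as in the two listings above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or OpenHPC): reads a package file
# list on stdin and succeeds only if it contains the gpu_nvml.so plugin.
has_gpu_nvml() {
    grep -q '/gpu_nvml\.so$'
}

# Example with the two listings from this report, inlined as text so the
# sketch runs without rpm being installed.
working='/usr/lib64/slurm/acct_gather_energy_gpu.so
/usr/lib64/slurm/gpu_generic.so
/usr/lib64/slurm/gpu_nvml.so
/usr/lib64/slurm/gres_gpu.so'

failing='/usr/lib64/slurm/acct_gather_energy_gpu.so
/usr/lib64/slurm/gpu_generic.so
/usr/lib64/slurm/gres_gpu.so'

if printf '%s\n' "$working" | has_gpu_nvml; then
    echo "working: gpu_nvml.so present"
fi
if ! printf '%s\n' "$failing" | has_gpu_nvml; then
    echo "failing: gpu_nvml.so missing"
fi
```

On a real node the same helper could be fed from `rpm -ql <package> | has_gpu_nvml`, using whichever package name is actually installed.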

Thanks for the report, but the package slurm-ohpc-22.05.2-14.103.ohpc.2.6.x86_64 does not seem to be distributed by OpenHPC. The closest OpenHPC package would be slurm-ohpc-22.05.2-14.1.ohpc.2.6.x86_64. Your package says 103; ours says 1.

I am pretty sure we never distributed gpu_nvml.so, because that requires NVML to be present on the build system and we do not have any CUDA-related packages installed.

I am sorry, but I do not think this is OpenHPC related.

Apologies, you are right of course. We rolled our own package because of the missing pmix support, and our compile environment allows for enabling NVML support as well.

It would be nice to have some mechanism to add these to an installation based on the OHPC RPM, though. For the pmix library this seems pretty simple (just copy the library to /usr/lib64/slurm). However, gpu_nvml depends on being called from libslurmfull.so, which is of course already present in the OHPC RPM, but without gpu_nvml awareness.

If you have any ideas on how to make it easier to add NVML support, please open a pull request with the necessary changes. We are happy to take any improvement.

@tuxwielder can this be closed?