Latest slurm-ohpc-22.05.10-2.1.ohpc.2.6.2.x86_64 lacks GPU NVML library
tuxwielder opened this issue · 5 comments
Looks like the gpu_nvml support has been lost in the recent build:
Working system:
[ ~]# rpm -ql slurm-ohpc-22.05.2-14.103.ohpc.2.6.x86_64 | grep gpu
/usr/lib64/slurm/acct_gather_energy_gpu.so
/usr/lib64/slurm/gpu_generic.so
/usr/lib64/slurm/gpu_nvml.so
/usr/lib64/slurm/gres_gpu.so
Failing system:
[ ~]# rpm -ql slurm-ohpc-22.05.10-2.1.ohpc.2.6.2.x86_64 | grep gpu
/usr/lib64/slurm/acct_gather_energy_gpu.so
/usr/lib64/slurm/gpu_generic.so
/usr/lib64/slurm/gres_gpu.so
Obviously this breaks clusters that depend on NVML detection of their GPU GRES.
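The comparison above can be scripted so that a missing plugin is caught before a node is put into service. This is a minimal sketch, not an OpenHPC or Slurm tool: the function name `missing_gpu_plugins` is hypothetical, and the plugin list and the default directory `/usr/lib64/slurm` are simply taken from the two listings above.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every GPU-related plugin that is
# absent from a Slurm plugin directory. The file names and the default
# directory are assumptions copied from the rpm -ql listings above.
missing_gpu_plugins() {
    plugin_dir="${1:-/usr/lib64/slurm}"
    for plugin in acct_gather_energy_gpu.so gpu_generic.so gpu_nvml.so gres_gpu.so; do
        # Print the plugin name only when the file does not exist.
        [ -e "$plugin_dir/$plugin" ] || echo "$plugin"
    done
}
```

Calling `missing_gpu_plugins` with no argument checks `/usr/lib64/slurm`; on a system like the failing one above it would be expected to print `gpu_nvml.so`, and to print nothing when all four plugins are installed.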
Thanks for the report, but the package slurm-ohpc-22.05.2-14.103.ohpc.2.6.x86_64 does not seem to be distributed by OpenHPC. The closest OpenHPC package would be slurm-ohpc-22.05.2-14.1.ohpc.2.6.x86_64. Your package says 103, ours says 1.
I am pretty sure we never distributed gpu_nvml.so because that requires NVML to be present on the build system, and we do not have any CUDA-related packages installed.
I am sorry, but I do not think this is OpenHPC related.
Apologies, you are right of course. We rolled our own package because of the lacking pmix support and our compile environment allows for enabling nvml-support as well.
It would be nice to have some mechanism to add these to an installation based on the OHPC RPM, though. For the pmix library this seems pretty simple (just copy the library to /usr/lib64/slurm); however, gpu_nvml depends on being called from libslurmfull.so, which of course is already present in the OHPC RPM but was built without gpu_nvml awareness.
If you have any ideas on how to make it easier to add NVML support, please open a pull request with the necessary changes. We are happy to take any improvement.
@tuxwielder can this be closed?