rivosinc/prometheus-slurm-exporter

Docker container

Closed this issue · 3 comments

Hello

I'm happy to see further development of a Slurm Prometheus exporter, since we use Slurm v23.x, which is not supported by https://github.com/vpenso/prometheus-slurm-exporter. I know it's at an early stage, but could you perhaps provide a list of supported Slurm versions somewhere? And secondly, a working Docker container, or just a Dockerfile? That would be the most ideal and convenient in my opinion. I'm happy to contribute at some point, but I won't have time before next year.

Howdy! I love the idea of everyone building and running their own Docker container. The reason that's on pause is that it's very platform dependent, based on the type of authentication users use with their Slurm cluster and the platform they plan to run on. The more time I spent on it, the less I thought it was worth it, because I haven't gotten any complaints about deployment. I figured `go install` was a simple enough route. I reckon when we add slurmrestd support, making a Docker container will be trivial anyway. That's why I had a musing about it but never dedicated too much time to it. If you have any ideas on how to containerize this with munge auth and Slurm, I'd love to hear them. Maybe it's easier than I think. Of course, contributions are very welcome.

In terms of supported Slurm versions, we only know that it works for 21.XX and 23.XX, and we recently fixed a bug for 22 as well (#26). Apart from that we aren't sure, as the --json output is iterated on quite often by SchedMD. But our CLI fallback probably works for everything above 18.XX, I think (that's when TRES last changed), though I'm not super confident about that. Once again, I was kind of just waiting for a complaint and a version number. I know that's not a satisfying answer, but I haven't had a chance to test it on a bunch of Slurm versions, or go through the changelog and see when the sinfo/squeue output formats have changed.

Thanks again!

Hi again. Alright. You could maybe do a Slurm version check in the code at startup and use separate implementations depending on it, but that's up to you. Regarding Docker, I'm willing to give bundling up a container a go at some point when I get to it. Right now I'm adapting the nvidia/deepops repo to set up Slurm, and it simply mounts the Slurm binaries and the munge key into the container directly from the host, see https://github.com/NVIDIA/deepops/blob/d248b658321eaf2d1adb8bc88fcb8408b48802e2/roles/prometheus-slurm-exporter/templates/docker.slurm-exporter.service.j2#L12. Similar to the "original" Slurm exporter, https://github.com/dholt/prometheus-slurm-exporter. To me a container is also ideal because tearing it down or uninstalling it doesn't leave anything behind on the host. And of course it's nice to have locked versions of all dependencies, too. Anyway, I'll gladly do a PR at some point if you don't get to it before me.
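The startup version check could be sketched roughly like this in Go. This is just an illustration, not the exporter's actual code: `parseSlurmVersion` is a hypothetical helper, and the `>= 21` cutoff for preferring --json is an assumption based on the versions discussed above.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// parseSlurmVersion extracts the major version number from output
// like "slurm 23.02.5", which is what `sinfo --version` prints.
func parseSlurmVersion(out string) (int, error) {
	fields := strings.Fields(strings.TrimSpace(out))
	if len(fields) < 2 {
		return 0, fmt.Errorf("unexpected version output: %q", out)
	}
	major := strings.SplitN(fields[1], ".", 2)[0]
	return strconv.Atoi(major)
}

func main() {
	out, err := exec.Command("sinfo", "--version").Output()
	if err != nil {
		// sinfo not on PATH: can't detect, fall back to CLI parsing
		fmt.Println("version detection failed, using CLI fallback:", err)
		return
	}
	major, err := parseSlurmVersion(string(out))
	if err != nil {
		fmt.Println("could not parse version, using CLI fallback:", err)
		return
	}
	// Assumed cutoff: --json output is only trusted on 21.XX and newer
	if major >= 21 {
		fmt.Println("detected slurm", major, "- using --json fetcher")
	} else {
		fmt.Println("detected slurm", major, "- using CLI fallback fetcher")
	}
}
```

This keeps the choice of implementation in one place at startup rather than branching on every scrape.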

I attempted a Docker container for this repository, and here are the main issues I ran into:

  • To run the Slurm commands (squeue), the Slurm user needs to have the same uid inside the container as on the host. A workaround is to do a configless deployment and run slurmd as a systemd service within the container, pointing it at your Slurm controller for the config.

  • Users on our clusters are defined through an LDAP service and through Google Cloud's OS Login service. Adding sssd to the container image and getting the launched container to use LDAP was fine, but for our cluster on GCP, getting OS Login into the container proved challenging. If these extra bits are not part of the deployed container, the users reported by Slurm commands within the container are unknown.

That being said, deploying without a container was far easier.
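For reference, the deepops-style bind-mount approach discussed above could look roughly like this as a compose file. Every path, the image name, the port, and the uid/gid are assumptions that would need to match your host; treat this as a sketch, not a tested deployment:

```yaml
# Hypothetical docker-compose sketch: bind-mount munge and the Slurm
# client binaries from the host, per the deepops approach.
services:
  slurm-exporter:
    image: prometheus-slurm-exporter:local   # assumed locally built image
    # Run as the host's slurm uid/gid so squeue/sinfo resolve correctly
    user: "981:981"
    volumes:
      # munge socket for authentication against the controller
      - /run/munge:/run/munge:ro
      # Slurm client binaries, libraries, and config mounted read-only
      - /usr/bin/squeue:/usr/bin/squeue:ro
      - /usr/bin/sinfo:/usr/bin/sinfo:ro
      - /usr/lib64/slurm:/usr/lib64/slurm:ro
      - /etc/slurm:/etc/slurm:ro
    ports:
      - "9092:9092"
```

Note this still hits the uid-matching and LDAP/OS Login issues listed above: the container only sees the users and uids it is given.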