pkgw/slurm-rs

ABI Instability

multimeric opened this issue · 4 comments

I was excited to find this library, but of course it's disappointing to hear that it's no longer maintained.

I'm just wondering if you could offer some examples of the ABI instability. I note that pyslurm (https://github.com/PySlurm/pyslurm) also uses the ABI, and they seem reasonably successful, although the whole ABI isn't provided to users. One suggestion I've seen for bindgen-generated libraries is heavy use of #[cfg] attributes to enable/disable certain struct fields based on the Slurm version. But it seems you already started doing that to some extent.
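For readers unfamiliar with the #[cfg] idea mentioned above, here is a minimal sketch. The struct, its fields, and the cfg name `slurm_has_extra_field` are all hypothetical and do not mirror any real Slurm layout; the assumption is that a build script would probe the installed Slurm version and emit the flag via `cargo:rustc-cfg=slurm_has_extra_field`.

```rust
/// Hypothetical C-compatible record; NOT a real Slurm struct.
#[repr(C)]
#[derive(Debug, Default)]
pub struct JobRecord {
    pub job_id: u32,
    /// Only present when building against a (hypothetical) newer Slurm.
    #[cfg(slurm_has_extra_field)]
    pub extra_field: u64,
    pub user_id: u32,
}

/// Variant compiled when the extra field exists.
#[cfg(slurm_has_extra_field)]
pub fn describe(job: &JobRecord) -> String {
    format!("job {} (extra={})", job.job_id, job.extra_field)
}

/// Variant compiled when the extra field does not exist.
#[cfg(not(slurm_has_extra_field))]
pub fn describe(job: &JobRecord) -> String {
    format!("job {}", job.job_id)
}
```

Note that the choice is made at compile time: the resulting binary matches exactly one layout, so it still has to be rebuilt when the library changes.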

pkgw commented

It's been a while since I've looked at this, so I don't quite remember what specific issues I ran into. But they were generally things of the nature of changing struct fields and/or function signatures.

I'm sure that careful bindgen technique could at least get one's code to compile against a range of Slurm versions, but as far as I could see one would still have to rebuild everything when Slurm is updated, which is quite regularly on my cluster. That would be a pretty major issue, in my view, although depending on one's workflow one might reasonably feel otherwise.

My current thinking is that it would be better to provide a Rust API that operates under the hood by executing the Slurm programs and parsing their output, rather than trying to rely on the shared libraries. That would certainly not be my preferred approach in an abstract sense, but I believe that:

  1. Pretty much everything I need to do can be accomplished by invoking the Slurm programs (that's sort of true by definition since they're the chief way people interface with the system)
  2. The Slurm programs seem to have relatively stable CLIs and outputs
  3. I haven't used them myself, but the Slurm programs generally seem to have "parseable" output modes

Given all that, if the Slurm devs aren't aiming for a stable library ABI, I think it makes more sense to interface with the system through the programs rather than through the library. It's lame but I think it would be the right choice.
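As a rough illustration of the "execute the programs and parse their output" approach, here is a minimal sketch. It assumes `squeue` is on the PATH and uses its `--noheader` and `--format` options with the `%i` (job ID), `%T` (state), and `%j` (name) specifiers; the `QueuedJob` type and both function names are made up for this example and are not part of this crate.

```rust
use std::io;
use std::process::Command;

/// One row of `squeue` output (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct QueuedJob {
    pub id: String, // kept as text: array jobs have IDs like "102_3"
    pub state: String,
    pub name: String,
}

/// Parse lines of the form `<id>|<state>|<name>`; malformed lines are skipped.
pub fn parse_squeue(text: &str) -> Vec<QueuedJob> {
    text.lines()
        .filter_map(|line| {
            // The name comes last so that any '|' inside it is preserved.
            let mut parts = line.trim().splitn(3, '|');
            Some(QueuedJob {
                id: parts.next()?.to_string(),
                state: parts.next()?.to_string(),
                name: parts.next()?.to_string(),
            })
        })
        .collect()
}

/// Run `squeue` as a child process and parse what it prints.
pub fn list_jobs() -> io::Result<Vec<QueuedJob>> {
    let output = Command::new("squeue")
        .args(["--noheader", "--format=%i|%T|%j"])
        .output()?;
    if !output.status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("squeue exited with {}", output.status),
        ));
    }
    Ok(parse_squeue(&String::from_utf8_lossy(&output.stdout)))
}
```

Keeping the parser separate from the process invocation means the parsing logic can be tested on captured output without a cluster.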

As it happens, I've started up on a new project at work that might end up with me resurrecting this crate and trying out this alternative approach. I've had a few ideas about large-scale computation management on clusters that I might want to try implementing and something like this crate would be the base layer for doing so.

> but as far as I could see one would still have to rebuild everything when Slurm is updated, which is quite regularly on my cluster. That would be a pretty major issue, in my view, although depending on one's workflow one might reasonably feel otherwise.

This is true. I haven't encountered a cluster where the Slurm version changes fast enough for this to be a concern, but I guess it means that you would have to re-compile all the software you built using slurm-rs when this happens.

> My current thinking is that it would be better to provide a Rust API that operates under the hood by executing the Slurm programs and parsing their output, rather than trying to rely on the shared libraries.

Probably true. I actually didn't know about --json and --yaml, but it may well be worth the performance hit in doing this in order to gain stability. Although won't it mean that you can't use fixed structs to handle the Slurm types, if you don't know the version at compile time?

Actually it seems that older versions of Slurm that have the --json flag don't support it very well. In particular, you can't apply any filters to the output. This seems to be fixed in the upcoming release, but very few users will be running that version: https://groups.google.com/g/slurm-users/c/nPMGuwH4N5o/m/U2ZonGgPBQAJ

pkgw commented

Hmm, yeah, that would be annoying. But overall my systems-engineering spidey-sense still tells me that I'd rather struggle with interfacing with the CLI tools (just parse the non-JSON output if you have to) than lock myself into dealing with an unstable ABI. I feel like a big advantage of Rust programs is that you can build/install them once and they'll keep on working for a long time, and my current best understanding is that if you link against a Slurm shared library you're pretty much guaranteed to break upon upgrade. Whereas the CLI stuff "should" be stable and forward-compatible, so once you get something working, it should stay working.
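One way to stay forward-compatible when parsing delimited text is to key each row by the header line rather than by column position. The sketch below assumes pipe-delimited output with a header row, such as what `sacct --parsable2` prints; the `Record` alias and `parse_table` function are hypothetical names for this example.

```rust
use std::collections::HashMap;

/// A loosely typed row: column name -> raw text value.
pub type Record = HashMap<String, String>;

/// Parse '|'-delimited text whose first non-empty line is a header.
/// Columns are looked up by name, so extra or reordered columns in a
/// newer tool version do not break callers that ignore them.
pub fn parse_table(text: &str) -> Vec<Record> {
    let mut lines = text.lines().filter(|l| !l.trim().is_empty());
    let header: Vec<&str> = match lines.next() {
        Some(h) => h.split('|').collect(),
        None => return Vec::new(),
    };
    lines
        .map(|line| {
            header
                .iter()
                .zip(line.split('|'))
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect()
        })
        .collect()
}
```

A caller asks for the fields it needs with `record.get("State")` and treats a missing key as "not reported by this version".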

> won't it mean that you can't use fixed structs to handle the Slurm types[?]

Yes, but I think that's better anyway. I think that a classical "builder" API, abstracting/hiding the details of the actual structs, is really what you want for this kind of functionality. It's been a while, but as I recall, my original idea was precisely to have this crate be the low-level interface providing direct access to the structs, and then to build that sort of builder API on top of it.
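To make the "builder" idea concrete, here is a small sketch of what such an API could look like on top of the CLI. The `JobBuilder` type and its methods are invented for this example; the only real names assumed are the `sbatch` program and its `--job-name`, `--time`, and `--ntasks` options.

```rust
use std::process::Command;

/// Hypothetical builder that hides how a submission is encoded.
#[derive(Debug, Default, Clone)]
pub struct JobBuilder {
    name: Option<String>,
    time_limit: Option<String>,
    ntasks: Option<u32>,
    script: String,
}

impl JobBuilder {
    pub fn new(script: impl Into<String>) -> Self {
        JobBuilder { script: script.into(), ..Default::default() }
    }

    pub fn name(mut self, name: impl Into<String>) -> Self {
        self.name = Some(name.into());
        self
    }

    pub fn time_limit(mut self, limit: impl Into<String>) -> Self {
        self.time_limit = Some(limit.into());
        self
    }

    pub fn ntasks(mut self, n: u32) -> Self {
        self.ntasks = Some(n);
        self
    }

    /// Render the options as `sbatch` arguments, script path last.
    pub fn args(&self) -> Vec<String> {
        let mut args = Vec::new();
        if let Some(name) = &self.name {
            args.push(format!("--job-name={}", name));
        }
        if let Some(limit) = &self.time_limit {
            args.push(format!("--time={}", limit));
        }
        if let Some(n) = self.ntasks {
            args.push(format!("--ntasks={}", n));
        }
        args.push(self.script.clone());
        args
    }

    /// Build (but do not run) the child-process invocation.
    pub fn command(&self) -> Command {
        let mut cmd = Command::new("sbatch");
        cmd.args(self.args());
        cmd
    }
}
```

Callers only see the builder methods, so the backend could later switch from CLI arguments to library structs without changing their code.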

As a random thought, one could adopt a hybrid approach and provide a standalone Rust "agent" program that did link with libslurm, and provided whatever low-level access was needed that couldn't be reliably gotten by driving the CLI tools, and a separate crate that drives that program. That way, when libslurm's ABI changes under you, you only need to recompile the agent, and not every single tool that relies on it.
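The client half of that hybrid idea might look roughly like the sketch below. Everything about the protocol is an assumption made for illustration: the agent is taken to read one request line on stdin and answer with `key=value` lines on stdout, and the function names and the request verb are hypothetical. The agent itself (the part that would link against libslurm) is not shown.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Encode a request as a single line (hypothetical wire format).
pub fn encode_request(verb: &str, arg: &str) -> String {
    format!("{} {}\n", verb, arg)
}

/// Parse `key=value` reply lines; lines without '=' are ignored.
pub fn parse_reply(text: &str) -> Vec<(String, String)> {
    text.lines()
        .filter_map(|line| line.split_once('='))
        .map(|(k, v)| (k.trim().to_string(), v.trim().to_string()))
        .collect()
}

/// Spawn the agent binary at `agent`, send one request, collect the reply.
pub fn ask_agent(agent: &str, verb: &str, arg: &str) -> io::Result<Vec<(String, String)>> {
    let mut child = Command::new(agent)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    // The temporary stdin handle is dropped at the end of this statement,
    // which closes the pipe so the agent sees end-of-input.
    child
        .stdin
        .take()
        .expect("stdin was configured as piped")
        .write_all(encode_request(verb, arg).as_bytes())?;
    let output = child.wait_with_output()?;
    Ok(parse_reply(&String::from_utf8_lossy(&output.stdout)))
}
```

Only the agent would need rebuilding after a libslurm change; tools built on `ask_agent` depend on the text protocol alone.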

(Implicit in all this: I have trouble seeing a use case where the performance impact of exec'ing a child vs. using the shared library directly would matter.)