CLIP-HPC/SlurmCommander

Slurm Commander using slurm 23.02.0-0rc1

shawnhoopes opened this issue · 6 comments

Using the latest RC build of Slurm 23.02
scom not working

Thanks!

ERROR: sacct JSON failed to parse, note your slurm version and open an issue with us here: https://github.com/CLIP-HPC/SlurmCommander/issues/new/choose

Hey,
thanks for the feedback.
Haven't got around to testing sc with slurm 23, now i see there is some polishing to be done 😄
Will let you know as soon as i have a fix.

A slight correction from my side on the "some polishing" estimate i gave yesterday is in order.
I've just spun up a 23.02.0 release cluster and the amount of changes SchedMD introduced in their openapi schemas is absolutely staggering.
Right now i'm typing this in a minor state of shock 😆
Might be that i'm overestimating and i'll pull myself to a more rational state after the initial "surprise" subsides.

First impression:
I'm having problems seeing how to reconcile the existing code (slurm 21,22 compatible) with the new 23 one and make them play together. It is of course possible, but it makes sense only if it's achievable with a reasonable amount of code refactoring.
In any case, I'll definitely need some time to understand all the changes SchedMD introduced, and see if there is a way to make scom support legacy and 23.x slurm.

Conclusion: support for slurm 23.x might take a while, if it ever happens.

Will keep this issue open and report any progress i might have here.

If anyone reading this issue is interested in helping out, by code or advice, please do not refrain from doing so 😃

Right now i'm typing this in a minor state of shock
Might be that i'm overestimating and i'll pull myself to a more rational state after the initial "surprise" subsides.

Took a quick look at the way scom queries the data from Slurm after @shawnhoopes pointed out this issue. Slurm's commands always dumps the latest format (i.e. v0.0.40 in Slurm-23.11, v0.0.39 in Slurm-23.02) without built-in backward compatibility. Slurm's slurmrestd does not have this limitation and provides versioned queries to avoid changes in the response from breaking existing clients. slurmrestd maintains prior content plugins for at least 2 releases to give developers of clients such as this more time to port (if needed). I fully understand the code already exists, but I suggest looking into this for future releases. If requiring a standing service is an issue, slurmrestd supports taking queries directly via stdio without running a daemon.

As for porting to the new 23.02 formatting. I noticed that sinfo is called and would have the most drastic changes in content. As part of a project to condense and implement built-in filtering, the sinfo response is now very different and now more closely matches formatted output that can be queried via the non-json output. That prior content was not lost, though, it was moved to scontrol show nodes --json. Due to limitations of several OpenAPI client generators, many numeric outputs were changed to be a dictionary with infinite, number, and set as fields with the number field having the same value as prior formatting. While Slurm can handle overloaded JSON fields internally, it was determined while not being ideal, this was the best way to avoid fatal issues in generated OpenAPI clients. This project's parsers could likely benefit from checking the meta data included in the JSON output to ensure the correct parsers are being used internally to allow a single release of scom to support multiple Slurm releases.

I suggest against grabbing the openapi.json from inside of the Slurm source directly but instead use slurmrestd to generate this with the wanted content plugins via -s v0.0.39,dbv0.0.39. This will avoid having prior versioned paths, the need to correct the URL paths to include the server path, and will work with v0.0.39+ content versions. Slurm uses a new model of generating OpenAPI schemas in v0.0.39+ instead of hand maintaining openapi.json files which has been unnecessarily error-prone.

Hopefully, this helps with your porting efforts as it would be a shame to have this project get abandoned. I also suggest taking a look at the new outputs now provided in Slurm-23.02 as more information tabs may be possible.

Gentlemen,

Thank you very much for your support and this effort to help.
You've provided me with concise, invaluable answers to a lot the questions that would take me some time to unravel, if at all.

I fully understand the code already exists, but I suggest looking into this for future releases. If requiring a standing service is an issue, slurmrestd supports taking queries directly via stdio without running a daemon.

The only rationale behind using commands instead of slurmrestd was the reach of scom.
Logic was: "Every slurm user has them, but not everyone has slurmrestd set up on their site."
As much as i would love to do the full switch to rest (it would make my life infinitely easier), i'm still a bit hesitant to cut cmds as i feel it might cut quite some users out. Ideally, i'd like to support both, keeping cmd as a fallback for users whose site admins do not provide restd.

Does this make sense from your perspective? I'd love to hear your opinion on this.

As for porting to the new 23.02 formatting. I noticed that sinfo is called and would have the most drastic changes in content.

You're right about sinfo, that was unfortunately the first command that i took to try out, and what left me bewildered :)

That prior content was not lost, though, it was moved to scontrol show nodes --json

Could you elaborate a bit on the relation between rest endpoints and commands like scontrol example above? Is there a 1:1 mapping between them?

I suggest against grabbing the openapi.json from inside of the Slurm source directly but instead use slurmrestd to generate this with the wanted content plugins via -s v0.0.39,dbv0.0.39.

Tried this now with 23, will continue to do so.

This project's parsers could likely benefit from checking the meta data included in the JSON output to ensure the correct parsers are being used internally to allow a single release of scom to support multiple Slurm releases.

Also the rest of your technical clarifications and suggestions were brilliant, i'll try to align the development accordingly.

Summary:

I've started work on porting, bellow is a more technical description of some ideas and problems i'm struggling to achieve, probably will hit some roadblocks here and there and in that case i'll take the liberty to ask for more assistance if that is ok with you?


Problems and ideas:

Problem 1: fetching data: cmd vs rest

Try:

  1. "hide" data fetching behind an interface, so both cmd and rest fetching routines exist
  • users/admins will be able to specify rest endpoint and token in the scom.conf and if that is not found, scom fallbacks to cmd
  • +automatic regen of token

Problem 2: differences between versions (e.g. 21 vs 23)

  • Example: clustertab table shows two columns, from fields: State and StateFlags, 23+ there is only one coming in.

Still unclear how to best handle:
Checking meta data from JSON gives me the version of struct to use when unmarshalling, but i'm still left with the table code which needs to account for the possibility that some fields are missing or changed the type (i remember seeing in some older version an int field becoming string)
(this is in cmd case, for rest case i think i might get by with generated code)

Try?:

  1. slurm/json version detection
  2. Versioning of openapi.json and generated code will be split into packages according to versions (generated by querying slurmrestd)
  3. Table code?... switch version{ case X,Y....}? or manually add methods to each package for fetching certain fields?
timeu commented

Some additional information regarding REST because we are currently implementing this.

1.) SLURM Rest can be also started in so called Inet mode by regular users (https://slurm.schedmd.com/rest.html#inet). So maybe we can start a slurmrestd process (which is simply a HTTP wrapper/webserver) when we start scom. Additionally it can work with local authentication if it's in the same munge perimeter (see slide 9 here: https://slurm.schedmd.com/PEARC20/REST_API.pdf). SLURM Rest (AFAIK) is supported from 20.x on. The only caveat is that SLURM Rest has to be explicitly enabled during build/compile time. So some of the sites might not have it :-(

2.) One additional benefit of having SLURM Rest is that SchedMD recommends to set up a Caching proxy and I guess most sites will such a setup. We have setup an nginx caching proxy that does TLS + caching of the REST calls (5 minutes cache time and url + SLURM user http header is used as caching key). This will help not to overload the slurmctld/slurmdbd and might be good enough to avoid having to develop a scom caching service/process

I fully understand the code already exists, but I suggest looking into this for future releases. If requiring a standing service is an issue, slurmrestd supports taking queries directly via stdio without running a daemon.

The only rationale behind using commands instead of slurmrestd was the reach of scom. Logic was: "Every slurm user has them, but not everyone has slurmrestd set up on their site." As much as i would love to do the full switch to rest (it would make my life infinitely easier), i'm still a bit hesitant to cut cmds as i feel it might cut quite some users out. Ideally, i'd like to support both, keeping cmd as a fallback for users whose site admins do not provide restd.
Does this make sense from your perspective? I'd love to hear your opinion on this.

Each site has the choice as to which binaries they provide. It is up to each project to decide prerequisites and what levels of compatibility they will provide.

Could you elaborate a bit on the relation between rest endpoints and commands like scontrol example above? Is there a 1:1 mapping between them?

In Slurm-22.05, there was a one-to-one mapping for the commands that got implemented. They actually called the same plugin code. In Slurm-23.02, we did a major redesign and split out the code that did the conversion from Slurm's internal data structures to JSON/YAML. This allows us to add the conversion to JSON all over the place without needing to actually touch any of the openapi/slurmrestd code directly along with several performance improvements. The resultant JSON for each internal structure will be the exactly same but the wrapping around it has changed. Right now, the only changes visible would be in the meta response which now provide info about the CLI command called instead of the OpenAPI plugin information.

I've started work on porting, bellow is a more technical description of some ideas and problems i'm struggling to achieve, probably will hit some roadblocks here and there and in that case i'll take the liberty to ask for more assistance if that is ok with you?
I can provide some input on this but others are design choices for any given project.

  1. "hide" data fetching behind an interface, so both cmd and rest fetching routines exist
  • users/admins will be able to specify rest endpoint and token in the scom.conf and if that is not found, scom fallbacks to cmd
  • +automatic regen of token

I think this is a design choice for your project of how much compatibility is important vs the complexity of the implementation.

Problem 2: differences between versions (e.g. 21 vs 23)

  • Example: clustertab table shows two columns, from fields: State and StateFlags, 23+ there is only one coming in.

Still unclear how to best handle: Checking meta data from JSON gives me the version of struct to use when unmarshalling, but i'm still left with the table code which needs to account for the possibility that some fields are missing or changed the type (i remember seeing in some older version an int field becoming string) (this is in cmd case, for rest case i think i might get by with generated code)

Try?:

  1. slurm/json version detection
  2. Versioning of openapi.json and generated code will be split into packages according to versions (generated by querying slurmrestd)
  3. Table code?... switch version{ case X,Y....}? or manually add methods to each package for fetching certain fields?

It might be easier to use some kind of templated code to handle the conversion of these items but I'm not familiar enough with GO to give suggestions. The provided examples of state flags will not have as easy of conversion because there are actually a set of new flags that are available in 23.02.

1.) SLURM Rest can be also started in so called Inet mode by regular users (https://slurm.schedmd.com/rest.html#inet). So maybe we can start a slurmrestd process (which is simply a HTTP wrapper/webserver) when we start scom. Additionally it can work with local authentication if it's in the same munge perimeter (see slide 9 here: https://slurm.schedmd.com/PEARC20/REST_API.pdf). SLURM Rest (AFAIK) is supported from 20.x on. The only caveat is that SLURM Rest has to be explicitly enabled during build/compile time. So some of the sites might not have it :-(

I would suggest against scom starting a slurmrestd daemon as that could inadvertently cause a security issue without extra steps to ensure the sockets are only visible to scom only. Using inet mode naturally avoids any of those issues:
echo -e 'GET http://localhost/slurm/v0.0.39/jobs HTTP/1.1\r\n'|slurmrestd

2.) One additional benefit of having SLURM Rest is that SchedMD recommends to set up a Caching proxy and I guess most sites will such a setup. We have setup an nginx caching proxy that does TLS + caching of the REST calls (5 minutes cache time and url + SLURM user http header is used as caching key). This will help not to overload the slurmctld/slurmdbd and might be good enough to avoid having to develop a scom caching service/process

Caching proxies are really effective to reduce the load on controllers and adding TLS when clients may be outside of the tightly controlled cluster network. The newly added rate limiting the controllers can also help with this: https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable