PySlurm/pyslurm

slurm_job_batch_script can't return string

Autumn-Roy opened this issue · 12 comments

Details

  • Slurm Version:
  • 21.08.4
  • Python Version:
  • Python 3.6.8
  • Cython Version:
  • Cython (0.29.30)
  • PySlurm Branch:
  • pyslurm (21.8.4.0)
  • Linux Distribution:
  • Linux version 3.10.0-1062.el7.x86_64

Issue

When I try to get the batch_script by job().slurm_job_batch_script(jobid), it only returns 0 or -1, and print the scripts one the screen. But in the doc, a str will return here.
image
So, any problem on my usage? Or, can I get the script by other way?
image

Hi @Autumn-Roy

in the current implementation, indeed a slurm function is used that simply outputs the script to stdout and just returns the error-code of that function (0 or -1). Of course, the script should actually just be returned as a string - which isn't possible right now.

Whenever I find the time, I can go ahead and implement it though.

tazend commented

Hi @Autumn-Roy

I implemented the functionality to get the script as a string.
Could you test it out on your side with this branch?
(The branch is for slurm 21.08 - in case your slurm version upgraded to 22.05 in the meantime, you can take this branch instead)

Hi, @tazend
Thanks for upgrading the pyslurm. I will test the branch on my side. When I finish, the result will reply to you. Thank you again.

@tazend
I tried to get the script as a string from new branch. But it occurs an error:
Traceback (most recent call last): File "pyslurm_test.py", line 51, in <module> a = pyslurm.job().slurm_job_batch_script(job_id) File "pyslurm/pyslurm.pyx", line 2353, in pyslurm.pyslurm.job.slurm_job_batch_script ValueError: Error contacting slurmctld while fetching batch script.

tazend commented

Hi @Autumn-Roy

mh, that is weird - I can unfortunately not replicate this error on my side.
Seems like a communication error with the slurmctld. Do you perhaps have access to the logs of the slurmctld and can check whether it reports any errors?
Does it work if you execute this?:

scontrol write batch_script <jobid> -

I have pushed a change to this branch, so you should now see the exact error message and error code after rebuilding pyslurm with this.

Hi @tazend

I can‘t get the logs of slurmctld. Then I try to use scontrol command, and it works. The detail error code was pasted here:
image

tazend commented

Hi @Autumn-Roy

I tested it again on another cluster - but wasn't able to replicate the error you get.
I can only suspect that there might be something broken in your pyslurm installation, since the code pyslurm and scontrol use to fetch the batch-script is basically the same. Do other pyslurm functions work? For example, does the following work?

import pyslurm
j = pyslurm.job().find_id(job_id)

If not yet tried, could you create a seperate python virtual environment to install pyslurm? Like this:

python3 -m venv pyslurm-env
source pyslurm-env/bin/activate

cd pyslurm-src
pip install -r test_requirements.yml
scripts/build.sh

... fetch script again ...

When it builds, it shows you the detected slurm version, and most importantly the directory to the slurm shared lib that pyslurm links to (for example /usr/lib64).
In contrast, which lib does the scontrol command point to? You can check by doing: ldd $(which scontrol).
(scontrol links to libslurmfull.so instead of libslurm.so like pyslurm does, so this difference is correct)

If pyslurm finds the lib in /usr/lib64 for example and scontrol shows /usr/lib64/slurm, then everything should be fine.

Hi @tazend

I tried the way you metioned. But the same error occured.
image
image

Temporarily, I use scontrol write batch_script <jobid> - in my script.

tazend commented

@Autumn-Roy

do other functions like:

import pyslurm
j = pyslurm.job().find_id(job_id)

work though?
Another thing: Is the slurm cluster you are using perhaps part of a federation? (https://slurm.schedmd.com/federation.html)
You should be able to check with scontrol show federation / sacctmgr show federation, and it may show multiple slurmctld addresses if in a federation.
If not in a federation, is there perhaps a multi-cluster setup? You can check with sacctmgr show clusters

@tazend

I check the slurm cluster, it's not in a federation and not a multi-cluster.
image

also, the other functions about pyslurm is okay:
image

Because you guess there maybe something broken in pyslurm installtion, so I repeat the installation and show below:

> git clone -b feature/retrieve-batch-script-21.08 https://github.com/tazend/pyslurm.git
> cd pyslurm/
> pip3 install .

I tried pip3 install -r test_requirements.txt. But the error Could not find a version that satisfies the requirement pytest==7.1.2 was occurred. So I use pip3 install . instead.

tazend commented

Hi @Autumn-Roy

I was now able to track down the cause of this error.
The problem is, that this code will only work with slurm versions starting from 21.08.8

The reason for this is that in 21.08.8 a critical CVE was fixed regarding credential abuse by a potential malicious actor.
Therefore, the slurm devs made some heavy structural changes to the code regarding authentication/communication with client and server.
The error stems from the fact that the code in pyslurm for fetching the script uses the updated struct definitions - thus any slurmctld below 21.08.8 rejects communication.

I have created this branch for now which reverts to the struct definitions pre-CVE release and it should work now for you.

For the main 21.08 branch however the original code which only works on releases post-cve will be backported. Ideally, your cluster administrators should update to the latest patch for 21.08 (to also fix the potential security issues)
Or you may need to manually comment out the offending struct members before building 21.08 after the code is merged (you can check here to see the two which I commented out.)

Hi @tazend
Thank you very much! The new branch works now.