Implement cluster sidecar
leoisl opened this issue · 3 comments
See the new Snakemake feature: snakemake/snakemake#1397
This is a more efficient way to query job statuses than `bjobs <jobid>`, but it is rather complex to implement. An implementation for Slurm clusters can be found here: https://github.com/holtgrewe/snakemake-profiles-slurm/blob/slurm-sidecar/%7B%7Bcookiecutter.profile_name%7D%7D/slurm-sidecar.py and can be used as a base.
I'm guessing the best way to implement this would be something like `bjobs -o 'jobid stat' -noheader -a`, which outputs the status of all of the user's jobs.
Example
2510890 RUN
2509904 RUN
2637332 RUN
2637541 EXIT
2637554 EXIT
2637542 DONE
2637537 DONE
2637527 DONE
2637539 DONE
This then just gets kept in a dict.
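A minimal sketch of turning that output into a dict, assuming every line is exactly `<jobid> <stat>` as in the example above. The function name `parse_bjobs_output` is made up for illustration, and lines that do not match the expected shape are simply skipped.

```python
from typing import Dict


def parse_bjobs_output(text: str) -> Dict[str, str]:
    """Map job ID -> LSF status from lines such as '2510890 RUN'."""
    statuses: Dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not '<numeric id> <status>'.
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        job_id, status = fields
        statuses[job_id] = status
    return statuses


sample = """\
2510890 RUN
2637541 EXIT
2637542 DONE
"""
statuses = parse_bjobs_output(sample)
print(statuses.get("2637541"))  # EXIT
print(statuses.get("0000000"))  # None: job not in the listing
```

A job ID missing from the dict is the case where we would have to look elsewhere (for example, a direct `bjobs <jobid>` call).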
The Slurm profile polls this every 60 seconds.
The only thing we have to test out is whether there is a line limit to this. For example, if I have 1500 jobs running, do I get all of them listed? I assume so, but will need to test this.
@leoisl can you think of a better way of doing this?
One thing that might slow this down is if a job has disappeared from the bjobs output and we have to go searching in the logs... I wonder if speaking with systems could be useful here?
Another option here is to `watch` the `bjobs` command and just change the interval from the default 2 seconds to something more reasonable, like 30-60 seconds.
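As an alternative to shelling out to `watch`, the same periodic refresh could be done inside the sidecar itself. Below is a rough sketch, not a tested implementation: `StatusPoller` and its `fetch` callable are hypothetical names, and in practice `fetch` would wrap the `bjobs` call and return a dict of job ID to status.

```python
import threading
from typing import Callable, Dict, Optional


class StatusPoller:
    """Refresh a job-status snapshot on a fixed interval in a background thread."""

    def __init__(self, fetch: Callable[[], Dict[str, str]], interval: float = 60.0):
        self._fetch = fetch          # hypothetical: would call bjobs and parse it
        self._interval = interval    # seconds between refreshes
        self._statuses: Dict[str, str] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def refresh(self) -> None:
        fresh = self._fetch()
        with self._lock:
            self._statuses = fresh   # swap in the whole snapshot at once

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self.refresh()
            except Exception:
                pass                 # keep serving the last good snapshot
            self._stop.wait(self._interval)  # sleeps, but wakes early on stop()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def get(self, job_id: str) -> Optional[str]:
        with self._lock:
            return self._statuses.get(job_id)


# Demo with a fake fetch function standing in for the real bjobs call.
poller = StatusPoller(lambda: {"2510890": "RUN"}, interval=30.0)
poller.refresh()
print(poller.get("2510890"))  # RUN
```

Waiting on the `Event` rather than calling `time.sleep` means the poller can be shut down promptly even with a 30-60 second interval.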
Looking at the Slurm profile status checker and the Snakemake docs, it looks like the sidecar needs to start some kind of server. The sidecar should output a single line (in the case of a REST server, this line could be the port the server is listening on and any credentials). This line is subsequently provided to the `--cluster-status` and `--cluster-cancel` commands. The key is checking whether the environment variable `SNAKEMAKE_CLUSTER_SIDECAR_VARS` is set and, if so, checking the status by polling the server. See https://github.com/Snakemake-Profiles/slurm/blob/8ee65d648e502beba406059e2a2d026110d38b9a/%7B%7Bcookiecutter.profile_name%7D%7D/slurm-status.py#L56-L71
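To make the flow concrete, here is a hedged sketch of both halves using only the standard library. Everything except the `SNAKEMAKE_CLUSTER_SIDECAR_VARS` variable name is an assumption for illustration: the JSON keys `port` and `secret`, the `/status/<jobid>` path, the bearer-token check, and the hard-coded `STATUSES` dict standing in for the real cache. It is not how the Slurm profile is written, just the same idea in miniature.

```python
import json
import os
import secrets
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Optional, Tuple

# Stand-in for the cache that the sidecar would keep up to date from bjobs.
STATUSES: Dict[str, str] = {"2510890": "RUN", "2637542": "DONE"}
ENV_VAR = "SNAKEMAKE_CLUSTER_SIDECAR_VARS"


def make_handler(secret: str):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Reject requests that do not carry the shared secret.
            if self.headers.get("Authorization") != f"Bearer {secret}":
                self.send_error(403)
                return
            job_id = self.path.rsplit("/", 1)[-1]
            body = json.dumps({"status": STATUSES.get(job_id)}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # stay quiet so only the single config line is printed

    return Handler


def start_sidecar() -> Tuple[HTTPServer, str]:
    """Start the server and return it with the one line the sidecar would print."""
    secret = secrets.token_hex(16)
    server = HTTPServer(("127.0.0.1", 0), make_handler(secret))  # port 0: OS picks
    threading.Thread(target=server.serve_forever, daemon=True).start()
    line = json.dumps({"port": server.server_port, "secret": secret})
    return server, line


def query_status(job_id: str) -> Optional[str]:
    """Status-script side: ask the sidecar if the environment variable is set."""
    raw = os.environ.get(ENV_VAR)
    if not raw:
        return None  # no sidecar: the caller would fall back to bjobs <jobid>
    cfg = json.loads(raw)
    req = urllib.request.Request(
        f"http://127.0.0.1:{cfg['port']}/status/{job_id}",
        headers={"Authorization": f"Bearer {cfg['secret']}"},
    )
    # Bypass any configured HTTP proxy, since the server is on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=5) as resp:
        return json.loads(resp.read())["status"]


if __name__ == "__main__":
    server, line = start_sidecar()
    print(line, flush=True)    # the single line the sidecar outputs
    os.environ[ENV_VAR] = line  # set here by hand only for the demo
    print(query_status("2510890"))
    server.shutdown()
    server.server_close()
```

In this sketch `query_status` returns `None` both when no sidecar is configured and when the sidecar does not know the job, so the status script can treat either case as "fall back to a direct lookup".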