Need to cancel orphaned SLURM jobs
yhtang opened this issue · 4 comments
Currently, if a multi-node CI job submits a SLURM job and then gets canceled while waiting for it to run, the SLURM job becomes orphaned: it is left in the queue and will still run, wasting computing resources even though its result will never be collected. To fix this, we need to:
- Keep track of the submitted SLURM job id, e.g. using a step output variable (see the sketch below).
- Define a cancellation step, guarded by the `if: cancelled()` condition, that removes the SLURM job when the CI job is canceled.
This will help reduce congestion in our CI cluster, especially when repeated updates to a PR flood the SLURM queue with orphaned CI jobs.
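For example, the submission step could capture the job id with `sbatch --parsable` and expose it as a step output. A rough sketch, where `job.sh` is a placeholder and the secret/variable names are the ones referenced later in this thread:

```yaml
- name: Submit SLURM job
  id: submit
  shell: bash -x -e {0}
  run: |
    # `sbatch --parsable` prints only the job id (plus the cluster name, if any)
    SLURM_JOB_ID=$(ssh ${{ secrets.CLUSTER_LOGIN_USER }}@${{ vars.HOSTNAME_SLURM_LOGIN }} \
      sbatch --parsable job.sh)
    # Expose the id as a step output for a later cleanup step to consume
    echo "SLURM_JOB_ID=${SLURM_JOB_ID}" >> "$GITHUB_OUTPUT"
```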
I don't know what signal gets sent, but would a simple `trap` be a less complex solution?
Do you mean a `trap` in the SLURM job script? How does the SLURM job get the signal (thus triggering the trap) when the Actions job that created it was canceled, e.g. via the web GUI or REST API?
I'm working on a solution that adds something like this immediately after each `sbatch` job step:

```yaml
- name: Remove orphaned SLURM job if the CI job is canceled
  if: cancelled()
  shell: bash -x -e {0}
  run: |
    ssh ${{ secrets.CLUSTER_LOGIN_USER }}@${{ vars.HOSTNAME_SLURM_LOGIN }} \
      scancel ${{ steps.submit.outputs.SLURM_JOB_ID }}
```

Do you think there might be a simpler solution?
For steps that need to be canceled, the runner machine sends SIGINT/Ctrl-C to the step's entry process (node for a JavaScript action, docker for a container action, and bash/cmd/pwsh when using run in a step). If the process doesn't exit within 7500 ms, the runner sends SIGTERM/Ctrl-Break to the process, then waits 2500 ms for it to exit. If the process is still running, the runner kills the process tree.
So the bash script could catch SIGINT and cancel the SLURM job?
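For illustration, here is a rough sketch of such a script, where `CLUSTER_LOGIN` and `job.sh` are placeholders and a polling loop stands in for however the step actually waits on the job:

```bash
#!/usr/bin/env bash
# Sketch only: trap the cancellation signal sent by the runner and scancel the job.
# CLUSTER_LOGIN and job.sh are placeholders for the real login target and batch script.
set -u

SLURM_JOB_ID=$(ssh "$CLUSTER_LOGIN" sbatch --parsable job.sh)

cancel_job() {
  ssh "$CLUSTER_LOGIN" scancel "$SLURM_JOB_ID"
  exit 1
}
trap cancel_job INT TERM

# Wait until the job leaves the queue; squeue -h -j prints nothing once it is gone.
while [ -n "$(ssh "$CLUSTER_LOGIN" squeue -h -j "$SLURM_JOB_ID" 2>/dev/null)" ]; do
  # Background the sleep and `wait` so the trap fires as soon as the signal arrives,
  # rather than after the sleep finishes.
  sleep 30 & wait $!
done
```

The backgrounded `sleep` plus `wait` matters here: bash defers trap handlers until the current foreground command finishes, and the runner only waits about 7.5 s before escalating from SIGINT to SIGTERM.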