NVIDIA/JAX-Toolbox

Need to cancel orphaned SLURM jobs

yhtang opened this issue · 4 comments

yhtang commented

Currently, if a multi-node CI job submits a SLURM job and is then canceled while waiting for it to run, the SLURM job becomes orphaned: it is left in the queue to run, wasting compute resources even though its result will never be collected. To fix this, we need to:

  1. Keep track of the submitted SLURM job id, e.g. using a step output variable (see the sketch below).
  2. Define a cancellation step, guarded by the if: cancelled() condition, that runs when the CI job is canceled and removes the SLURM job.

This will help to reduce the congestion in our CI cluster, especially when repeated updates to a PR cause the SLURM queue to be flooded with orphaned CI jobs.
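
For step 1, something like this might work (an untested sketch; the step id, secret/variable names, and batch_script.sh are placeholders, and sbatch --parsable makes sbatch print only the job id):

      - name: Submit SLURM job
        id: submit
        shell: bash -x -e {0}
        run: |
          # submit on the login node; --parsable makes sbatch print only the job id
          SLURM_JOB_ID=$(ssh ${{ secrets.CLUSTER_LOGIN_USER }}@${{ vars.HOSTNAME_SLURM_LOGIN }} \
            sbatch --parsable batch_script.sh)
          # expose it as a step output so a later cleanup step can reference it
          echo "SLURM_JOB_ID=${SLURM_JOB_ID}" >> "$GITHUB_OUTPUT"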

I don't know what signal gets sent, but would a simple trap be a less complex solution?

yhtang commented

I don't know what signal gets sent, but would a simple trap be a less complex solution?

Do you mean a trap in the SLURM job script? How does the SLURM job get the signal (thus triggering the trap) when the Actions job that created it was canceled, e.g. via the web GUI or REST API?

I'm working on a solution that adds something like this immediately after each sbatch job step:

      - name: Remove orphaned SLURM job if the CI job is canceled
        if: cancelled()
        shell: bash -x -e {0}
        run: |
          ssh ${{ secrets.CLUSTER_LOGIN_USER }}@${{ vars.HOSTNAME_SLURM_LOGIN }} \
            scancel ${{ steps.submit.outputs.SLURM_JOB_ID }}

Do you think there might be a simpler solution?

https://docs.github.com/en/actions/managing-workflow-runs/canceling-a-workflow#steps-github-takes-to-cancel-a-workflow-run

For steps that need to be canceled, the runner machine sends SIGINT/Ctrl-C to the step's entry process (node for javascript action, docker for container action, and bash/cmd/pwsh when using run in a step). If the process doesn't exit within 7500 ms, the runner will send SIGTERM/Ctrl-Break to the process, then wait for 2500 ms for the process to exit. If the process is still running, the runner kills the process tree.

so the bash script could catch SIGINT and cancel the Slurm job?

Yea, what @olupton is describing is what I had in mind. I haven't tested it, but it would look something like:

# submit the job on the login node and capture its id
sshx sbatch ....
JOB_ID=$(...)
# the runner sends SIGINT on cancellation, so clean up the Slurm job then
trap "sshx scancel $JOB_ID" SIGINT
block_until_job_is_finished
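
Fleshing that out a bit (an untested sketch; LOGIN, batch.sh, and the 30-second polling interval are placeholders, and plain ssh stands in for sshx):

#!/bin/bash
set -euo pipefail

LOGIN="user@slurm-login"  # placeholder for the usual login-node SSH target

# submit and capture the job id; --parsable makes sbatch print only the id
JOB_ID=$(ssh "$LOGIN" sbatch --parsable batch.sh)

# on cancellation the runner sends SIGINT (then SIGTERM), so remove the Slurm job
trap 'ssh "$LOGIN" scancel "$JOB_ID"' INT TERM

# block until the job leaves the queue (squeue -h -j prints nothing once it is gone)
while [ -n "$(ssh "$LOGIN" squeue -h -j "$JOB_ID")" ]; do
  sleep 30
done

Since the runner sends SIGINT to the step's entry process (the bash running this script), the trap should fire well before the 7500 ms SIGTERM escalation described in the docs quoted above.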