NVIDIA/JAX-Toolbox

Unique name and logfile for SLURM-launched A100 runners

yhtang opened this issue · 0 comments

This workflow run exposed an issue with our current setup: both the JAX and Pallas unit tests call the _runner_ondemand_slurm.yaml workflow to create A100 runners. If two such calls happen in quick succession, they end up creating two runners with identical names (A100-${{ github_run_id }}) that the SLURM cluster may schedule at the same time. This prevents the actual job from landing properly on its runner (details still to be investigated).

To avoid conflicts between runners launched this way, each runner needs a distinct name, e.g. one that includes a UUID.
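
As a rough illustration of the naming idea, here is a minimal Python sketch. The helper names `unique_runner_name` and `logfile_for`, the 8-character suffix length, and the `.log` naming are all assumptions made for this example; the actual fix would presumably live in the workflow or the SLURM launch script rather than in Python.

```python
import uuid


def unique_runner_name(run_id: str, prefix: str = "A100") -> str:
    """Hypothetical helper: build a runner name that differs on every call.

    Keeps the existing "<prefix>-<run_id>" pattern and appends a short
    random suffix, so two calls within the same workflow run no longer collide.
    """
    # uuid4() is random; 8 hex characters (32 bits) is an assumed length
    # that should be ample for a handful of runners per workflow run.
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}-{run_id}-{suffix}"


def logfile_for(runner_name: str) -> str:
    """Hypothetical helper: derive a per-runner logfile name from the runner name."""
    return f"{runner_name}.log"


if __name__ == "__main__":
    # Two calls with the same run ID, as in the JAX and Pallas case.
    first = unique_runner_name("1234567890")
    second = unique_runner_name("1234567890")
    print(first, logfile_for(first))
    print(second, logfile_for(second))
```

Deriving the logfile name from the runner name would keep the two in sync, so concurrent runners would not write to the same file either.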