Unique name and logfile for SLURM-launched A100 runners
yhtang opened this issue · 0 comments
This workflow run exposed an issue with our current workflow: both the JAX and Pallas unit tests call the _runner_ondemand_slurm.yaml workflow to create A100 runners. If two such calls happen in quick succession, they end up creating two runners with identical names (A100-${{ github_run_id }}) that the SLURM cluster may schedule at the same time, which prevents the actual job from landing properly on a runner (more details to be discovered here).
To avoid conflicts between runners launched this way, each runner needs a distinct name, e.g. by including a UUID as part of the name.
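A minimal sketch of the naming idea, written in Python purely for illustration; the actual fix would live in _runner_ondemand_slurm.yaml and may look quite different. The helper name `make_runner_identity`, the `.log` suffix, and the decision to derive the logfile name from the runner name are all assumptions, not something the current workflow does.

```python
import uuid


def make_runner_identity(run_id: str, prefix: str = "A100") -> tuple[str, str]:
    """Return a (runner_name, logfile_name) pair that is unique per call.

    Hypothetical helper: the run ID alone is shared by every job in a
    workflow run, so a random UUID is appended to tell runners apart.
    """
    suffix = uuid.uuid4().hex  # 32 hex characters, random per call
    runner_name = f"{prefix}-{run_id}-{suffix}"
    # Assumed convention: one logfile per runner, named after the runner.
    logfile_name = f"{runner_name}.log"
    return runner_name, logfile_name


if __name__ == "__main__":
    # Two calls with the same run ID, as when the JAX and Pallas tests
    # request runners in quick succession.
    first = make_runner_identity("1234567890")
    second = make_runner_identity("1234567890")
    print(first)
    print(second)
```

Keeping the run ID in the name still lets a runner be traced back to its workflow run, while the UUID suffix keeps two runners from the same run from colliding on the name or the logfile.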