Azure/az-hop

`sacct` does not work on `ondemand` with cc-slurm 3.x


Version

1.0.40

In what area(s)?

/area administration
/area ansible
/area autoscaling
/area configuration
/area cyclecloud
/area documentation
/area image
/area job-scheduling
/area monitoring
/area ood
/area remote-visualization
/area user-management

Expected Behavior

sacct should work on the ondemand node

Actual Behavior

$ sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

Steps to Reproduce the Problem

Install az-hop with cc-slurm 3.x and Slurm 23.x, then run sacct on the ondemand node.

Solution

The problem is that /anfhome/slurm/config/accounting.conf is configured to point to localhost:

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost="localhost"
AccountingStorageTRES=gres/gpu

However, slurmdbd only runs on the scheduler node (sacct works fine there).
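To confirm which accounting host a node is configured to use, something like the following could be run on the ondemand node. This is a minimal sketch: the function name `accounting_host` is hypothetical, and it assumes the `AccountingStorageHost` line has the simple `key="value"` form shown above.

```shell
# Hypothetical helper: print the configured AccountingStorageHost value,
# with any surrounding double quotes stripped.
accounting_host() {
    local conf_file="$1"
    sed -n -E 's/^AccountingStorageHost="?([^"]*)"?$/\1/p' "$conf_file"
}

# Example (path taken from this issue):
#   accounting_host /anfhome/slurm/config/accounting.conf
```

With the file contents quoted above, this prints `localhost`, which matches the host in the sacct error message.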

To fix this, change localhost in accounting.conf to the scheduler host name, {{ scheduler.name }}, taken from the config file.
(There used to be logic for this in the slurm.conf.j2 template, but it seems the template is no longer used with cc-slurm 3.x.)
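Until a proper fix lands, a manual edit along these lines might serve as a stopgap. This is only a sketch, not the workaround referred to in this issue: the function name `fix_accounting_host` and the example host name `scheduler` are hypothetical, and the file may be regenerated by CycleCloud, in which case the change would be lost.

```shell
# Hypothetical helper: rewrite the AccountingStorageHost line in a Slurm
# accounting config so it points at the scheduler instead of localhost.
# A backup of the original file is kept next to it with a .bak suffix.
fix_accounting_host() {
    local conf_file="$1"
    local scheduler_host="$2"
    sed -i.bak -E \
        "s|^AccountingStorageHost=.*$|AccountingStorageHost=\"${scheduler_host}\"|" \
        "$conf_file"
}

# Example (path taken from this issue; "scheduler" is an assumed host name):
#   fix_accounting_host /anfhome/slurm/config/accounting.conf scheduler
```

Only the `AccountingStorageHost` line is touched; the other settings in the file are left as they are. Running `sacct` again on the ondemand node is the simplest way to check whether the change took effect.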

I've opened a bug against CycleCloud: Azure/cyclecloud-slurm#215
I'm working on a workaround in the meantime.