powsybl/powsybl-hpc

[Slurm] Job execution should not be dependent on the availability of Slurm DB


  • Do you want to request a feature or report a bug?

Ambiguous.

  • What is the current behavior?

When sacct is unavailable, for example because of problems communicating with Slurm DB, job submission fails.

  • What is the expected behavior?

Slurm itself can keep executing jobs even when Slurm DB is not responding. Slurm DB only provides accounting and must not be a critical dependency: at worst, some accounting information is lost while it is unavailable, but job execution continues.

In the same way, the SlurmComputationManager should be able to continue to execute jobs when the Slurm DB is not available.

To achieve this, we must in particular remove the use of the sacct command. We must also investigate whether other Slurm commands, such as scontrol, depend on the availability of the DB.
We may also remove or improve some of the checks performed at the initialization of the component (sacct --help, etc.) to avoid unnecessary calls to Slurm commands.
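
As a rough illustration of what replacing sacct could look like, the sketch below reads a job's state from the text printed by `scontrol show job <jobId>`, which contains a `JobState=<STATE>` field. This is not the code from the fix: the class `ScontrolJobStateParser` and its methods are hypothetical names, the list of terminal states is deliberately non-exhaustive, and whether scontrol is truly independent of Slurm DB in a given deployment is exactly the point that still needs to be investigated.

```java
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of powsybl-hpc): extracts the job state
 * from the output of `scontrol show job <jobId>` instead of calling sacct.
 */
public final class ScontrolJobStateParser {

    // Matches the "JobState=<STATE>" field of the scontrol output.
    private static final Pattern JOB_STATE = Pattern.compile("\\bJobState=([A-Z_]+)");

    // Assumption: a non-exhaustive set of states after which a job no longer runs.
    private static final Set<String> TERMINAL_STATES =
            Set.of("COMPLETED", "CANCELLED", "FAILED", "TIMEOUT", "NODE_FAIL");

    private ScontrolJobStateParser() {
    }

    /** Returns the job state, or empty if the output has no JobState field. */
    public static Optional<String> parseJobState(String scontrolOutput) {
        if (scontrolOutput == null) {
            return Optional.empty();
        }
        Matcher matcher = JOB_STATE.matcher(scontrolOutput);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    /** True if the state is one of the terminal states listed above. */
    public static boolean isTerminal(String state) {
        return TERMINAL_STATES.contains(state);
    }
}
```

An empty result should be treated as "state unknown" rather than as a failure, since the controller only keeps finished jobs in memory for a limited time (see `MinJobAge` in slurm.conf), after which `scontrol show job` no longer reports them.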

  • What is the motivation / use case for changing the behavior?

Better availability of the computation service.

Fixed by #52