sxs-collaboration/spectre

Scheduler requires mpi use on head node

Bug reports:

The scheduler (spectre schedule) command checks that the executable can correctly parse the input file on the head node (this requires using MPI on the head node). Once this is done, the submit script is generated and submission to the queue proceeds.

On our cluster Urania, however, we need to load an interactive MPI module to run the executable on the head node (and thus to allow the scheduler to validate the input file). This module, in turn, prevents sbatch from submitting jobs (which is the next step):

(env) guilara@urania02:/urania/.../RunDir> sbatch Submit.sh 
You cannot use 'sbatch' while the module impi-interactive is loaded!

We are not sure if other clusters will have similar problems.

I suggest removing the lines where the input file is validated (the "# Validate input file" block). Or perhaps validate directly on the compute nodes before execution?

As a quick workaround, let's add a --no-validate flag to skip validation. You can even use a config file to always pass this flag to the CLI automatically on the cluster (see spectre --help).

Running the validation on the compute node defeats its purpose, because you want to validate the input file at job submission so you don't have to wait until the job has gone through the queue only to have it fail with a syntax error in the input file. If you can't run an executable and sbatch with the same modules on the head node I'm not sure what to do. Can you run just the executable without mpirun on the head node?

@nilsvu I like the idea of the flag. Unfortunately, the executable doesn't run on the head node even without mpirun (unless I load the interactive module ofc). Right, I understand that it defeats the purpose of validation if it's done on the compute node. But I'm not sure at the moment what another fix could be.

Ok, can you add the --no-validate flag to work around this then? You can add it to Schedule.py and the scheduler_options.
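
For reference, here is a minimal sketch of how such a flag could gate the validation step. It is not the actual Schedule.py code: it uses argparse from the standard library as a stand-in for the real CLI, and build_parser, schedule, validate_fn, and submit_fn are hypothetical names made up for illustration. Only the --no-validate flag name comes from this thread.

```python
import argparse
from typing import Callable, List


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real CLI. Only the flag name
    # --no-validate is taken from the discussion above.
    parser = argparse.ArgumentParser(prog="schedule-sketch")
    parser.add_argument("input_file")
    parser.add_argument(
        "--no-validate",
        dest="validate",
        action="store_false",  # validation stays on unless the flag is given
        help="Skip validating the input file on the head node.",
    )
    return parser


def schedule(
    input_file: str,
    validate: bool,
    validate_fn: Callable[[str], None],
    submit_fn: Callable[[str], None],
) -> List[str]:
    """Validate (unless disabled), then submit. Returns the steps taken.

    validate_fn is assumed to raise an exception on an invalid input file,
    so a failed validation aborts before anything is submitted.
    """
    steps = []
    if validate:
        validate_fn(input_file)
        steps.append("validate")
    submit_fn(input_file)
    steps.append("submit")
    return steps


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Placeholder callables: a real implementation would run the executable
    # to check the input file and then hand the submit script to the queue.
    taken = schedule(
        args.input_file,
        validate=args.validate,
        validate_fn=lambda path: print(f"validating {path}"),
        submit_fn=lambda path: print(f"submitting {path}"),
    )
    print(taken)
```

With this layout validation remains the default, so other clusters keep the early syntax check at submission time, and only clusters like Urania need to pass --no-validate.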

Will do