Better HPC interfacing
Closed this issue · 1 comments
Julia's built-in parallelism isn't well suited to launching complicated workflows / models on clusters. It would be useful to have something that interfaces better with the scheduler.
Example:
- some of my simulations need to be run on multiple nodes, rather than a single cpu
- a decent amount of data processing is carried out before the model is run, and I run into memory issues
@FriesischScott has suggested a nice trick, where the solver calls sbatch and waits for the job to complete:
OpenSees = Solver(
"sbatch",
"launch_sim.sh";
args="--wait",
)
with launch_sim.sh containing everything needed to run my workflow.
Nice, but it has some problems:
- To run parallel jobs, I still need to call addprocs(N), and these N processes sit idle waiting for jobs to finish.
- My scheduler allows at most 20 jobs at the same time, so if I want to run many quick model evaluations, this is slow.
- Launching many jobs like this can quickly swamp the scheduler.
Slurm job arrays
Slurm's job arrays are a nice way to manage many similar jobs, such as jobs that differ only by an index (e.g. sample-N). They also let you preallocate the total amount of resources needed and set how many tasks run concurrently.
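To make this concrete, here is a rough sketch of what an array batch script could look like. The resource numbers, the sample-N directory layout and the commented-out run_model.sh command are all placeholders I made up, not anything from an actual setup. The `%20` suffix on `--array` limits how many tasks run at once, which would match the 20-job cap mentioned above.

```shell
#!/bin/sh
# 100 array tasks, at most 20 running at the same time
#SBATCH --array=1-100%20
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
# %A is the array job ID, %a is the array task index
#SBATCH --output=slurm-%A_%a.out

# Map an array index to a per-sample directory (hypothetical layout).
sample_dir() {
    printf 'sample-%s' "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID for every array task. Default to 1 so the
# script can also be dry-run outside the scheduler.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
workdir="$(sample_dir "${task_id}")"

echo "task ${task_id} -> ${workdir}"
# Hypothetical per-sample command, left commented out in this sketch:
# cd "${workdir}" && ./run_model.sh
```

Every task runs the same script and only the index differs, so the scheduler sees a single submission instead of 100 separate jobs.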
It would require a bit of engineering, but perhaps an interface of some sort could be written, i.e. when pmap is called in the External model, the input files are created (directories created, files copied, values interpolated), and a single Slurm job array is submitted which loops through the individual jobs.
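The two steps could look roughly like the sketch below: first prepare one directory per sample, then build a single array submission. This is only an illustration in plain shell, not how the External model does or should do it. The function names, the `{{x}}` placeholder, the input.txt file name and launch_array.sh are all hypothetical. The sketch assumes one numeric value per line in the values file, with a trailing newline.

```shell
# Create sample-1 ... sample-N under outdir, one per line of the values
# file, and substitute the {{x}} placeholder in the template for each one.
# Prints the number of samples prepared.
prepare_samples() {
    ps_template="$1"
    ps_values="$2"
    ps_outdir="$3"
    ps_i=0
    while IFS= read -r ps_value; do
        ps_i=$((ps_i + 1))
        mkdir -p "${ps_outdir}/sample-${ps_i}"
        # Assumes numeric values, so no '/' or '&' that would confuse sed.
        sed "s/{{x}}/${ps_value}/g" "${ps_template}" \
            > "${ps_outdir}/sample-${ps_i}/input.txt"
    done < "${ps_values}"
    echo "${ps_i}"
}

# Build (but do not run) the command for one array submission:
# n tasks, at most max_concurrent at a time, blocking until all finish.
array_submit_cmd() {
    printf 'sbatch --wait --array=1-%s%%%s %s\n' "$1" "$2" "$3"
}
```

With this, `n=$(prepare_samples template.txt values.txt run)` followed by `array_submit_cmd "$n" 20 launch_array.sh` prints one sbatch line for the whole study. Only a single process has to wait on it, so there is no need for N idle workers.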
Maestro workflow manager
Could also be something to look into:
https://github.com/LLNL/maestrowf?tab=readme-ov-file
https://maestrowf.readthedocs.io/en/latest/index.html
Provides a YAML specification format and a command-line tool for performing parameter studies, and can work with Slurm.
Scheduling studies: https://maestrowf.readthedocs.io/en/latest/Maestro/scheduling.html#flux
If I understand correctly, the whole point is that instead of having one job that runs all samples in parallel, you submit one job per sample, each of which is then parallelized, correct?