adjtomo/seisflows

Multicore workstation system module

bch0w opened this issue · 4 comments

bch0w commented

Legacy SeisFlows could run workstation problems in an embarrassingly parallel fashion via its multicore system submodule.

https://github.com/adjtomo/seisflows/blob/legacy/seisflows/system/multicore.py

This feature did not make it over to modern SeisFlows, but would be very useful for those working on workstations, or for speeding up example problems.

It should be quick to override the run function of the Workstation submodule and run tasks using multiprocessing, with a few additional parameters controlling the maximum number of workers.
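For illustration, a minimal sketch of what such an override might look like, using multiprocessing plus subprocess as suggested above. The function signature, the `max_workers` parameter, and the `TASK_ID` environment variable are all placeholders, not the actual SeisFlows API:

```python
import os
import subprocess
from multiprocessing import Pool


def _run_task(args):
    """Run one task as an independent OS process (embarrassingly parallel)."""
    run_call, task_id = args
    # Expose the task id via an environment variable; the variable name
    # here is a placeholder, not the real SeisFlows one
    env = {**os.environ, "TASK_ID": str(task_id)}
    subprocess.run(run_call, shell=True, check=True, env=env)


def run(run_call, ntasks, max_workers):
    """Launch `ntasks` copies of `run_call`, at most `max_workers` at once."""
    # Cap concurrent tasks to avoid oversubscribing the workstation's cores
    with Pool(processes=max_workers) as pool:
        pool.map(_run_task, [(run_call, t) for t in range(ntasks)])
```

Each task is fully independent (no inter-task communication), which is all an embarrassingly parallel workstation run requires.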

bch0w commented

Actually, this is likely already taken care of by the Cluster class, which submits jobs to a local system using subprocess.run and concurrent.futures.

One small bug I noticed when looking over Cluster is that the maximum number of concurrent jobs (max_workers) should be ntask_max, not nproc - 1, which is what it is currently set to:

with ProcessPoolExecutor(max_workers=nproc() - 1) as executor:
    futures = [executor.submit(self._run_task, run_call, task_id)
               for task_id in range(ntasks)]

bch0w commented


This bug was fixed in 646b6c1, and I was able to successfully run an embarrassingly parallel workstation example using the Cluster system module.

Ben-J-Eppinger commented

Hello, sorry to bother you on another thread... Could you please be a bit more specific about how you set up the parameter file to run SeisFlows this way? My parameter file looks like the following, but when I try to run in a parallel fashion, SeisFlows hangs for a long time without doing anything.

system: cluster
ntask: 20
ntask_max: 20
walltime: 100:00:00
tasktime: 1000:00:00
nproc: 1
mpiexec: mpirun
log_level: DEBUG
verbose: False

bch0w commented

Hi @Ben-J-Eppinger, I think the issue is that the parameter nproc in your parameter file needs to be >1, and should be equal to the number of parallel tasks you think your computer can handle at once. I usually set it equal to or slightly below the total number of cores on the machine I'm using.

When nproc==1, the parallel machinery sets the total number of parallel processes to nproc - 1 = 0, which might explain why things are hanging.

Hopefully that works, but if the issue persists please feel free to open a new issue and we can try to get it resolved!

Edited: Sorry, the above message was incorrect advice; I should have read my previous comments in this thread. It looks like you have set ntask_max correctly, which sets the total number of parallel processes.

This might take some digging: if you look at the main log file sflog.txt, the log messages in the log/ directory, or the SPECFEM log files in scratch/solver/mainsolver/OUTPUT_FILES/, is there any hint of where the jobs are hanging? That would help determine what is going wrong under the hood.

If SPECFEM crashes within a subprocess, it often does not send a kill signal to the master job, which can leave things hanging indefinitely. I just ran into this recently.
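One defensive pattern for this (a sketch, not the actual Cluster implementation) is to check the subprocess return code inside each worker and raise, so the failure propagates through the future back to the master instead of hanging:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_task(run_call, task_id):
    """Run one external command; raise if it exits nonzero (e.g. a crash)."""
    proc = subprocess.run(run_call, shell=True, capture_output=True, text=True)
    if proc.returncode != 0:
        # Raising here pickles the error back to the master process
        raise RuntimeError(f"task {task_id} exited {proc.returncode}: "
                           f"{proc.stderr.strip()}")
    return task_id


def run_all(run_call, ntasks, max_workers):
    """Run all tasks; fail fast if any single task fails."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(run_task, run_call, t)
                   for t in range(ntasks)]
        for fut in as_completed(futures):
            fut.result()  # re-raises the worker's RuntimeError in the master
```

With this, a crashed solver run aborts the whole job with a traceback rather than leaving the master waiting on a task that will never finish.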