precice/micro-manager

Handle crashing of micro simulations in a proper way

IshaanDesai opened this issue · 4 comments

Currently, when the Micro Manager is controlling and running micro simulations and one of them crashes or exits improperly, the Micro Manager run just hangs. It is nearly impossible to know which micro simulation crashed and why the overall execution has hung.

The Micro Manager should be able to handle a simulation crash by ...

  • ... continuing to run the rest of the micro simulations.
  • ... logging the simulation crash in the log file.
  • ... providing the macro location at which the simulation crashed.
  • ... creating a new log file and writing the error output from the crashed simulation to it.
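
A minimal sketch of the first three points, assuming each micro simulation is a Python object with a `solve(macro_data, dt)` method and that the macro coordinates are available as a parallel list. The function name `solve_all`, the logger name, and the argument layout are hypothetical and not the Micro Manager's actual interface:

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("micro_manager_sketch")


def solve_all(micro_sims, macro_data, macro_coords, dt):
    """Run every micro simulation and record crashes instead of stopping.

    Assumption: micro_sims, macro_data and macro_coords are parallel lists,
    and each micro simulation exposes solve(data, dt).
    """
    results = [None] * len(micro_sims)
    crashed = []  # indices of the simulations that raised an exception
    for i, sim in enumerate(micro_sims):
        try:
            results[i] = sim.solve(macro_data[i], dt)
        except Exception as exc:
            # Log which simulation failed and where, then keep going.
            logger.error(
                "Micro simulation %d at macro location %s crashed: %s",
                i, macro_coords[i], exc,
            )
            crashed.append(i)
    return results, crashed
```

Note that this only covers failures that surface as Python exceptions. A micro simulation that hangs, or compiled code that terminates the process (for example with a segmentation fault), would not be caught this way and would need something like a separate process with a timeout.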

It is not clear whether all of this can be done, so some investigation is needed first.

Ideally, the complete simulation could continue to run by replacing the result of the failing micro simulation with that of a similar simulation.
Technically, this should be handled by catching exceptions.
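
As a rough sketch of the replacement idea: once the crashed simulations are known, their results could be filled in from the closest healthy one. Here "similar" is taken to mean the smallest Euclidean distance between macro input vectors, which is purely an assumption for illustration; the function name `fill_from_similar` and its arguments are hypothetical as well.

```python
import math


def fill_from_similar(results, crashed, macro_inputs):
    """Replace results of crashed simulations with those of a similar one.

    Assumption: similarity is the Euclidean distance between the macro input
    vectors; crashed holds the indices of the failed simulations.
    """
    crashed_set = set(crashed)
    healthy = [i for i in range(len(results)) if i not in crashed_set]
    if not healthy:
        # Nothing to substitute from, so continuing makes no sense.
        raise RuntimeError("All micro simulations crashed")

    filled = list(results)  # leave the input list untouched
    for i in crashed:
        # Pick the healthy simulation whose macro input is closest.
        nearest = min(
            healthy,
            key=lambda j: math.dist(macro_inputs[i], macro_inputs[j]),
        )
        filled[i] = results[nearest]
    return filled
```

For example, with `results = [1.0, None, 3.0]`, `crashed = [1]` and `macro_inputs = [[0.0], [0.9], [1.0]]`, the crashed entry would take the value of simulation 2, since its input is the closest one.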

How could it be decided what simulation is similar enough? Especially in the case of adaptivity, as I understand it, simulations that are similar to each other rely on one to run, and if it hangs or crashes, these inactive simulations would not help resolve the crash/hang.
Should this crashed simulation rely on data from another simulation from the point of failure until the end of the simulation, or should it attempt to restart somehow?
As the macro simulation can rely on data from all micro simulations, it only makes sense to continue a run if all micro simulations provide data or the necessary data is provided by other means, right? Continuing an incomplete simulation would run into more problems. Thus, the simulation could also run into cases when it might be best to abort the entire solving process if continuing the complete simulation is unreasonable due to a lack of a similar simulation to replace the crashed one with. In such a case, is there a way to let the other participant know that the simulation has been stopped early and finalize the incomplete simulation properly?

Especially in the case of adaptivity, as I understand it, simulations that are similar to each other rely on one to run, and if it hangs or crashes, these inactive simulations would not help resolve the crash/hang.

It could be the next similar one.

Should this crashed simulation rely on data from another simulation from the point of failure until the end of the simulation, or should it attempt to restart somehow?

No restart. What we had in mind were cases that fail deterministically, so a restart would fail again.

Thus, the simulation could also run into cases when it might be best to abort the entire solving process if continuing the complete simulation is unreasonable due to a lack of a similar simulation to replace the crashed one with.

Agree

is there a way to let the other participant know that the simulation has been stopped early and finalize the incomplete simulation properly?

Currently not, but under discussion in precice/precice#1118

Resolved via #85