parthenon-hpc-lab/parthenon

Fatal errors in `Task`s are not propagated upward

Closed this issue · 8 comments

Was debugging an issue where a task wasn't executing, even though it was explicitly added to the task list, verified to be in the resulting list, etc. I could execute anything if it was marked as depending on `TaskID(0)`, but not if it depended on other tasks.

The problem was that a previous task was throwing a `std::runtime_error`. The error was thrown, the thread exited, and the thread pool started up another thread without any pending tasks. `pool.wait` finished, and execution continued silently.

I don't imagine this is really what we intend; otherwise tasks become very dangerous to use. We should check on the thread pool (however that looks with `std::thread`s, I'm no expert) rather than `wait()` just moving on when a thread dies.

I think the solution here (assuming we stick with threads) is to fail rather than throw inside tasks.

Hmm, that's hard to enforce -- in this case, the `runtime_error` was thrown by Kokkos in its tremendously useful debugging mode. I'd hate to lose the ability to kill execution inside a `Task`; tasks would be much less useful if I had to weigh any code I put into them against the possibility of silent errors.

An alternative is to compile without the thread pool when debugging, and to document that Parthenon will simply not respect a `std::runtime_error` in production.

Is there not a way to check whether threads exited cleanly or threw a fatal error?

I would just second that we need to fix this one way or the other. I spent quite a bit of time last week debugging an issue that was causing a `PARTHENON_REQUIRE_THROWS` to be called in a task, but the exception didn't get propagated back to the main thread and execution just continued.

Josh and I discussed this a little bit yesterday and I think we had settled on just removing the `PARTHENON_*_THROWS` macros, but I don't think we had appreciated that Kokkos can throw as well.

Took a crack at this, and in some sense it's trivial, since the tasks are already futures: all we need to do is collect the futures and call `.get()` on each of them, and any stored exceptions will be rethrown (as a bonus, the return values can be read too!).

However, this is easier said than done, since tasks are queued from worker threads (!) when the last task finishes. I've been playing with options, but the solution is going to have to be thread-safe, or involve rewriting how tasks are queued. Unless someone's solved it already, I'll take a crack at the former, via a general mutexed collection of returns for any `ThreadQueue`, though I may have to write a specialized `ThreadQueue`/`ThreadPool` just for functions that return `TaskStatus` if my template-fu is insufficient.

Additional confusing question:

How can I end up in a situation where a failed `PARTHENON_REQUIRE_THROWS` passes on restart (run on a single rank)?

```
Blocks assigned to rank 0: 0:288
Var: advected:1
[0] WARNING: Failed to read variable advected from restart file:
### PARTHENON ERROR
  Condition:   it->meshes.contains(mesh_record_name)
  Message:     Missing mesh record 'advected_lvl-2' in restart file.
  File:        /home/pgrete/src/parthenon/src/outputs/restart_opmd.cpp
  Line number: 125

# Variables in use:
# Package: parthenon::resolved_state
# ---------------------------------------------------
# Variables:
# Name	Metadata flags
# ---------------------------------------------------
my_derived_var            Cell,Provides,Real,Derived,OneCopy,advection_package,parthenon::resolved_state
advected                  Cell,Provides,Real,Independent,FillGhost,WithFluxes,advection_package,parthenon::resolved_state
# ---------------------------------------------------
# Sparse Variables:
# Name	sparse id	Metadata flags
# ---------------------------------------------------
# ---------------------------------------------------
# Swarms:
# Swarm	Value	metadata
# ---------------------------------------------------


Setup complete, executing driver...

Hello from the advection package in the advection example!
This run is a restart: 1

### PARTHENON ERROR
  Message:     History output for package "" requested but package is not available!

  File:        /home/pgrete/src/parthenon/src/outputs/history.cpp
  Line number: 77
History output for package "" requested but package is not available!

Backtrace:

#3  0x0000555555ac4780 in Kokkos::Impl::host_abort (message=message@entry=0x55555a0175e0 "History output for package \"\" requested but package is not available!\n")
    at /home/pgrete/src/parthenon/external/Kokkos/core/src/impl/Kokkos_Error.cpp:54
#4  0x00005555557437a0 in Kokkos::abort (message=0x55555a0175e0 "History output for package \"\" requested but package is not available!\n")
    at /home/pgrete/src/parthenon/external/Kokkos/core/src/impl/Kokkos_Error.hpp:203
#5  parthenon::ErrorChecking::fail (linenumber=77, filename=0x555555ad9078 "/home/pgrete/src/parthenon/src/outputs/history.cpp", 
    message=0x55555a0175e0 "History output for package \"\" requested but package is not available!\n") at /home/pgrete/src/parthenon/src/utils/error_checking.hpp:155
#6  parthenon::ErrorChecking::fail (message="History output for package \"\" requested but package is not available!\n", 
    filename=filename@entry=0x555555ad9078 "/home/pgrete/src/parthenon/src/outputs/history.cpp", linenumber=linenumber@entry=77)
    at /home/pgrete/src/parthenon/src/utils/error_checking.hpp:164
#7  0x00005555558747e1 in parthenon::HistoryOutput::WriteOutputFile (this=0x555559f314d0, pm=0x555555fef560, pin=0x555555ff0ad0, tm=0x7fffffffe228, signal=<optimized out>)
    at /home/pgrete/src/parthenon/src/outputs/history.cpp:77
#8  0x00005555557726a6 in parthenon::Outputs::MakeOutputs (this=0x555555ff1630, pm=0x555555fef560, pin=0x555555ff0ad0, tm=tm@entry=0x7fffffffe228, 
    signal=signal@entry=parthenon::SignalHandler::OutputSignal::none) at /home/pgrete/src/parthenon/src/outputs/outputs.cpp:444
#9  0x00005555556c9869 in parthenon::EvolutionDriver::Execute (this=this@entry=0x7fffffffe1f0) at /usr/include/c++/13.2.1/bits/unique_ptr.h:199
#10 0x00005555555d5fa6 in main (argc=4, argv=0x7fffffffe438) at /home/pgrete/src/parthenon/example/advection/main.cpp:50
```

Don't mind the error itself (I'm still fighting the restart); I just don't get why the code didn't die when processing the restart file.

Never mind... I just realized that the first exception is actually caught.
Now I'm not sure why we allow the code to continue with just a warning, rather than failing, when an expected variable cannot be loaded from a restart file.

It looks like I never opened an issue about it, but I've been stung by Parthenon only warning on missing variables, too. IIRC this was partially because any read error (not just absence from the file) generates only the warning rather than a real error, which made it hard to debug restarting from face-centered fields, for example.

There are valid reasons to start with more `Restart` variables configured than are present in the restart file: in KHARMA, for example, we'll evolve ideal MHD to some time, then switch to slower extended/viscous modeling once the run is far enough along that it matters. But I think we should at least pare down the situations where this is allowed, e.g. only for restarts with a new input file, or only with some flag like `parthenon/job/restart_errors=warn` set.