`legion::Logger::error/fatal` doesn't stop the task.
trivialfis opened this issue · 6 comments
`Legion::Logger::error` and `fatal` function calls don't stop the execution of a task. In addition, the legate executable returns normally even if an error happens inside a task.
Also, is there any guidance on error handling inside tasks? For instance, what's the best way to indicate a task is not running correctly? Or is there a way to emit an error message explaining misuse by the caller?
> `Legion::Logger::error` and `fatal` function calls don't stop the execution of a task.
This appears to have been a conscious decision by @lightsighter and @streichler, so I'll let them comment.
> In addition, the legate executable returns normally even if an error happens inside a task.
I am not seeing this behavior; whether I insert a crash through an assertion or an uncaught exception, the exit code of the process is always 1:
~/cunumeric> LEGATE_TEST=1 legate --cpus 1 a.py
[0 - 7ff8548e9700] 0.000193 {4}{threads}: reservation ('Python-1 proc 1d00000000000003') cannot be satisfied
Assertion failed: (0), function cpu_variant, file matvecmul.cc, line 32.
Signal 6 received by node 0, process 33498 (thread 700001f4b000) - obtaining backtrace
Signal 6 received by process 33498 (thread 700001f4b000) at: stack trace: 16 frames
[0] = 0 libsystem_platform.dylib 0x00007ff81123d5ec _sigtramp + 28
[1] = 0 libsystem_kernel.dylib 0x00007ff8111d8203 __pthread_kill + 11
[2] = 0 libsystem_pthread.dylib 0x00007ff81120fee6 pthread_kill + 263
[3] = 0 libsystem_c.dylib 0x00007ff811136b45 abort + 123
[4] = 0 libsystem_c.dylib 0x00007ff811135e5e err + 0
[5] = 0 libcunumeric.dylib 0x00000001b119ca3b _ZN9cunumeric13MatVecMulTask11cpu_variantERN6legate11TaskContextE + 43
[6] = 0 liblgcore.dylib 0x0000000110c24a25 _ZN6legate6detail12task_wrapperEPFvRNS_11TaskContextEERKNSt3__112basic_stringIcNS5_11char_traitsIcEENS5_9allocatorIcEEEEPKvmSF_mN5Realm9ProcessorE + 181
[7] = 0 libcunumeric.dylib 0x00000001b119ce70 _ZN6legate10LegateTaskIN9cunumeric13MatVecMulTaskEE19legate_task_wrapperIXadL_ZNS2_11cpu_variantERNS_11TaskContextEEEEEvPKvmS8_mN5Realm9ProcessorE + 80
[8] = 0 librealm.1.dylib 0x000000011b14c048 _ZN5Realm18LocalTaskProcessor12execute_taskEjRKNS_12ByteArrayRefE + 1272
[9] = 0 librealm.1.dylib 0x000000011b1eb83b _ZN5Realm4Task20execute_on_processorENS_9ProcessorE + 891
[10] = 0 librealm.1.dylib 0x000000011b1f254c _ZN5Realm25KernelThreadTaskScheduler12execute_taskEPNS_4TaskE + 44
[11] = 0 librealm.1.dylib 0x000000011b1f0bd4 _ZN5Realm21ThreadedTaskScheduler14scheduler_loopEv + 1396
[12] = 0 librealm.1.dylib 0x000000011b1f15da _ZN5Realm21ThreadedTaskScheduler20scheduler_loop_wlockEv + 42
[13] = 0 librealm.1.dylib 0x000000011b204cec _ZN5Realm6Thread20thread_entry_wrapperINS_21ThreadedTaskSchedulerEXadL_ZNS2_20scheduler_loop_wlockEvEEEEvPv + 92
[14] = 0 librealm.1.dylib 0x000000011b208dc8 _ZN5Realm12KernelThread13pthread_entryEPv + 472
[15] = 0 libsystem_pthread.dylib 0x00007ff8112101d3 _pthread_start + 125
~/cunumeric> echo $?
1
~/cunumeric> LEGATE_TEST=1 legate --cpus 1 a.py
[0 - 7ff8548e9700] 0.000225 {4}{threads}: reservation ('Python-1 proc 1d00000000000003') cannot be satisfied
libc++abi: terminating due to uncaught exception of type std::runtime_error: something went wrong
Signal 6 received by node 0, process 33601 (thread 70000bd86000) - obtaining backtrace
Signal 6 received by process 33601 (thread 70000bd86000) at: stack trace: 21 frames
[0] = 0 libsystem_platform.dylib 0x00007ff81123d5ec _sigtramp + 28
[1] = 0 libsystem_kernel.dylib 0x00007ff8111d8203 __pthread_kill + 11
[2] = 0 libsystem_pthread.dylib 0x00007ff81120fee6 pthread_kill + 263
[3] = 0 libsystem_c.dylib 0x00007ff811136b45 abort + 123
[4] = 0 libc++abi.dylib 0x00007ff8111ca282 abort_message + 241
[5] = 0 libc++abi.dylib 0x00007ff8111bc3e1 _ZL28demangling_terminate_handlerv + 241
[6] = 0 libobjc.A.dylib 0x00007ff810e907d6 _ZL15_objc_terminatev + 104
[7] = 0 libc++abi.dylib 0x00007ff8111c96db _ZSt11__terminatePFvvE + 6
[8] = 0 libc++abi.dylib 0x00007ff8111cbfa7 __cxa_get_exception_ptr + 0
[9] = 0 libc++abi.dylib 0x00007ff8111cbf6e _ZN10__cxxabiv1L22exception_cleanup_funcE19_Unwind_Reason_CodeP17_Unwind_Exception + 0
[10] = 0 libcunumeric.dylib 0x00000001aadd19f8 _ZN9cunumeric13MatVecMulTask11cpu_variantERN6legate11TaskContextE + 72
[11] = 0 liblgcore.dylib 0x000000010a858a25 _ZN6legate6detail12task_wrapperEPFvRNS_11TaskContextEERKNSt3__112basic_stringIcNS5_11char_traitsIcEENS5_9allocatorIcEEEEPKvmSF_mN5Realm9ProcessorE + 181
[12] = 0 libcunumeric.dylib 0x00000001aadd1e50 _ZN6legate10LegateTaskIN9cunumeric13MatVecMulTaskEE19legate_task_wrapperIXadL_ZNS2_11cpu_variantERNS_11TaskContextEEEEEvPKvmS8_mN5Realm9ProcessorE + 80
[13] = 0 librealm.1.dylib 0x0000000114d80048 _ZN5Realm18LocalTaskProcessor12execute_taskEjRKNS_12ByteArrayRefE + 1272
[14] = 0 librealm.1.dylib 0x0000000114e1f83b _ZN5Realm4Task20execute_on_processorENS_9ProcessorE + 891
[15] = 0 librealm.1.dylib 0x0000000114e2654c _ZN5Realm25KernelThreadTaskScheduler12execute_taskEPNS_4TaskE + 44
[16] = 0 librealm.1.dylib 0x0000000114e24bd4 _ZN5Realm21ThreadedTaskScheduler14scheduler_loopEv + 1396
[17] = 0 librealm.1.dylib 0x0000000114e255da _ZN5Realm21ThreadedTaskScheduler20scheduler_loop_wlockEv + 42
[18] = 0 librealm.1.dylib 0x0000000114e38cec _ZN5Realm6Thread20thread_entry_wrapperINS_21ThreadedTaskSchedulerEXadL_ZNS2_20scheduler_loop_wlockEvEEEEvPv + 92
[19] = 0 librealm.1.dylib 0x0000000114e3cdc8 _ZN5Realm12KernelThread13pthread_entryEPv + 472
[20] = 0 libsystem_pthread.dylib 0x00007ff8112101d3 _pthread_start + 125
~/cunumeric> echo $?
1
> Also, is there any guidance on error handling inside tasks? For instance, what's the best way to indicate a task is not running correctly? Or is there a way to emit an error message explaining misuse by the caller?
Most user-caused (and thus recoverable) error conditions should get detected and reported before worker tasks get launched. We typically use standard Python exceptions to report such errors (`ValueError`, `TypeError`, ...), which the user code could ostensibly catch and recover from. The Legate library writer is responsible for checking and sanitizing inputs as much as possible, before they make it to the task body.
Sometimes a problematic input is only detected as part of running the computation (e.g. during a linear solve it is discovered that the matrix is singular). In that case the task should throw a special `legate::TaskException`, which will get caught and translated to a corresponding exception on the calling side, which the user could presumably catch and recover from. However, there are a lot of caveats with this system, so it's preferable to catch user errors before the launch if possible.
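As a rough illustration of the recoverable-error path described above, here is a minimal sketch that compiles without Legate. Everything in the `sketch` namespace is hypothetical: `sketch::TaskException` is only a stand-in for `legate::TaskException`, and `solve_2x2`, `run_task`, and `Outcome` are made-up names. The real runtime translates the exception into one on the calling side; this sketch only mimics that by returning the message to the caller.

```cpp
#include <array>
#include <cmath>
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical stand-in for legate::TaskException (assumption: the real class
// lives in Legate and is not reproduced here).
struct TaskException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<Vec2, 2>;

// Stand-in for a task body: solve the 2x2 system A x = b with Cramer's rule.
// A singular matrix can only be detected during the computation, so it is
// reported with the recoverable exception type.
Vec2 solve_2x2(const Mat2& a, const Vec2& b) {
  const double det = a[0][0] * a[1][1] - a[0][1] * a[1][0];
  if (std::fabs(det) < 1e-12) throw TaskException("matrix is singular");
  Vec2 x{};
  x[0] = (b[0] * a[1][1] - a[0][1] * b[1]) / det;
  x[1] = (a[0][0] * b[1] - b[0] * a[1][0]) / det;
  return x;
}

// What the caller gets back: a result, or a non-empty error message.
struct Outcome {
  Vec2 x{};
  std::string error;
};

// Stand-in for the wrapper around the task body. It catches only the
// recoverable type; any other exception propagates and ends the run.
Outcome run_task(const Mat2& a, const Vec2& b) {
  Outcome out;
  try {
    out.x = solve_2x2(a, b);
  } catch (const TaskException& e) {
    out.error = e.what();
  }
  return out;
}

}  // namespace sketch
```

A caller would check `Outcome::error` and recover (for example by rejecting the input) when it is non-empty. Validating inputs before the launch remains preferable where that is possible.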
Any remaining errors are internal errors that the user cannot do anything about, and they should just terminate the execution. You can either report an error on the appropriate logger and then call the `LEGATE_ABORT` macro, or throw any exception besides `legate::TaskException`. I believe we are slowly transitioning to the latter.
Note that this is just my interpretation of our current practices, since this is not officially documented anywhere (and that's something we should fix). Inviting @magnatelee and @jjwilke to comment further.
> This appears to have been a conscious decision by @lightsighter and @streichler, so I'll let them comment.
That's correct. Logging infrastructure should not dictate how error handling is performed. The Realm logging infrastructure will report things at different levels including the error level, but it will not automatically error out your application. Error handling is the responsibility of the client. Legion likewise doesn't specify how errors inside your tasks should be signaled, for the same reason: we don't want to force you into handling errors a particular way.
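To make the separation between logging and control flow concrete, here is a small sketch. The `sketch::Logger` class is a hypothetical toy and not Realm's or Legion's logger, and `fail` is an illustrative client-side helper: the logger only reports, and it is the client's choice to stop afterwards (here by throwing).

```cpp
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical minimal logger (assumption: not the Realm/Legion API).
class Logger {
 public:
  explicit Logger(std::string name) : name_(std::move(name)) {}

  // Reports at error level and returns; the caller's control flow is untouched.
  void error(const std::string& msg) {
    ++errors_;
    std::cerr << "{" << name_ << "} error: " << msg << '\n';
  }

  int error_count() const { return errors_; }

 private:
  std::string name_;
  int errors_ = 0;
};

// Client-side policy: log the problem, then stop this unit of work by throwing.
[[noreturn]] void fail(Logger& log, const std::string& msg) {
  log.error(msg);
  throw std::runtime_error(msg);
}

}  // namespace sketch
```

Calling `log.error(...)` alone lets execution continue, which matches the behavior reported in this issue. A task that must not continue needs an explicit step such as `fail(log, ...)` or an abort.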
Actually, Realm has a change in the works that will result in `fatal` messages terminating the application:
https://gitlab.com/StanfordLegion/legion/-/commit/4735e033938ae0ef295910977e30b2d0900ac087
The application has the ability to register a callback that Realm will call before terminating the application (e.g. to save user data), but there's no way to "soldier on" from there.
The `fatal` message level, though, is different from the `error` message level, right?
Correct. Realm uses `error` when it's going to continue operation, but is pretty certain that the application is not going to produce the intended result.
> I am not seeing this behavior; whether I insert a crash through an assertion or an uncaught exception, the exit code of the process is always 1:
Ah, I meant the case where an error message is emitted through the logger.
> In that case the task should throw a special `legate::TaskException`
Thank you for sharing.
> Actually, Realm has a change in the works that will result in `fatal` messages terminating the application:
Nice!
Thank you all for the informative replies! I will close this issue now that it's clear.