nv-legate/legate.core

`Legion::Logger::error/fatal` doesn't stop the task.

trivialfis opened this issue · 6 comments

Legion::Logger::error and fatal function calls don't stop the execution of a task. In addition, the legate executable returns normally even if an error happens inside a task.

Also, is there any guidance on error handling inside tasks? For instance, what's the best way to indicate a task is not running correctly? Or is there a way to emit an error message explaining misuse by the caller?

Legion::Logger::error and fatal function calls don't stop the execution of a task.

This appears to have been a conscious decision by @lightsighter and @streichler, so I'll let them comment.

In addition, the legate executable returns normally even if an error happens inside a task.

I am not seeing this behavior; whether I insert a crash through an assertion or an uncaught exception, the exit code of the process is always 1:

~/cunumeric> LEGATE_TEST=1 legate --cpus 1 a.py
[0 - 7ff8548e9700]    0.000193 {4}{threads}: reservation ('Python-1 proc 1d00000000000003') cannot be satisfied
Assertion failed: (0), function cpu_variant, file matvecmul.cc, line 32.
Signal 6 received by node 0, process 33498 (thread 700001f4b000) - obtaining backtrace
Signal 6 received by process 33498 (thread 700001f4b000) at: stack trace: 16 frames
  [0] = 0   libsystem_platform.dylib            0x00007ff81123d5ec _sigtramp + 28
  [1] = 0   libsystem_kernel.dylib              0x00007ff8111d8203 __pthread_kill + 11
  [2] = 0   libsystem_pthread.dylib             0x00007ff81120fee6 pthread_kill + 263
  [3] = 0   libsystem_c.dylib                   0x00007ff811136b45 abort + 123
  [4] = 0   libsystem_c.dylib                   0x00007ff811135e5e err + 0
  [5] = 0   libcunumeric.dylib                  0x00000001b119ca3b _ZN9cunumeric13MatVecMulTask11cpu_variantERN6legate11TaskContextE + 43
  [6] = 0   liblgcore.dylib                     0x0000000110c24a25 _ZN6legate6detail12task_wrapperEPFvRNS_11TaskContextEERKNSt3__112basic_stringIcNS5_11char_traitsIcEENS5_9allocatorIcEEEEPKvmSF_mN5Realm9ProcessorE + 181
  [7] = 0   libcunumeric.dylib                  0x00000001b119ce70 _ZN6legate10LegateTaskIN9cunumeric13MatVecMulTaskEE19legate_task_wrapperIXadL_ZNS2_11cpu_variantERNS_11TaskContextEEEEEvPKvmS8_mN5Realm9ProcessorE + 80
  [8] = 0   librealm.1.dylib                    0x000000011b14c048 _ZN5Realm18LocalTaskProcessor12execute_taskEjRKNS_12ByteArrayRefE + 1272
  [9] = 0   librealm.1.dylib                    0x000000011b1eb83b _ZN5Realm4Task20execute_on_processorENS_9ProcessorE + 891
  [10] = 0   librealm.1.dylib                    0x000000011b1f254c _ZN5Realm25KernelThreadTaskScheduler12execute_taskEPNS_4TaskE + 44
  [11] = 0   librealm.1.dylib                    0x000000011b1f0bd4 _ZN5Realm21ThreadedTaskScheduler14scheduler_loopEv + 1396
  [12] = 0   librealm.1.dylib                    0x000000011b1f15da _ZN5Realm21ThreadedTaskScheduler20scheduler_loop_wlockEv + 42
  [13] = 0   librealm.1.dylib                    0x000000011b204cec _ZN5Realm6Thread20thread_entry_wrapperINS_21ThreadedTaskSchedulerEXadL_ZNS2_20scheduler_loop_wlockEvEEEEvPv + 92
  [14] = 0   librealm.1.dylib                    0x000000011b208dc8 _ZN5Realm12KernelThread13pthread_entryEPv + 472
  [15] = 0   libsystem_pthread.dylib             0x00007ff8112101d3 _pthread_start + 125
~/cunumeric> echo $?
1
~/cunumeric> LEGATE_TEST=1 legate --cpus 1 a.py
[0 - 7ff8548e9700]    0.000225 {4}{threads}: reservation ('Python-1 proc 1d00000000000003') cannot be satisfied
libc++abi: terminating due to uncaught exception of type std::runtime_error: something went wrong
Signal 6 received by node 0, process 33601 (thread 70000bd86000) - obtaining backtrace
Signal 6 received by process 33601 (thread 70000bd86000) at: stack trace: 21 frames
  [0] = 0   libsystem_platform.dylib            0x00007ff81123d5ec _sigtramp + 28
  [1] = 0   libsystem_kernel.dylib              0x00007ff8111d8203 __pthread_kill + 11
  [2] = 0   libsystem_pthread.dylib             0x00007ff81120fee6 pthread_kill + 263
  [3] = 0   libsystem_c.dylib                   0x00007ff811136b45 abort + 123
  [4] = 0   libc++abi.dylib                     0x00007ff8111ca282 abort_message + 241
  [5] = 0   libc++abi.dylib                     0x00007ff8111bc3e1 _ZL28demangling_terminate_handlerv + 241
  [6] = 0   libobjc.A.dylib                     0x00007ff810e907d6 _ZL15_objc_terminatev + 104
  [7] = 0   libc++abi.dylib                     0x00007ff8111c96db _ZSt11__terminatePFvvE + 6
  [8] = 0   libc++abi.dylib                     0x00007ff8111cbfa7 __cxa_get_exception_ptr + 0
  [9] = 0   libc++abi.dylib                     0x00007ff8111cbf6e _ZN10__cxxabiv1L22exception_cleanup_funcE19_Unwind_Reason_CodeP17_Unwind_Exception + 0
  [10] = 0   libcunumeric.dylib                  0x00000001aadd19f8 _ZN9cunumeric13MatVecMulTask11cpu_variantERN6legate11TaskContextE + 72
  [11] = 0   liblgcore.dylib                     0x000000010a858a25 _ZN6legate6detail12task_wrapperEPFvRNS_11TaskContextEERKNSt3__112basic_stringIcNS5_11char_traitsIcEENS5_9allocatorIcEEEEPKvmSF_mN5Realm9ProcessorE + 181
  [12] = 0   libcunumeric.dylib                  0x00000001aadd1e50 _ZN6legate10LegateTaskIN9cunumeric13MatVecMulTaskEE19legate_task_wrapperIXadL_ZNS2_11cpu_variantERNS_11TaskContextEEEEEvPKvmS8_mN5Realm9ProcessorE + 80
  [13] = 0   librealm.1.dylib                    0x0000000114d80048 _ZN5Realm18LocalTaskProcessor12execute_taskEjRKNS_12ByteArrayRefE + 1272
  [14] = 0   librealm.1.dylib                    0x0000000114e1f83b _ZN5Realm4Task20execute_on_processorENS_9ProcessorE + 891
  [15] = 0   librealm.1.dylib                    0x0000000114e2654c _ZN5Realm25KernelThreadTaskScheduler12execute_taskEPNS_4TaskE + 44
  [16] = 0   librealm.1.dylib                    0x0000000114e24bd4 _ZN5Realm21ThreadedTaskScheduler14scheduler_loopEv + 1396
  [17] = 0   librealm.1.dylib                    0x0000000114e255da _ZN5Realm21ThreadedTaskScheduler20scheduler_loop_wlockEv + 42
  [18] = 0   librealm.1.dylib                    0x0000000114e38cec _ZN5Realm6Thread20thread_entry_wrapperINS_21ThreadedTaskSchedulerEXadL_ZNS2_20scheduler_loop_wlockEvEEEEvPv + 92
  [19] = 0   librealm.1.dylib                    0x0000000114e3cdc8 _ZN5Realm12KernelThread13pthread_entryEPv + 472
  [20] = 0   libsystem_pthread.dylib             0x00007ff8112101d3 _pthread_start + 125
~/cunumeric> echo $?
1

Also, is there any guidance on error handling inside tasks? For instance, what's the best way to indicate a task is not running correctly? Or is there a way to emit an error message explaining misuse by the caller?

Most user-caused (and thus recoverable) error conditions should get detected and reported before worker tasks get launched. We typically use standard Python exceptions to report such errors (ValueError, TypeError, ...), which the user code can ostensibly catch and recover from. The Legate library writer is responsible for checking and sanitizing inputs as much as possible before they make it to the task body.

Sometimes a problematic input is only detected as part of running the computation (e.g. during a linear solve it is discovered that the matrix is singular). In that case the task should throw a special legate::TaskException, which will get caught and translated to a corresponding exception on the calling side, which the user can presumably catch and recover from. However, there are a lot of caveats with this system, so it's preferable to catch user errors before the launch if possible.
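
For concreteness, a minimal sketch of that pattern inside a task's CPU variant; the task body and the singular-matrix check are made up for illustration, and the exact legate::TaskException constructor may vary between legate.core versions:

```cpp
#include "legate.h"

// Hypothetical CPU variant of a solver task; the singular-matrix flag stands in
// for any data-dependent error that is only discovered mid-computation.
void cpu_variant(legate::TaskContext& context)
{
  bool singular = false;  // imagine this is set while factorizing the input
  // ... factorize the input store obtained from `context` ...
  if (singular) {
    // TaskException is intercepted by the legate runtime and translated into a
    // corresponding exception on the calling side, where the user can catch it.
    throw legate::TaskException("matrix is singular");
  }
}
```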

Any remaining errors are internal errors that the user cannot do anything about, and these should just terminate the execution. You can either report an error on the appropriate logger and then call the LEGATE_ABORT macro, or throw any exception besides legate::TaskException. I believe we are slowly transitioning to the latter.
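
As a rough sketch of that internal-error path (the logger name and helper are made up, and the exact form of the LEGATE_ABORT macro has changed across legate.core versions):

```cpp
#include "legate.h"
#include "legion.h"

// Hypothetical logger for the library; the name "mylib" is made up.
static Legion::Logger log_mylib("mylib");

void check_internal_invariant(bool invariant_holds)
{
  if (!invariant_holds) {
    // Option 1: report through the logger, then abort the process.
    log_mylib.error("internal invariant violated");
    LEGATE_ABORT;  // some legate.core versions take a message argument here

    // Option 2 (reportedly the direction the codebase is moving): throw any
    // exception other than legate::TaskException; it is not forwarded to the
    // caller and terminates execution instead, e.g.
    // throw std::runtime_error("internal invariant violated");
  }
}
```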

Note that this is just my interpretation of our current practices, since this is not officially documented anywhere (and that's something we should fix). Inviting @magnatelee and @jjwilke to comment further.

This appears to have been a conscious decision by @lightsighter and @streichler, so I'll let them comment.

That's correct. Logging infrastructure should not dictate how error handling is performed. The Realm logging infrastructure will report messages at different levels, including the error level, but it will not automatically abort your application. Error handling is the responsibility of the client. For the same reason, Legion doesn't specify how you should handle an error that happens inside your tasks: we don't want to force you into handling errors a particular way.
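
To illustrate, a minimal sketch of a logger used purely for reporting; the logger name is made up, and (per the Realm change linked below) the behavior of the fatal level may differ in newer versions:

```cpp
#include "legion.h"

// Hypothetical logger; messages appear under the "mylib" category.
static Legion::Logger log_example("mylib");

void report_only()
{
  // Each call just emits a message at its severity level; neither one, on its
  // own, stops the task or the process.
  log_example.error("the result is unlikely to be what you intended");
  log_example.fatal("something is seriously wrong");
}
```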

Actually, Realm has a change in the works that will result in fatal messages terminating the application:
https://gitlab.com/StanfordLegion/legion/-/commit/4735e033938ae0ef295910977e30b2d0900ac087

The application has the ability to register a callback that Realm will call before terminating the application (e.g. to save user data), but there's no way to "soldier on" from there.

The fatal message level, though, is different from the error message level, right?

Correct. Realm uses error when it's going to continue operation, but is pretty certain that the application is not going to produce the intended result.

I am not seeing this behavior; whether I insert a crash through an assertion or an uncaught exception, the exit code of the process is always 1:

Ah, I meant the case where an error message is only emitted through the logger.

In that case the task should throw a special legate::TaskException

Thank you for sharing.

Actually, Realm has a change in the works that will result in fatal messages terminating the application:

Nice!

Thank you all for the informative replies! I will close this issue now that it's clear.