StanfordLegion/legion

Non-deterministic segmentation fault

mariodirenzo opened this issue · 4 comments

Some of my multi-node executions fail randomly with a segmentation fault that has the following backtrace:

#0  0x000020000beaeb88 in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x000020000beae8bc in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x0000000013f45718 in Realm::realm_freeze (signal=<optimized out>) at /g/g92/direnzo1/legion/runtime/realm/runtime_impl.cc:204
#3  <signal handler called>
#4  0x66664f2200000000 in ?? ()
#5  0x0000000013c02e00 in Legion::Internal::GatherCollective::perform_collective_async (this=0x18828038, precondition=...) at /g/g92/direnzo1/legion/runtime/legion/legion_replication.cc:12029
#6  0x0000000013c00968 in Legion::Internal::ShardCollective::handle_deferred_collective (args=<optimized out>) at /g/g92/direnzo1/legion/runtime/legion/legion_replication.cc:11811
#7  0x000000001356c554 in Legion::Internal::Runtime::legion_runtime_task (args=0x20408a69cdc0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /g/g92/direnzo1/legion/runtime/legion/runtime.cc:32671
#8  0x0000000013f274d8 in Realm::LocalTaskProcessor::execute_task (this=0x3a8fd840, func_id=<optimized out>, task_args=...) at /g/g92/direnzo1/legion/runtime/realm/bytearray.inl:150
#9  0x0000000013f8ee78 in Realm::Task::execute_on_processor (this=0x20408a69cc40, p=...) at /g/g92/direnzo1/legion/runtime/realm/bytearray.inl:39
#10 0x0000000013f8efe4 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /g/g92/direnzo1/legion/runtime/realm/tasks.cc:1421
#11 0x0000000013f8cda8 in Realm::ThreadedTaskScheduler::scheduler_loop (this=this@entry=0x3a8fdb90) at /g/g92/direnzo1/legion/runtime/realm/tasks.cc:1158
#12 0x0000000013f923a4 in scheduler_loop_wlock (this=0x3a8fdb90) at /g/g92/direnzo1/legion/runtime/realm/tasks.cc:1272
#13 Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x3a8fdb90) at /g/g92/direnzo1/legion/runtime/realm/threads.inl:97
#14 0x0000000013f97540 in Realm::KernelThread::pthread_entry (data=0x20408a76cc20) at /g/g92/direnzo1/legion/runtime/realm/threads.cc:831
#15 0x0000200000128cd4 in start_thread (arg=0x2000fd04f8b0) at pthread_create.c:309
#16 0x000020000bef7f14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

I am using Legion at commit ec0c8500ed8491c8122fc83319e824e026e0f95b, compiled in release mode with debug symbols, and I have been able to reproduce this bug by running HTR on 4 nodes for a few hours.

@elliottslaughter, can you please add this issue to #1032?

Try this patch and report back if it fixes the issue:

diff --git a/runtime/legion/legion_replication.cc b/runtime/legion/legion_replication.cc
index 760f76dac..6d0f5a8c3 100644
--- a/runtime/legion/legion_replication.cc
+++ b/runtime/legion/legion_replication.cc
@@ -11979,7 +11979,7 @@ namespace Legion {
         received_notifications(0)
     //--------------------------------------------------------------------------
     {
-      if (expected_notifications > 1)
+      //if (expected_notifications > 1)
         done_event = Runtime::create_rt_user_event();
     }
 
@@ -11992,7 +11992,7 @@ namespace Legion {
         received_notifications(0)
     //--------------------------------------------------------------------------
     {
-      if (expected_notifications > 1)
+      //if (expected_notifications > 1)
         done_event = Runtime::create_rt_user_event();
     }

I've been running for 12 hours without seeing a segmentation fault. I think the patch fixes the issue.

@mariodirenzo Try the latest control replication branch without the patch and see if it works. If so, you can close the issue.

This bug has been fixed. Thanks!