StanfordLegion/legion

Legion: Deserializer segfault

syamajala opened this issue · 5 comments

I am occasionally seeing segfaults in S3D on Frontier.

Here is what I see in the stack traces:

[246] Thread 7 (Thread 0x7fff8db36fc0 (LWP 46425) "s3d.x"):
[246] #0  0x00007fffe4af274f in wait4 () from /lib64/libc.so.6
[246] #1  0x00007fffe4a69ba7 in do_system () from /lib64/libc.so.6
[246] #2  0x00007fffe15a8ef6 in gasneti_system_redirected () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/language/build/lib/librealm.so.1
[246] #3  0x00007fffe15a889b in gasneti_bt_gdb () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/language/build/lib/librealm.so.1
[246] #4  0x00007fffe159f0cf in gasneti_print_backtrace () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/language/build/lib/librealm.so.1
[246] #5  0x00007fffe16bc52a in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/language/build/lib/librealm.so.1
[246] #6  <signal handler called>
[246] #7  Legion::Deserializer::deserialize<unsigned long> (this=0x7f869937a108, element=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/legion/legion_utilities.h:1117
[246] #8  Legion::Internal::TimeoutMatchExchange::unpack_collective_stage (this=<optimized out>, derez=..., stage=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/legion/legion_replication.cc:16944
[246] #9  0x00007fffe24b0105 in Legion::Internal::AllGatherCollective<false>::unpack_stage (this=0x7f6c30ffc580, stage=stage@entry=2, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/legion/legion_replication.cc:12713
[246] #10 0x00007fffe24affc3 in Legion::Internal::AllGatherCollective<false>::handle_collective_message (this=0x7f6c30ffc580, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/legion/legion_replication.cc:12455
[246] #11 0x00007fffe26af398 in Legion::Internal::VirtualChannel::handle_messages (this=this@entry=0x7ffa8018ecf0, num_messages=num_messages@entry=1, runtime=0x0, runtime@entry=0x644e500, remote_address_space=<optimized out>, remote_address_space@entry=118, args=0x7f6b6d7aa594 "", args@entry=0x7f6b6d7aa580 "1\001", arglen=<optimized out>, arglen@entry=38796) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/legion/runtime.cc:13376
[246] #12 0x00007fffe26af178 in Legion::Internal::VirtualChannel::process_message (this=0x7ffa8018ecf0, args=args@entry=0x0, arglen=<optimized out>, arglen@entry=105178368, runtime=0x1856a70 <__tracer_m_MOD_dloc+2805328>, remote_address_space=118) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/legion/runtime.cc:11833
[246] #13 0x00007fffe26ebdb0 in Legion::Internal::MessageManager::receive_message (this=0x1856a68 <__tracer_m_MOD_dloc+2805320>, args=<optimized out>, arglen=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/legion/runtime.cc:13524
[246] #14 0x00007fffe2702e28 in Legion::Internal::Runtime::legion_runtime_task (args=<optimized out>, arglen=6100, userdata=<optimized out>, userlen=<optimized out>, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/legion/runtime.cc:32338
[246] #15 0x00007fffe0c52b6d in Realm::LocalTaskProcessor::execute_task (this=0x71451f0, func_id=4, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/realm/proc_impl.cc:1175
[246] #16 0x00007fffe0c933bc in Realm::Task::execute_on_processor (this=0x7f6a65a98cd0, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/realm/tasks.cc:326
[246] #17 0x00007fffe0c999d3 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=0x0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/realm/tasks.cc:1687
[246] #18 0x00007fffe0c96cbf in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x63167c0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/realm/tasks.cc:1160
[246] #19 0x00007fffe0ca18fd in Realm::UserThread::uthread_entry () at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz_subrank/legion/runtime/realm/threads.cc:1355
[246] #20 0x00007fffe4a72600 in ?? () from /lib64/libc.so.6
[246] #21 0x0000000000000000 in ?? ()

What was the commit hash for this? The line numbers don't match anything obvious.

I believe it was:

commit 13cb4852e519f58b910a824674c157959dcb43e8 (HEAD)
Author: Mike <mebauer@cs.stanford.edu>
Date:   Fri Nov 3 01:35:34 2023 -0700

    legion: disable dumb compiler warnings that do not understand what they are talking about, compiler writers need to get off their butts and write a proper context-sensitive static analysis if they want to do crap like this

But I'm not 100% sure because I seem to have misplaced the logs from this crash.

The line numbers match slightly better with that commit, but that would mean the stack is not aligned or got smashed or something.

What is the signal for this error? Can you also dump the assembly of the few instructions around the instruction that crashed? If possible, also print the values of num_timeouts and &key.first in frame 8.
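For reference, here is a rough sketch of gdb commands that could collect this, assuming a core file or a still-attached process is available the next time it crashes. The thread and frame numbers are taken from the backtrace above and will likely differ in another run, and $_siginfo is only populated when gdb has the signal information (for example, from a Linux core file).

```gdb
# Select the thread that crashed (thread 7 in the backtrace above).
thread 7

# Signal number and, for SIGSEGV/SIGBUS, the faulting address.
# Assumes gdb has siginfo for this thread (e.g. from a Linux core file).
print $_siginfo.si_signo
print $_siginfo._sifields._sigfault.si_addr

# Frame 7 is Deserializer::deserialize<unsigned long>, the frame just
# below <signal handler called>, so its $pc is the faulting instruction.
frame 7
# Disassemble the enclosing function; gdb marks the current instruction with "=>".
disassemble
# Register contents at the fault, to see which address was dereferenced.
info registers

# Frame 8 is TimeoutMatchExchange::unpack_collective_stage.
frame 8
# These may print <optimized out> in an optimized build.
print num_timeouts
print &key.first
```

If the locals come back as <optimized out>, the register dump from frame 7 together with the disassembly is usually enough to tell which pointer was bad.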

I have not been able to reproduce this again and am seeing the issue reported in #1595 instead. If I see this again I will reopen this issue.