StanfordLegion/legion

Realm: Assertion `(level_needed <= n->level) && (index >= n->first_index) && (index <= n->last_index)' failed


I am seeing the following crash with S3D:

realloc(): invalid pointer
Legion process received signal 6: Aborted
Process 2375256 on node g0003.stanford.edu is frozen!
s3d.x: /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/dynamic_table.inl:264: Realm::DynamicTable<ALLOCATOR>::ET* Realm::DynamicTable<ALLOCATOR>::lookup_entry(Realm::DynamicTable<ALLOCATOR>::IT, int, Realm::DynamicTable<ALLOCATOR>::ET**, Realm::DynamicTable<ALLOCATOR>::ET**) [with ALLOCATOR = Realm::DynamicTableAllocator<Realm::BarrierImpl, 10, 4>; Realm::DynamicTable<ALLOCATOR>::ET = Realm::BarrierImpl; Realm::DynamicTable<ALLOCATOR>::IT = long long unsigned int]: Assertion `(level_needed <= n->level) && (index >= n->first_index) && (index <= n->last_index)' failed.
Legion process received signal 6: Aborted
Process 2375256 on node g0003.stanford.edu is frozen!
*** stack smashing detected ***: terminated

Here is a stack trace:

    req=req@entry=0x7feba8047920, rem=rem@entry=0x7feba8047920) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x00007ff49561aec7 in __GI___nanosleep (requested_time=requested_time@entry=0x7feba8047920, 
    remaining=remaining@entry=0x7feba8047920) at nanosleep.c:27
#2  0x00007ff49561adfe in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007ff47f6afacf in Realm::realm_freeze (signal=6)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/runtime_impl.cc:200
#4  <signal handler called>
#5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#6  0x00007ff49555a859 in __GI_abort () at abort.c:79
#7  0x00007ff49555a729 in __assert_fail_base (fmt=0x7ff4956f0588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=0x7ff48035af80 "(level_needed <= n->level) && (index >= n->first_index) && (index <= n->last_index)", 
    file=0x7ff480359b68 "/scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/dynamic_table.inl", line=264, 
    function=<optimized out>) at assert.c:92
#8  0x00007ff49556bfd6 in __GI___assert_fail (
    assertion=0x7ff48035af80 "(level_needed <= n->level) && (index >= n->first_index) && (index <= n->last_index)", 
    file=0x7ff480359b68 "/scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/dynamic_table.inl", line=264, 
    function=0x7ff48035b4f0 "Realm::DynamicTable<ALLOCATOR>::ET* Realm::DynamicTable<ALLOCATOR>::lookup_entry(Realm::DynamicTable<ALLOCATOR>::IT, int, Realm::DynamicTable<ALLOCATOR>::ET**, Realm::DynamicTable<ALLOCATOR>::ET**) [w"...)
    at assert.c:101
#9  0x00007ff47f56143e in Realm::DynamicTable<Realm::DynamicTableAllocator<Realm::BarrierImpl, 10ul, 4ul> >::lookup_entry (this=0x55600c217e10, index=1, owner=0, free_list_head=0x0, free_list_tail=0x0)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/dynamic_table.inl:264
#10 0x00007ff47f6bcc66 in Realm::RuntimeImpl::get_barrier_impl (this=0x55600c1bed50, e=...)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/runtime_impl.cc:2945
#11 0x00007ff47f5478dd in Realm::Barrier::arrive (this=0x7febc3fde430, count=1, wait_on=..., reduce_value=0x0, 
    reduce_value_size=0) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/event_impl.cc:524
#12 0x00007ff483186e44 in Legion::Internal::Runtime::phase_barrier_arrive (bar=..., count=1, precondition=..., 
    reduce_value=0x0, reduce_value_size=0) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/legion/runtime.h:5609
#13 0x00007ff4836d004e in Legion::Internal::LegionHandshakeImpl::legion_handoff_to_ext (this=0x55600bfe6000)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/legion/runtime.cc:6579
#14 0x00007ff48322e894 in Legion::LegionHandshake::legion_handoff_to_ext (this=0x55600bc69230)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/legion/legion.cc:2225
#15 0x00007ff495cf9ef2 in Legion::MPILegionHandshake::legion_handoff_to_mpi (this=0x55600bc69230)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/legion.h:4324
#16 0x00007ff495cf80e2 in legion_handoff_to_mpi () at s3d_rank_wrapper.cc:334
#17 0x00007ff4952e7ea9 in $<HandoffToMPITask> () from /scratch2/seshu/legion_s3d_tdb//build/shelld/libmpi_tasks.so
#18 0x00007ff4952c12e9 in $__regent_task_HandoffToMPITask_primary ()
   from /scratch2/seshu/legion_s3d_tdb//build/shelld/libmpi_tasks.so
#19 0x00007ff47f68473c in Realm::LocalTaskProcessor::execute_task (this=0x55600c385040, func_id=22, task_args=...)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/proc_impl.cc:1175
#20 0x00007ff47f70952a in Realm::Task::execute_on_processor (this=0x7febaa692280, p=...)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/tasks.cc:326
#21 0x00007ff47f70d926 in Realm::KernelThreadTaskScheduler::execute_task (this=0x55600c385360, task=0x7febaa692280)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/tasks.cc:1421
#22 0x00007ff47f70c66a in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55600c385360)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/tasks.cc:1160
#23 0x00007ff47f70ccbb in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55600c385360)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/tasks.cc:1272
#24 0x00007ff47f714dc2 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55600c385360) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/threads.inl:97
#25 0x00007ff47f72209c in Realm::KernelThread::pthread_entry (data=0x7feba8007060)
    at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/threads.cc:781
#26 0x00007ff47e2d0609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#27 0x00007ff495657353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I am on an older commit of the runtime:

commit 13cb4852e519f58b910a824674c157959dcb43e8 (HEAD)
Author: Mike <mebauer@cs.stanford.edu>
Date:   Fri Nov 3 01:35:34 2023 -0700

    legion: disable dumb compiler warnings that do not understand what they are talking about, compiler writers need to get off their butts and write a proper context-sensitive static analysis if they want to do crap like this

That's odd. From the call stack we do seem to be entering the phase barriers.

@syamajala Are you running into this crash consistently? What branch are you on?

Yes, it is very consistent. I think I am on an older version of control_replication, but I'm not 100% sure. I know I used this commit to scale to 8192 nodes on Frontier in November, and I want to do runs at that scale again, so that's why I'm using this commit.

I see this on the latest control replication as well.

s3d.x: /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/dynamic_table.inl:307: Realm::DynamicTable<ALLOCATOR>::ET* Realm::DynamicTable<ALLOCATOR>::lookup_entry(Realm::DynamicTable<ALLOCATOR>::IT, int, Realm::DynamicTable<ALLOCATOR>::ET**, Realm::DynamicTable<ALLOCATOR>::ET**) [with ALLOCATOR = Realm::DynamicTableAllocator<Realm::BarrierImpl, 10, 4>; Realm::DynamicTable<ALLOCATOR>::ET = Realm::BarrierImpl; Realm::DynamicTable<ALLOCATOR>::IT = long long unsigned int]: Assertion `(child != 0) && (child->level == (n_level - 1)) && (index >= child->first_index) && (index <= child->last_index)' failed.
Legion process received signal 6: Aborted
Process 3857564 on node gpu002 is frozen!
Legion process received signal 11: Segmentation fault
Process 4124403 on node gpu001 is frozen!
Legion process received signal 11: Segmentation fault
Process 4124403 on node gpu001 is frozen!
Legion process received signal 11: Segmentation fault
Process 3857564 on node gpu002 is frozen!

Stack trace:

#0  0x00001555527be238 in nanosleep () from /lib64/libc.so.6
#1  0x00001555527be13e in sleep () from /lib64/libc.so.6
#2  0x000015552f645be4 in Realm::realm_freeze (signal=11) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/runtime_impl.cc:204
#3  <signal handler called>
#4  0x000015553302bee6 in Legion::Internal::ApBarrier::operator Realm::Barrier (this=0xbd73af410bcfad4b) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion/legion_types.h:2964
#5  0x000015553311090f in Legion::Internal::Runtime::phase_barrier_arrive (bar=..., count=1, precondition=..., reduce_value=0x0, reduce_value_size=0) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion/runtime.h:5563
#6  0x00001555336034b4 in Legion::Internal::LegionHandshakeImpl::legion_handoff_to_ext (this=0xbd73af410bcfad0b) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion/runtime.cc:6734
#7  0x00001555331a80b2 in Legion::LegionHandshake::legion_handoff_to_ext (this=0x46b1050) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion/legion.cc:2242
#8  0x0000155554792ec4 in Legion::MPILegionHandshake::legion_handoff_to_mpi (this=0x46b1050) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion.h:4384
#9  0x000015555479138d in legion_handoff_to_mpi () at s3d_rank_wrapper.cc:334
#10 0x0000155551e57ce9 in $<HandoffToMPITask> () from /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/build/shelld/libmpi_tasks.so
#11 0x0000155551e31119 in $__regent_task_HandoffToMPITask_primary () from /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/build/shelld/libmpi_tasks.so
#12 0x000015552f61f046 in Realm::LocalTaskProcessor::execute_task (this=0x4a4d6a0, func_id=36, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/proc_impl.cc:1176
#13 0x000015552f695d10 in Realm::Task::execute_on_processor (this=0x155082794950, p=...) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/tasks.cc:326
#14 0x000015552f699c3c in Realm::KernelThreadTaskScheduler::execute_task (this=0x4a4da40, task=0x155082794950) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/tasks.cc:1421
#15 0x000015552f698abb in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4a4da40) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/tasks.cc:1160
#16 0x000015552f6990de in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x4a4da40) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/tasks.cc:1272
#17 0x000015552f6a021a in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x4a4da40) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/threads.inl:97
#18 0x000015552f6ac5a1 in Realm::KernelThread::pthread_entry (data=0x155082505ce0) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/realm/threads.cc:831
#19 0x0000155552ac02de in start_thread () from /lib64/libpthread.so.0
#20 0x00001555527f1e83 in clone () from /lib64/libc.so.6

What is the hex value of bar in frame 5?

Not able to print bar:

>>> frame 5
#5  0x000015553311090f in Legion::Internal::Runtime::phase_barrier_arrive (bar=..., count=1, precondition=..., reduce_value=0x0, reduce_value_size=0) at /lustre/scratch/vsyamaj/legion_s3d_tdb_subranks/legion/runtime/legion/runtime.h:5563
5563          Realm::Barrier copy = bar.phase_barrier;
>>> p/x bar
$1 = (const Legion::PhaseBarrier &) <error reading variable>

This is a full debug build.

If you can't read a variable on the stack like that, something terrible has gone wrong, like the OS protected the memory on the stack while trying to clean up the process. What does *this show in frame 6?

It must be some sort of corruption. So, clearly the event given at frame #10 was valid:

    BarrierImpl *RuntimeImpl::get_barrier_impl(Event e) {                   
      ID id(e);                                                             
      assert(id.is_barrier());                                              

otherwise we would have hit this assert.

Then we go into:

      BarrierImpl *impl = n->barriers.lookup_entry(id.barrier_barrier_idx(),
                                                   id.barrier_creator_node());

which is the dynamic table lookup. It's certainly not a bug in the dynamic table... at least not a recent one, and we don't seem to have a chance here to mess something up with the barrier itself, since we are merely trying to fetch the underlying allocation.
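
For context, the assertion that fired is just a sanity check that the requested index falls inside the index range covered by the table's root node; a barrier ID whose bits have been scribbled on decodes to an index far outside that range. A minimal standalone sketch of that invariant (not Realm's actual code, just the shape of the check):

    #include <cassert>
    #include <cstdint>
    #include <cstdio>

    // Each node of the hierarchical table covers [first_index, last_index]
    // at some level; lookup_entry expects the requested index to be
    // reachable from the root.
    struct TableNode {
      int level;
      uint64_t first_index;
      uint64_t last_index;
    };

    // Same shape as the check at dynamic_table.inl:264, nothing more.
    bool index_is_reachable(const TableNode &root, uint64_t index, int level_needed) {
      return (level_needed <= root.level) &&
             (index >= root.first_index) &&
             (index <= root.last_index);
    }

    int main() {
      TableNode root{/*level=*/1, /*first_index=*/0, /*last_index=*/1023};
      assert(index_is_reachable(root, 1, 0));          // a sane barrier index passes
      uint64_t garbage = 0xbd73af410bcfad4bULL >> 20;  // index decoded from a corrupted ID
      std::printf("reachable=%d\n", index_is_reachable(root, garbage, 0) ? 1 : 0);
      return 0;
    }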

You need to be a bit careful. The backtrace that @syamajala reported above is not actually the one that hit the assertion. It never even made it into Realm. It's still trying to copy one barrier on the stack to another and getting a segfault when doing that (which is very strange). Somewhere else there is a thread that actually hit the assertion. This reeks of memory corruption somewhere that is probably corrupting the name of the other barrier and might even be scribbling badly on this stack. @syamajala can you make a reproducer on sapling?

I will try to reproduce on sapling.

I am seeing a strange issue when I try to run on sapling.

output.target_procs is empty in map_task:

s3d.x: rhst_mapper.cc:590: virtual void RHSTMapper::map_task(MapperContext, const Task&, const Legion::Mapping::Mapper::MapTaskInput&, Legion::Mapping::Mapper::MapTaskOutput&): Assertion `output.target_procs.size() == 1' failed.

You're expecting it to be non-empty because it's replicated? Is it because you're running on just one node, and therefore there is no replication, so output.target_procs is not prepopulated?
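
If that's the cause, the mapper-side fix is just to populate it yourself before the assertion. A minimal sketch (the helper name is hypothetical, and it assumes the task's existing target_proc is an acceptable choice; RHSTMapper's real selection policy may differ):

    #include "legion.h"

    // Sketch only: call this at the top of map_task.  When the runtime has
    // not prepopulated output.target_procs (e.g. no replication on a
    // single-node run), fall back to the processor the task was already
    // targeted at.
    static void ensure_single_target(const Legion::Task &task,
                                     Legion::Mapping::Mapper::MapTaskOutput &output)
    {
      if (output.target_procs.empty())
        output.target_procs.push_back(task.target_proc);
    }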

Yes, I'm only running on 1 node. I'm not sure if I fixed it properly, because the error I'm seeing now after populating output.target_procs is this:

s3d.x: /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/event_impl.cc:2304: void Realm::BarrierImpl::adjust_arrival(Realm::EventImpl::gen_t, int, Realm::Barrier::timestamp_t, Realm::Event, Realm::NodeID, bool, const void*, size_t, Realm::TimeLimit): Assertion `base_arrival_count > 0' failed.

There is a process on g0003: 2941108.

This looks like an application bug to me, like the way that the handshakes are being used is no longer correct. Can you run with -level barrier=1 and get it to crash again with a log file for what the barriers are doing?
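
For reference, the pairing the handshake expects looks roughly like the sketch below (a minimal sketch with illustrative participant counts and wiring, not S3D's actual code); mismatched handoff/wait calls on the two sides are the usual way to misuse the underlying phase barriers:

    #include "legion.h"
    using namespace Legion;

    static MPILegionHandshake handshake;

    void setup()
    {
      // One MPI participant and one Legion participant per rank (assumption),
      // with MPI holding control first.
      handshake = Runtime::create_handshake(true /*init in MPI*/, 1, 1);
    }

    void mpi_side_iteration()
    {
      /* ... MPI/Fortran work ... */
      handshake.mpi_handoff_to_legion();   // let the Legion side run
      handshake.mpi_wait_on_legion();      // block until Legion hands back
    }

    void legion_side_iteration()
    {
      handshake.legion_wait_on_mpi();      // wait for the MPI side to hand off
      /* ... Legion work ... */
      handshake.legion_handoff_to_mpi();   // hand control back to MPI
    }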

It did crash with -level barrier=1. The log is here: /scratch2/seshu/legion_s3d_tdb/Shell1D/DBO_Test_1/run/run_0.log

I left the process 2965166 running on g0003.

It's not the barriers. There is heap corruption going on in this application. This run died in the logger while iterating over the list of streams for that logger:

#3  0x00007f9a5aa42103 in Realm::realm_freeze (signal=11) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/runtime_impl.cc:204
#4  <signal handler called>
#5  Realm::Logger::log_msg (this=0x7f9a5c083400 <Realm::log_barrier>, level=Realm::Logger::LEVEL_INFO, msgdata=0x7f902afde130 "barrier adjustment: event=2000000000100004 delta=-1 ts=0\020T:\b\220\177", msglen=56) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/logging.cc:584
#6  0x00007f9a72dc4dfb in Realm::LoggerMessage::~LoggerMessage (this=0x7f902afde0d0, __in_chrg=<optimized out>) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/logging.inl:590
#7  0x00007f9a5a8e2416 in Realm::BarrierImpl::adjust_arrival (this=0x7f90083a5630, barrier_gen=4, delta=-1, timestamp=0, wait_on=..., sender=0, forwarded=false, reduce_value=0x0, reduce_value_size=0, work_until=...) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/event_impl.cc:2262
#8  0x00007f9a5a8da158 in Realm::Barrier::arrive (this=0x7f902afde430, count=1, wait_on=..., reduce_value=0x0, reduce_value_size=0) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/realm/event_impl.cc:513
#9  0x00007f9a5e5dd66e in Legion::Internal::Runtime::phase_barrier_arrive (bar=..., count=1, precondition=..., reduce_value=0x0, reduce_value_size=0) at /scratch2/seshu/legion_s3d_tdb/legion/runtime/legion/runtime.h:5564

The STL vector of streams in the logger has been completely stomped on and has nonsensical pointers:

(gdb) p streams
$13 = std::vector of length 2, capacity 2 = {{
    s = 0xffffff9cffffff9c,
    min_level = 4294967196,
    delete_when_done = 156,
    flush_each_write = 255
  }, {
    s = 0xffffff9cffffff9c,
    min_level = 4294967196,
    delete_when_done = 156,
    flush_each_write = 255
  }}

Can you try running with valgrind?

The last time I tried, valgrind no longer worked with Regent, and it's pretty difficult to test it with S3D.

The version of S3D we're seeing this crash in merges 3 different branches that each work individually, so it's most likely I messed up the merge somewhere. I might try to look through the code first before I try valgrind.

Do what you think is best, but this is definitely heap corruption.

Given the proximity in time to the handshake, I wouldn't be surprised if it had something to do with the packing/unpacking of arguments crossing to/from Fortran and the APIs being different, especially since that property isn't type-checked by any of the compilers, but that is just a guess.

I was stepping out of bounds when copying an array to Fortran. It runs now.
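
For posterity, this is the classic interop off-by-one. A minimal illustration of the kind of bug (not the actual S3D code): a 1-based-vs-0-based mix-up writes one element past the allocation, and the damage only surfaces later in unrelated code, exactly like the logger and barrier crashes above.

    #include <vector>

    // BUG version: looping i <= n with 0-based indexing writes (and reads)
    // one element past the end of both buffers; the corrupted heap metadata
    // shows up much later, in whatever happens to be allocated next.
    void copy_to_fortran_buffer(const std::vector<double> &src, double *fortran_buf)
    {
      const int n = static_cast<int>(src.size());
      for (int i = 0; i <= n; i++)   // should be i < n
        fortran_buf[i] = src[i];
    }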