StanfordLegion/legion

Legion: hang at 256 nodes on Frontier

syamajala opened this issue · 207 comments

I'm seeing a hang at 256 nodes on Frontier with S3D running the latest control replication branch. It happens after the first 10 timesteps. I am also using the new GASNet release 2023.9.0.

There are some stack traces here: http://sapling2.stanford.edu/~seshu/s3d_subrank/stacktraces/

Are you sure these backtraces are not changing? If the backtraces are not changing, here's a message send that is getting stuck in a syscall to mprotect trying to pin the buffer:

#0  0x00007fffe4b22177 in mprotect () from /lib64/libc.so.6
#1  0x00007fffe4aac247 in sysmalloc () from /lib64/libc.so.6
#2  0x00007fffe4aad360 in _int_malloc () from /lib64/libc.so.6
#3  0x00007fffe4aae618 in malloc () from /lib64/libc.so.6
#4  0x00007fffe03c7527 in Realm::Serialization::DynamicBufferSerializer::DynamicBufferSerializer (this=0x7f744cfff358, initial_size=4160) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/serialize.inl:158
#5  Realm::ProcessorGroupImpl::spawn_task (this=0x7ffbc61a7fe0, func_id=4, args=0x7ffbc622d8c0, arglen=<optimized out>, reqs=..., start_event=..., finish_event=0x7f4486361d08, finish_gen=1, priority=2) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:791
#6  0x00007fffe03c415a in Realm::Processor::spawn (this=<optimized out>, func_id=4096, args=0x3, arglen=140737030267255, reqs=..., wait_on=..., priority=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:102
#7  0x00007fffe1e83159 in Legion::Internal::VirtualChannel::send_message (this=this@entry=0x7ffbc62241d0, complete=<optimized out>, runtime=<optimized out>, target=..., kind=Legion::Internal::DISTRIBUTED_DOWNGRADE_REQUEST, kind@entry=Legion::Internal::SEND_STARTUP_BARRIER, response=false, shutdown=<optimized out>, send_precondition=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:11584
...

I will do another run and take some stack traces over time, but 256, 512, and 1024 nodes all did not make it past 10 timesteps on Frontier.

I reran and the stack traces seemed to have a lot of profiling stuff in them. It did eventually make progress but it took about 10 minutes to finish 10 timesteps. I ran without profiling and it took 3 minutes.

The stacktraces from running with the profiler are here: http://sapling2.stanford.edu/~seshu/s3d_subrank/stacktraces/

I don't think this is a hang. This is a result of the new profiling changes that log all the mapper API calls and your filesystem not being able to keep up with all the logging that we do now. Try filtering the API call logging with -lg:prof_call_threshold.

That was what I was thinking too, but I couldn't remember the details of what we talked about in the Legion meeting a few weeks ago and wasn't sure if those changes got merged yet. I will try that.

Right now the default is 0. The "old" default, which was hard-coded, was 100us.

I tried -lg:prof_call_threshold 1000 but did not have any luck with that.

How large are the profiling logs that are produced now compared to before?

It does not actually look like there is that much of a difference. For the runs I did on Nov 4th, the profile logs are around 75 MB/node at 128 nodes. The current runs are around 90 MB/node at 128 nodes, but some nodes are significantly larger for some reason, like 223 MB or 165 MB.

Also the logs at 256 nodes are all only 33 KB, so I think it's not even writing anything during the run? Does the runtime only write the profile logs at shutdown?

Does the runtime only write the profile logs at shutdown?

The runtime does buffer profiling data in memory until either we exceed the buffer size (default 512 MB) or we reach the end of the runtime. You can force eager logging by setting the profiling buffer size to 0 with -lg:prof_footprint 0.

I think it would be good to make profiles of the slow runs with profiling on so we can see what is going on.

How does the performance time with profiling OFF compare between the old version and the new version?

I tried some more experiments. -lg:prof_footprint 0 -lg:prof_call_threshold 1000 and writing the logs to the local NVMe disks on the compute nodes instead of the network drive. None of them seemed to make a difference.

It does not seem to be possible to generate profiles from partial logs with -lg:prof_footprint 0. I could try to profile with vtune on sapling if that would be useful.

I never did runs without profiling with the old version so I don't have comparisons. I could compare the runs that did finish up to 128 nodes with profiling.

Tried to do some more runs without profiling but they seem to hang after 10 timesteps. They worked with -ll:force_kthreads though.

What do the profiles with -ll:force_kthreads look like?

Do you have an update on this?

Can you also try running without tracing, and see if it hangs then and whether there is any discrepancy between running with and without profiling when tracing is off?

I'd like to figure this out before we merge the control replication branch.

I will do some more runs.

I compared the runs that did work with the old runs with profiling and they look very similar up to 128 nodes. I'm not able to produce any profiles at 256 nodes with the latest runtime.

It seg faults after 10 timesteps if I turn off tracing:

[104] #3  0x00007fffe0d2072b in gasneti_bt_gdb () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[104] #4  0x00007fffe0d16dff in gasneti_print_backtrace () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[104] #5  0x00007fffe0e3451a in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[104] #6  <signal handler called>
[104] #7  0x00007fffe4a5bcbb in raise () from /lib64/libc.so.6
[104] #8  0x00007fffe4a5d355 in abort () from /lib64/libc.so.6
[104] #9  0x00007fffdbaff5b9 in __gnu_cxx::__verbose_terminate_handler () at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/vterminate.cc:95
[104] #10 0x00007fffdbb0abea in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/eh_terminate.cc:48
[104] #11 0x00007fffdbb0ac55 in std::terminate () at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/eh_terminate.cc:58
[104] #12 0x00007fffdbb0aea7 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7fffdbe6ce58 <typeinfo for std::bad_alloc>, dest=0x7fffdbb09260 <std::bad_alloc::~bad_alloc()>) at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/eh_throw.cc:98
[104] #13 0x00007fffdbaff1c4 in operator new (sz=sz@entry=560854725137600) at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/new_op.cc:54
[104] #14 0x00007fffe183b9ba in __gnu_cxx::new_allocator<unsigned int>::allocate (this=0x7f77a81c5f68, __n=140213681284400) at /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/ext/new_allocator.h:111
[104] #15 std::allocator_traits<std::allocator<unsigned int> >::allocate (__a=..., __n=140213681284400) at /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/bits/alloc_traits.h:436
[104] #16 std::_Vector_base<unsigned int, std::allocator<unsigned int> >::_M_allocate (this=0x7f77a81c5f68, __n=140213681284400) at /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/bits/stl_vector.h:172
[104] #17 std::vector<unsigned int, std::allocator<unsigned int> >::_M_default_append (this=0x7f77a81c5f68, __n=140213681284400) at /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/bits/vector.tcc:571
[104] #18 0x00007fffe1c7415b in std::vector<unsigned int, std::allocator<unsigned int> >::resize (this=0x7f77a81c5f68, __new_size=140213681284400) at /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/bits/stl_vector.h:692
[104] #19 Legion::Internal::ProjectionTreeExchange::unpack_collective_stage (this=<optimized out>, derez=..., stage=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:16609
[104] #20 0x00007fffe1c77c55 in Legion::Internal::AllGatherCollective<false>::unpack_stage (this=0x7fb711ec8a40, stage=stage@entry=0, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12490
[104] #21 0x00007fffe1c77b13 in Legion::Internal::AllGatherCollective<false>::handle_collective_message (this=0x7fb711ec8a40, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12232
[104] #22 0x00007fffe1e84108 in Legion::Internal::VirtualChannel::handle_messages (this=this@entry=0x7ffa802e6160, num_messages=num_messages@entry=1, runtime=0x7f926dfff4b0, runtime@entry=0x8cf8e40, remote_address_space=<optimized out>, remote_address_space@entry=111, args=0x7f76c40e53f0 "", args@entry=0x7f76c40e53dc "\020\001", arglen=<optimized out>, arglen@entry=48) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13433
[104] #23 0x00007fffe1e8385f in Legion::Internal::VirtualChannel::process_message (this=0x7ffa802e6160, args=args@entry=0x0, arglen=48, arglen@entry=147820096, runtime=0x8cf8e40, remote_address_space=111) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:11795
[104] #24 0x00007fffe1ec0760 in Legion::Internal::MessageManager::receive_message (this=0x0, args=<optimized out>, arglen=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13581
[104] #25 0x00007fffe1ed7d4e in Legion::Internal::Runtime::legion_runtime_task (args=<optimized out>, arglen=64, userdata=<optimized out>, userlen=<optimized out>, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:32159
[104] #26 0x00007fffe03c982d in Realm::LocalTaskProcessor::execute_task (this=0x72c1580, func_id=4, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:1176
[104] #27 0x00007fffe0409a2c in Realm::Task::execute_on_processor (this=0x7f77a83d0240, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:326
[104] #28 0x00007fffe0410043 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=0x2) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1687
[104] #29 0x00007fffe040d32f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x751d600) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1160
[104] #30 0x00007fffe041849d in Realm::UserThread::uthread_entry () at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:1405
[104] #31 0x00007fffe4a72600 in ?? () from /lib64/libc.so.6
[104] #32 0x0000000000000000 in ?? ()

That looks like a control replication violation. Run with -lg:safe_ctrlrepl 1.

If the safe control replication checks don't reveal anything then make a reproducer on sapling.

That error looks correct to me. You are specifying a different task to launch on every shard.

Shouldn't you launch get_timestep_info(lp_q_rank[{0, 0, 0}])? That will run one point task, and then Legion will broadcast the result.
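
Roughly, something like this (an untested sketch; the original per-shard call isn't shown in this thread, so the "before" line and my_shard_point are illustrative):

-- before (violation): each shard names a different point task
-- get_timestep_info(lp_q_rank[my_shard_point])
-- after: every shard launches the same single point task and Legion broadcasts the result
get_timestep_info(lp_q_rank[{0, 0, 0}])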

I had originally tried get_timestep_info(lp_q_rank[{0, 0, 0}]) back when I first moved the MPI tasks to Regent, and from what I can remember the performance was significantly worse because it was doing a copy from node 0 to every other node.

I have some other ideas I can try though.

I had originally tried get_timestep_info(lp_q_rank[{0, 0, 0}]) back when I first moved the MPI tasks to Regent, and from what I can remember the performance was significantly worse because it was doing a copy from node 0 to every other node.

In that case you should be using the functionality in the mapper to replicate that individual task launch on every node and produce multiple copies of the data in different places. Replication is replication regardless of whether it is control replication or not. The mapper interface is fully general now when it comes to replicating tasks.

We have to fix the control replication violation though. That could be responsible for any number of crashes and hangs. Once you fix it we should try it again both with and without tracing.

I am testing a fix now.

That will run one point task, and then Legion will broadcast the result.

If you're broadcasting regions then collective broadcast copy logic should kick in. If this is a future you should try using the local function tasks.

I'm pretty sure this is logic that amounts to:

cur_time += dt

There really is no reason to launch a task around this, except that we have to dig the logic out of Fortran to figure out what it is. But once we know the logic there is no reason to use fancy mechanisms to compute this.
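
For instance, once the logic is dug out of Fortran, a sketch of doing it directly in the (replicated) top-level task could be as simple as the following, with num_steps and dt standing in for the real quantities:

-- every shard computes the same value locally; no task launch, future,
-- or broadcast is needed, and it is trivially control replication safe
var cur_time = 0.0
for step = 0, num_steps do
  -- ... launch the real per-timestep work here ...
  cur_time += dt
end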

Seems like a really good use case for local function tasks (which are control replication safe).

The control replication violation is fixed. I will try another run on Frontier.

Seems like profiling is working on Frontier now at 256 nodes as well. Will continue to try scaling.

Seems like profiling is working on Frontier now at 256 nodes as well. Will continue to try scaling.

Is that both with and without tracing?

With tracing. I will try without. It looks like there are issues at 512 nodes though.

What kind of issues at 512 nodes? With tracing or without tracing?

I don't see 512 making progress after 10 timesteps with profiling on or off.

Turning off tracing at 256 nodes I see a crash after 10 timesteps:

[198] Thread 6 (Thread 0x7fff9247ef80 (LWP 51226) "s3d.x"):
[198] #0  0x00007fffe4af274f in wait4 () from /lib64/libc.so.6
[198] #1  0x00007fffe4a69ba7 in do_system () from /lib64/libc.so.6
[198] #2  0x00007fffe0d25d16 in gasneti_system_redirected () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[198] #3  0x00007fffe0d256bb in gasneti_bt_gdb () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[198] #4  0x00007fffe0d1beef in gasneti_print_backtrace () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[198] #5  0x00007fffe0e3934a in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[198] #6  <signal handler called>
[198] #7  Legion::Deserializer::deserialize<unsigned int> (this=0x7fb71efff920, element=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_utilities.h:1119
[198] #8  Legion::Internal::ProjectionTreeExchange::unpack_collective_stage (this=<optimized out>, derez=..., stage=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:16621
[198] #9  0x00007fffe1c7c1a5 in Legion::Internal::AllGatherCollective<false>::unpack_stage (this=0x7f76e025d120, stage=stage@entry=0, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12500
[198] #10 0x00007fffe1c7c063 in Legion::Internal::AllGatherCollective<false>::handle_collective_message (this=0x7f76e025d120, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12242
[198] #11 0x00007fffe1e88bc8 in Legion::Internal::VirtualChannel::handle_messages (this=this@entry=0x7ffbc53f0d90, num_messages=num_messages@entry=1, runtime=0x7f767ebf0a70, runtime@entry=0x8ceb7e0, remote_address_space=<optimized out>, remote_address_space@entry=195, args=0x7f767ebf0a70 "", args@entry=0x7f767ebf0a5c "\020\001", arglen=<optimized out>, arglen@entry=48) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13427
[198] #12 0x00007fffe1e8831f in Legion::Internal::VirtualChannel::process_message (this=0x7ffbc53f0d90, args=args@entry=0x0, arglen=48, arglen@entry=147765216, runtime=0x8ceb7e0, remote_address_space=195) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:11789
[198] #13 0x00007fffe1ec5240 in Legion::Internal::MessageManager::receive_message (this=0x410590 <wallpressbc+6112>, args=<optimized out>, arglen=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13575
[198] #14 0x00007fffe1edc82e in Legion::Internal::Runtime::legion_runtime_task (args=<optimized out>, arglen=64, userdata=<optimized out>, userlen=<optimized out>, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:32153
[198] #15 0x00007fffe03cec3d in Realm::LocalTaskProcessor::execute_task (this=0x7513120, func_id=4, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:1176
[198] #16 0x00007fffe040ef7c in Realm::Task::execute_on_processor (this=0x7f7548acad80, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:326
[198] #17 0x00007fffe0415593 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=0x7f7547fff010) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1687
[198] #18 0x00007fffe041287f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x72016e0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1160
[198] #19 0x00007fffe041d9ed in Realm::UserThread::uthread_entry () at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:1405
[198] #20 0x00007fffe4a72600 in ?? () from /lib64/libc.so.6
[198] #21 0x0000000000000000 in ?? ()

The backtrace still looks like a control replication violation. Are you running with the checks when you get that assertion?

Get backtraces from all the nodes with -lg:inorder -ll:force_kthreads with tracing off on 512 nodes. Make sure that they are not changing.

I ran with -lg:safe_ctrlrepl 1 and it did not give any errors like it did before. I will do a run on 512 nodes with -lg:inorder -ll:force_kthreads and tracing off.

512 nodes worked with -lg:inorder -ll:force_kthreads and profiling turned on without tracing.

Ok, and what happens if you run without -lg:inorder?

Please capture backtraces if it does hang.

It crashed when running with just -lg:inorder. Here is a stacktrace:

#0  0x00007fffe4aeccc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007fffe4af29c3 in nanosleep () from /lib64/libc.so.6
#2  0x00007fffe4af28da in sleep () from /lib64/libc.so.6
#3  0x00007fffe0d266d3 in gasneti_freezeForDebuggerNow () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
#4  0x00007fffe0e3933c in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
#5  <signal handler called>
#6  Legion::Deserializer::deserialize<long long> (this=0x7faad50c53a0, element=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_utilities.h:1119
#7  Legion::Internal::ProjectionNode::IntervalTree::deserialize (this=0x7f8630cfa998, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_analysis.cc:2400
#8  0x00007fffe1c78489 in Legion::Internal::ProjectionTreeExchange::unpack_collective_stage (this=<optimized out>, derez=..., stage=<optimized out>)
    at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:16612
#9  0x00007fffe1c7c1a5 in Legion::Internal::AllGatherCollective<false>::unpack_stage (this=0x7f85a1a974a0, stage=stage@entry=0, derez=...)
    at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12500
#10 0x00007fffe1c7c063 in Legion::Internal::AllGatherCollective<false>::handle_collective_message (this=0x7f85a1a974a0, derez=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_replication.cc:12242
#11 0x00007fffe1e88bc8 in Legion::Internal::VirtualChannel::handle_messages (this=this@entry=0x7ffbc5089800, num_messages=num_messages@entry=1, runtime=0x0, runtime@entry=0xa83eab0, remote_address_space=<optimized out>, 
    remote_address_space@entry=19, args=0x7f74dded8af0 "", args@entry=0x7f74dded8adc "\020\001", arglen=<optimized out>, arglen@entry=48) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13427
#12 0x00007fffe1e8831f in Legion::Internal::VirtualChannel::process_message (this=0x7ffbc5089800, args=args@entry=0x0, arglen=48, arglen@entry=176417456, runtime=0xa83eab0, remote_address_space=19)
    at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:11789
#13 0x00007fffe1ec5240 in Legion::Internal::MessageManager::receive_message (this=0x7f74dded8af0, args=<optimized out>, arglen=<optimized out>)
    at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:13575
#14 0x00007fffe1edc82e in Legion::Internal::Runtime::legion_runtime_task (args=<optimized out>, arglen=64, userdata=<optimized out>, userlen=<optimized out>, p=...)
    at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:32153
#15 0x00007fffe03cec3d in Realm::LocalTaskProcessor::execute_task (this=0x6f14880, func_id=4, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:1176
#16 0x00007fffe040ef7c in Realm::Task::execute_on_processor (this=0x7f74ddedf020, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:326
#17 0x00007fffe0415593 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=0x7f8630cfa998) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1687
#18 0x00007fffe041287f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x79e1480) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1160
#19 0x00007fffe041d9ed in Realm::UserThread::uthread_entry () at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:1405
#20 0x00007fffe4a72600 in ?? () from /lib64/libc.so.6
#21 0x0000000000000000 in ?? ()

Here are stack traces from every rank: http://sapling2.stanford.edu/~seshu/s3d_subrank/dump/

I also saved a core file from the rank that I get that stack trace above from.

Rebuild with -DDEBUG_LEGION_COLLECTIVES and try again.

With -DDEBUG_LEGION_COLLECTIVES it crashes at startup. Here's what I see:

[326] Thread 5 (Thread 0x7ffe4b7d7f80 (LWP 85312) "s3d.x"):                                                                           
[326] #0  0x00007fffe4af274f in wait4 () from /lib64/libc.so.6                                                                        
[326] #1  0x00007fffe4a69ba7 in do_system () from /lib64/libc.so.6                                                                    
[326] #2  0x00007fffe0d21d16 in gasneti_system_redirected () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[326] #3  0x00007fffe0d216bb in gasneti_bt_gdb () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[326] #4  0x00007fffe0d17eef in gasneti_print_backtrace () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[326] #5  0x00007fffe0e3534a in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[326] #6  <signal handler called> 
[326] #7  0x00007fffdea56c0e in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #8  0x00007fffdea5b9c0 in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #9  0x00007fffdea5d9e3 in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #10 0x00007fffdea36bd6 in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #11 0x00007fffde9125ff in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #12 0x00007fffde912780 in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #13 0x00007fffde912898 in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #14 0x00007fffde955c52 in hipMemcpy2DAsync () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[326] #15 0x00007fffe04387ad in Realm::Hip::GPUXferDes::progress_xd (this=<optimized out>, this@entry=0x7fc344162a00, channel=<optimized out>, work_until=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/hip/hip_internal.cc:346
[326] #16 0x00007fffe043e4bd in Realm::XDQueue<Realm::Hip::GPUChannel, Realm::Hip::GPUXferDes>::do_work (this=0x6f888a8, work_until=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/transfer/channel.inl:157
[326] #17 0x00007fffe032b138 in Realm::BackgroundWorkManager::Worker::do_work (this=this@entry=0x7ffe4b7d5c58, max_time_in_ns=<optimized out>, max_time_in_ns@entry=-1, interrupt_flag=interrupt_flag@entry=0x0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/bgwork.cc:599 
[326] #18 0x00007fffe032aae8 in Realm::BackgroundWorkThread::main_loop (this=0x7572600) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/bgwork.cc:103
[326] #19 0x00007fffe0417cce in Realm::KernelThread::pthread_entry (data=0x7d71da0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:831
[326] #20 0x00007fffe88936ea in start_thread () from /lib64/libpthread.so.0
[326] #21 0x00007fffe4b28a6f in clone () from /lib64/libc.so.6

Does that happen repeatedly? It either suggests an issue with Realm or with HIP.

Does not appear to be repeatable. The second run did start up, but it looks like it's either progressing very slowly or it's hung after 1 timestep. I will let it run for a little longer and dump stack traces over time so we can see.

The run with -DDEBUG_LEGION and tracing off is crashing at 256 nodes with this error:

[56 - 7fff96237f80]   16.819969 {5}{runtime}: [error 93] LEGION ERROR: Unable to find entry for color 9 in index space 200. (from file /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/region_tree.cc:8057)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_93

It looks like this happens at smaller node counts too, so I'd guess you'd want a reproducer on sapling for this?

Yes, if possible. Let's start with a backtrace. Does it happen at start-up or during the same phase of the execution?

It is at start up. The smallest node count it seems to happen at is 8 nodes. 1, 2, and 4 nodes work, so it may not fit on sapling, but I can try.

Here is a stack trace:

[6] Thread 9 (Thread 0x7fff92447f80 (LWP 120137) "s3d.x"):
[6] #0  0x00007fffe4af274f in wait4 () from /lib64/libc.so.6
[6] #1  0x00007fffe4a69ba7 in do_system () from /lib64/libc.so.6
[6] #2  0x00007fffe0af5d16 in gasneti_system_redirected () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[6] #3  0x00007fffe0af56bb in gasneti_bt_gdb () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[6] #4  0x00007fffe0aebeef in gasneti_print_backtrace () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[6] #5  0x00007fffe0c0934a in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//legion/language/build/lib/librealm.so.1
[6] #6  <signal handler called>
[6] #7  0x00007fffe4a5bcbb in raise () from /lib64/libc.so.6
[6] #8  0x00007fffe4a5d355 in abort () from /lib64/libc.so.6
[6] #9  0x00007fffe1d91ea3 in Legion::Internal::Runtime::report_error_message (id=<optimized out>, file_name=<optimized out>, line=<optimized out>, message=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:31930
[6] #10 0x00007fffe1d17508 in Legion::Internal::IndexSpaceNode::get_child (this=0x7ffbc4300b40, c=c@entry=9, defer=defer@entry=0x0, can_fail=false) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/region_tree.cc:8055
[6] #11 0x00007fffe1d17d21 in Legion::Internal::RegionTreeForest::get_index_partition (this=<optimized out>, parent=..., color=9) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/region_tree.cc:923
[6] #12 0x00007fffe1e077a5 in Legion::Internal::Runtime::get_index_partition (this=<optimized out>, parent=..., color=9) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:18057
[6] #13 0x00007fffe2a70e13 in legion_terra_index_cross_product_get_subpartition_by_color_domain_point (runtime_=..., prod=..., color_=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/bindings/regent/regent_partitions.cc:645
[6] #14 0x00007fffe496fbf6 in $<main> () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//build/hept/libregent_tasks.so
[6] #15 0x00007fffe491399d in $__regent_task_main_primary () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//build/hept/libregent_tasks.so
[6] #16 0x00007fffe019ec3d in Realm::LocalTaskProcessor::execute_task (this=0x6d1f310, func_id=1390, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:1176
[6] #17 0x00007fffe01def7c in Realm::Task::execute_on_processor (this=0x7ffbc4009a90, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:326
[6] #18 0x00007fffe01e5593 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=0x2) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1687
[6] #19 0x00007fffe01e287f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x6817e40) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1160
[6] #20 0x00007fffe01ed9ed in Realm::UserThread::uthread_entry () at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:1405
[6] #21 0x00007fffe4a72600 in ?? () from /lib64/libc.so.6
[6] #22 0x0000000000000000 in ?? ()

This seems suspicious to me, maybe I should poke around a bit first.

It does seem a bit suspicious. I wouldn't be surprised if the debug modes are perturbing the timing and exercising the bug. Unfortunately I suspect it's going to be unrelated to the hang later in the program.

I am trying to set up a reproducer on sapling but I'm having an issue with the mapper. I'm seeing this:

Process 579578 on node c0001.stanford.edu is frozen!
[9 - 7f8814308c40]    5.841243 {5}{runtime}: [error 67] LEGION ERROR: Mapper RHST Mapper for Processor 1d00090000000001 provided invalid target_processors from call to 'map_task' for replicated task main (UID 25). Replicated tasks are only permitted to have one target processor and it must be exactly 'input.shard_procesor' as that is where this replicated copy of the task has been assigned to run by this same mapper. (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_tasks.cc:8035)
For more information see:                                  
http://legion.stanford.edu/messages/error_code.html#error_code_67

It does not make sense to me because I'm doing this in map_task:

    if(task.task_id == REGENT_MAIN_TASK_ID)
    {
      output.target_procs.push_back(input.shard_processor);
    }
    else
    {
      output.target_procs = local_cpus;
    }

Don't push back, the vector has already been prepopulated for you.

assert(output.target_procs.size() == 1);
assert(output.target_procs.back() == input.shard_processor);

I do not seem to be able to reproduce the missing entry error on sapling. All the runs I tried worked.

I tried turning on spy on Frontier so I could see what index space 200 is but the run that was crashing before started working with spy.

Crashing somewhere different?

No crash at 8 nodes on Frontier with spy, but it crashes at 256 nodes. I do not see what index space 200 is in the spy logs though.

Where are the spy logs?

Are you sure you are on the most recent control replication commit? You should no longer be getting these warnings with most recent master or control replication branch:

[0 - 7fffcb32ff80]    0.009328 {4}{realm}: Unable to send shared memory information to node 1, skipping
[0 - 7fffcb32ff80]    0.009379 {4}{realm}: Unable to send shared memory information to node 2, skipping
[0 - 7fffcb32ff80]    0.009384 {4}{realm}: Unable to send shared memory information to node 3, skipping
[0 - 7fffcb32ff80]    0.009388 {4}{realm}: Unable to send shared memory information to node 4, skipping
[0 - 7fffcb32ff80]    0.009391 {4}{realm}: Unable to send shared memory information to node 5, skipping
[0 - 7fffcb32ff80]    0.009395 {4}{realm}: Unable to send shared memory information to node 6, skipping
[0 - 7fffcb32ff80]    0.009399 {4}{realm}: Unable to send shared memory information to node 7, skipping
[0 - 7fffcb32ff80]    0.009402 {4}{realm}: Unable to send shared memory information to node 8, skipping
[0 - 7fffcb32ff80]    0.009405 {4}{realm}: Unable to send shared memory information to node 9, skipping
...

Index Space 200 does not appear to have an application provided name. It is a 3-D index space with bound (0,0,0) - (63,63,1023).

Looks like index space 200 is a subspace of index partition 100 called lp_int_rank which is a partition of index space 100 called is_grid.

Do you get the error when you run with -lg:inorder?

It works with -lg:inorder. Does this suggest I need to put in a fence somewhere?

What is the pattern of partition calls that leads to the failure? I think in general we should be getting away from using fences. I think Legion in general should be handling things. The only place where we don't help you is if you create partitions in sub-tasks and then propagate the names back up the task tree because we can't see those dependences in the runtime and so you have to synchronize those yourself. Do you think that is happening here or are all your partitions being made in the top-level task?

The stencil partition is made in a leaf task. The rest of partitions are either made in the top-level task directly or inlined by Regent into the top-level task. There is actually an execution fence at the end of the partitioning calls.

The stencil partition is made in a leaf task. The rest of partitions are either made in the top-level task directly or inlined by Regent into the top-level task.

Ok, so is this trying to get the stencil partition in the top-level task or are we getting a different partition in the top-level task?

There is actually an execution fence at the end of the partitioning calls.

Do you actually wait on the future from that execution fence in the top-level task before doing this call to get the child partition? It looks like the call is coming from the top-level task.

The stencil partition is never explicitly used in the top-level task and the leaf task that makes the partition never returns it or saves it. It should only be used in the C++ projection functor through get_logical_partition_by_color.

I do not explicitly wait on the future from the execution fence, I assumed Regent did that automatically when you do: __fence(__execution)

I will try minimizing S3D to just the partitioning and see what happens.

A fence like __fence(__execution) is not blocking. If you want to block you need to say so, like __fence(__execution, __block).
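
For example, a sketch of the difference at the end of the partitioning calls in the top-level task:

-- returns immediately; only orders later operations behind the fence
__fence(__execution)
-- actually makes the top-level task wait at this point
__fence(__execution, __block)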

The stencil partition is never explicitly used in the top-level task and the leaf task that makes the partition never returns it or saves it. It should only be used in the C++ projection functor through get_logical_partition_by_color.

Ok, so what is making this partition that the top-level task is trying to get?

lp_int_rank is just made by a restrict call:
var lp_int_rank = restrict(disjoint, complete, lr_int, t, e, is_rank)

I changed the fence to a blocking fence and it seems like the error went away; doing some more runs now to be sure.

lp_int_rank is not the problem. The problem is that the top-level task is trying to get a partition of a sub-region of lp_int_rank. We need to know what made that partition of the subregion of lp_int_rank.

I guess below that there is lp_int_subrank. That one does have a future that we should be blocking on though? It looks like that is actually made by a leaf task that is not being inlined:

https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/init_tasks.rg?ref_type=heads#L464

https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/init_tasks.rg?ref_type=heads#L426

That sounds more like the problem. If you launch a sub-task to create a partition and you need to name that partition in the parent task then you have to synchronize with that leaf task because there's no way for the runtime to see the dependence of that being passed up the task tree otherwise.

Why would it block there? How does that relate to this line in the backtrace?

[6] #13 0x00007fffe2a70e13 in legion_terra_index_cross_product_get_subpartition_by_color_domain_point (runtime_=..., prod=..., color_=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/bindings/regent/regent_partitions.cc:645

I see the problem. It's here:

https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/init_tasks.rg?ref_type=heads#L464

This task is launched in a loop. Only the last future is saved. Therefore, when we block (at the line I linked last time) it will only block on the last future. Any other subtasks will not be synchronized.

One hack would be to use min= instead of regular assignment on subrank_color. (Note: I chose min because legion_color_t is unsigned, so -1 is really the maximum value.) It's a useless reduction because all the values are the same, but it achieves the purpose of creating a true data flow in the program, and Regent will then block in the correct place to ensure the right synchronization.

TL;DR: if you don't have true data flows in your program, Legion/Regent can do strange things.

Is there a reason that loop isn't turned into an index space task launch and then you can wait on the future map?

You can't index launch if you're doing an unconditional assignment.

for i in ... do
  x = f(...) -- can't turn this into an index launch
end

You have to choose some sort of a reduction operator to make this valid. Because this code just happens to return the same value out of every task (or should), two easy operators are min and max (and then you just need to choose an appropriate initial value to make sure you get what you want).
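
Here is a minimal sketch of that pattern; make_color is a placeholder standing in for the real partition-creating task, and the loop bounds are arbitrary:

import "regent"

-- placeholder task: every point returns the same color, as the real task should
task make_color(i : int) : uint64
  return 0
end

task toplevel()
  -- legion_color_t is unsigned, so casting -1 gives the maximum value,
  -- which is a safe initial value for a min reduction
  var subrank_color = uint64(-1)
  __demand(__index_launch)
  for i = 0, 16 do
    subrank_color min= make_color(i)  -- reduction instead of plain assignment
  end
  -- reading subrank_color forces Regent to wait on all of the futures,
  -- which creates the true data flow described above
  regentlib.c.printf("color = %llu\n", subrank_color)
end

regentlib.start(toplevel)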

Ok, hopefully @syamajala saw your comment since he told me in a meeting he did turn it into an index launch (not sure how). I guess Regent doesn't have a future map concept that you can then wait on to make sure all the futures are ready.

That was what I did. I used min=.

I needed to implement premap_task in the mapper for this change. It worked at 1 node, but at 8 nodes it hit a HIP HSA_STATUS_ERROR_OUT_OF_RESOURCES after 2 timesteps. At 256 and 512 nodes it looks like it hung at startup. I will try to dump stack traces from the 256 node run since it's still running.

I see some threads stuck here:

Thread 23 (Thread 0x7ffd3fffef80 (LWP 30940) "s3d.x"):
#0  0x00007fffe4b21ea9 in syscall () from /lib64/libc.so.6
#1  0x00007fffdfe563f7 in Realm::Doorbell::wait_slow (this=0x7ffd3fffe200) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/mutex.cc:265
#2  0x00007fffdfded1aa in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x6d21760, switch_to=0x7ffa84108150) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1469
#3  0x00007fffdfdea5fa in Realm::ThreadedTaskScheduler::thread_blocking (this=0x6d21760, thread=0x7ff6c0000dc0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:951
#4  0x00007fffdfd1b381 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x7ffd3fffaf3f: false) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.inl:218
#5  0x00007fffdfd0c560 in Realm::Event::wait_faultaware (this=0x7ffd3fffb210, poisoned=@0x7ffd3fffaf3f: false) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/event_impl.cc:242
#6  0x00007fffdfd0c04c in Realm::Event::wait (this=0x7ffd3fffb210) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/event_impl.cc:194
#7  0x00007fffe2a76d20 in Legion::Internal::LgEvent::wait (this=0x7ffd3fffb210) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_types.h:3284
#8  0x00007fffe19646fd in Legion::Internal::RegionTreeNode::attach_semantic_information (this=0x7ffa8415f020, tag=<optimized out>, tag@entry=0, source=source@entry=253, buffer=buffer@entry=0x7fffe481810e, size=size@entry=11, is_mutable=false, local_only=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/region_tree.cc:15861
#9  0x00007fffe1964874 in Legion::Internal::RegionTreeForest::attach_semantic_information (this=0x8d693b0, handle=..., tag=0, source=4, buffer=0x7fffe481810e, size=0, is_mutable=false, local_only=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/region_tree.cc:5289
#10 0x00007fffe1a3dc58 in Legion::Internal::Runtime::attach_semantic_information (this=0x8ce3990, handle=..., tag=0, buffer=0x7fffe481810e, size=11, is_mutable=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:20247
#11 0x00007fffe14eb5fe in legion_logical_partition_attach_name (runtime_=..., handle_=..., name=0x0, is_mutable=false) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_c.cc:2476
#12 0x00007fffe48f1392 in $<init_part> () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//build/hept/libregent_tasks.so
#13 0x00007fffe48f0b63 in $__regent_task_init_part_primary () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//build/hept/libregent_tasks.so
#14 0x00007fffdfda7c3d in Realm::LocalTaskProcessor::execute_task (this=0x698ec40, func_id=39685, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:1176
#15 0x00007fffdfde7f7c in Realm::Task::execute_on_processor (this=0x7fcfb82ce360, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:326
#16 0x00007fffdfdece23 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=0x7ffd3fffe200) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1421
#17 0x00007fffdfdeb87f in Realm::ThreadedTaskScheduler::scheduler_loop (this=this@entry=0x6d21760) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1160
#18 0x00007fffdfdf08b7 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x6d21760) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1272
#19 Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x6d21760) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.inl:97
#20 0x00007fffdfdf4cce in Realm::KernelThread::pthread_entry (data=0x7ff6c0000dc0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:831
#21 0x00007fffe88936ea in start_thread () from /lib64/libpthread.so.0
#22 0x00007fffe4b28a6f in clone () from /lib64/libc.so.6

and some here:

Thread 6 (Thread 0x7ffd3dff4f80 (LWP 94074) "s3d.x"):
#0  0x00007fffe4b21ea9 in syscall () from /lib64/libc.so.6
#1  0x00007fffdfe563f7 in Realm::Doorbell::wait_slow (this=0x7ffd3dff4200) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/mutex.cc:265
#2  0x00007fffdfded1aa in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x6d21760, switch_to=0x7fe838000dc0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1469
#3  0x00007fffdfdea5fa in Realm::ThreadedTaskScheduler::thread_blocking (this=0x6d21760, thread=0x8cb6e10) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:951
#4  0x00007fffdfd1b381 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x7ffd3df9a7ff: false) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.inl:218
#5  0x00007fffdfd0c560 in Realm::Event::wait_faultaware (this=0x7ffd3df9a800, poisoned=@0x7ffd3df9a7ff: false) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/event_impl.cc:242
#6  0x00007fffe16154a3 in Legion::Internal::LgEvent::wait_faultaware (this=0x7ffd3df9a800, poisoned=@0x7ffd3df9a7ff: false, from_app=true) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_types.h:3365
#7  0x00007fffe19bf311 in Legion::Internal::ApEvent::wait_faultaware (this=0x7ffd3df9a800, poisoned=@0x7ffd3df9a7ff: false) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_types.h:2927
#8  Legion::Internal::FutureImpl::wait (this=0x7fe8381507c0, silence_warnings=<optimized out>, warning_string=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:982
#9  0x00007fffe19bc0f6 in Legion::Internal::FutureImpl::get_buffer (this=0x7fe8381507c0, memory=..., extent_in_bytes=0x7ffd3df9d8c0, check_extent=false, silence_warnings=false, warning_string=0x4 <error: Cannot access memory at address 0x4>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/runtime.cc:1029
#10 0x00007fffe163a5dd in Legion::Future::get_buffer (this=<optimized out>, memory=Realm::Memory::SYSTEM_MEM, extent_in_bytes=0x7ffd3df9d8c0, check_size=false, silence_warnings=false, warning_string=0x0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion.cc:2449
#11 0x00007fffe14ecd0e in Legion::Future::get_untyped_pointer (this=0x7ffd3dff4200, silence_warnings=false, warning_string=0x0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion.inl:19442
#12 legion_future_get_untyped_pointer (handle_=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/legion/legion_c.cc:3234
#13 0x00007fffe493c767 in $<main> () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//build/hept/libregent_tasks.so
#14 0x00007fffe491405d in $__regent_task_main_primary () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank//build/hept/libregent_tasks.so
#15 0x00007fffdfda7c3d in Realm::LocalTaskProcessor::execute_task (this=0x698ec40, func_id=44807, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/proc_impl.cc:1176
#16 0x00007fffdfde7f7c in Realm::Task::execute_on_processor (this=0x7ffa80001170, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:326
#17 0x00007fffdfdece23 in Realm::KernelThreadTaskScheduler::execute_task (this=<optimized out>, task=0x7ffd3dff4200) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1421
#18 0x00007fffdfdeb87f in Realm::ThreadedTaskScheduler::scheduler_loop (this=this@entry=0x6d21760) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1160
#19 0x00007fffdfdf08b7 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x6d21760) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/tasks.cc:1272
#20 Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x6d21760) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.inl:97
#21 0x00007fffdfdf4cce in Realm::KernelThread::pthread_entry (data=0x8cb6e10) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_subrank/legion/runtime/realm/threads.cc:831
#22 0x00007fffe88936ea in start_thread () from /lib64/libpthread.so.0
#23 0x00007fffe4b28a6f in clone () from /lib64/libc.so.6

There are some stack traces here: http://sapling2.stanford.edu/~seshu/s3d_subrank/startup/

What happens if you run with -lg:inorder? I'm going to be honest, that looks like the network is getting stuck. The attach semantic information API calls are especially telling, because they are fire-and-forget active messages and don't need to synchronize with anything. Pretty much the only reason that they can hang is that the network is stuck.

Doesn't seem like it changed anything; I still see stack traces with attach_semantic_information in them.

I went back to the old version and just changed this fence to a blocking fence: https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/s3d.rg?ref_type=heads#L2224

It seems like that also fixes the missing color error and it's running at smaller node counts. I will submit 256 nodes. Is that fix good enough?

256 nodes ran to completion: no tracing, but with -ll:force_kthreads and -DDEBUG_LEGION. Seems like we're not hitting the projection tree exchange crash anymore? Should I turn on tracing?

Tried turning on tracing again; don't seem to make it past 10 timesteps at 256 nodes. I dumped some stack traces over time here: http://sapling2.stanford.edu/~seshu/s3d_subrank/tracing/

I went back to the old version

What is the old version?

256 nodes ran to completion: no tracing, but with -ll:force_kthreads and -DDEBUG_LEGION. Seems like we're not hitting the projection tree exchange crash anymore?

I'm presuming this is with -DDEBUG_LEGION_COLLECTIVES also? Can we try turning off the debug collectives and the force kthreads and see if that is stable? Make sure it gets through at least a few times like that.

Tried turning on tracing again; don't seem to make it past 10 timesteps at 256 nodes. I dumped some stack traces over time here:

There are some interesting backtraces here to investigate from -DDEBUG_LEGION_COLLECTIVES. Will report back with what I find.

The old version as in the old partitioning code:
https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/init_tasks.rg?ref_type=heads#L464

I just put a blocking fence in the main task.

These runs were just with -DDEBUG_LEGION. I will do one with -DDEBUG_LEGION_COLLECTIVES and see if I can get some stack traces.

Huh, ok, I guess the backtraces didn't have debug collective on. I'm still investigating them, there's certainly a pattern to them, but not sure what it is yet...

See if you can get it to hang with -DDEBUG_LEGION_COLLECTIVES. It's definitely looking to me like the network is stuck and messages are not being received on some nodes. Everything is roughly stuck in the same place where Legion has issued a bunch of independent asynchronous collective operations that will almost certainly flood the network in different orders on different nodes. All those messages should be able to make independent forward progress, but it seems like that is not happening.

The other thing you can try is running with -lg:unsafe_mapper which should cut down on some of the parallel collective communications for checking the mapper output.

I will submit some runs this evening.

Also, I backed up the build of S3D that I scaled to 8192 nodes to tape. I will retrieve it and see if it still works. That would at least tell us if it's the network or not, right?

That would at least tell us if it's the network or not, right?

If it also hung that would probably suggest something has changed in the system that is bad for us. Unfortunately the converse is not necessarily true, because this version of Legion has a different communication profile than the previous one (in particular some of the backtraces are suggestive of the new parts of Legion's logical dependence analysis that changed with the merge of shardrefine). So it is possible that the system is the same, and we're just pushing on it now in a way that is even worse than before and exercising the issues in the network again.

The logs from -DDEBUG_LEGION_COLLECTIVES with -lg:unsafe_mapper are here: http://sapling2.stanford.edu/~seshu/s3d_subrank/debug_collective/

These don't look hung at all. There are still a bunch of differences between dump2 and dump3. It doesn't look to me like this is actually hanging. How far apart were the dumps?

It could be that it was very slow. There were runs at 256 nodes before where it was taking 10+ minutes to do 10 timesteps. I think those dumps were a few minutes apart at least. I resubmitted a run and will let it run longer and see what happens. It looks like Frontier is busy today though.

Also, I forgot: I looked at the changelog for Frontier. It says the last time they updated the network stack was in September. I did my scaling runs in November. So we'll see if the old build from November still works.