
`finder != operations.end()` assertion failure


Running on 4 nodes, in debug mode, with GPUs. This regression was introduced by https://gitlab.com/StanfordLegion/legion/-/commit/12d5a56fe5b07975c2f1d70b4df156fb9c684949

prometeo_ConstPropMix.exec: /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/legion/legion_trace.cc:10301: virtual void Legion::Internal::IssueCopy::execute(std::vector<Legion::Internal::ApEvent>&, std::map<unsigned int, Legion::Internal::ApUserEvent>&, std::map<Legion::Internal::ContextCoordinate, Legion::Internal::MemoizableOp*>&, bool): Assertion `finder != operations.end()' failed.

backtrace:

#0  0x00007fd8b7cc89fd in nanosleep () from /lib64/libc.so.6
#1  0x00007fd8b7cc8894 in sleep () from /lib64/libc.so.6
#2  0x00007fd8bb3f9086 in Realm::realm_freeze (signal=6) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/runtime_impl.cc:206
#3  <signal handler called>
#4  0x00007fd8b7c39387 in raise () from /lib64/libc.so.6
#5  0x00007fd8b7c3aa78 in abort () from /lib64/libc.so.6
#6  0x00007fd8b7c321a6 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007fd8b7c32252 in __assert_fail () from /lib64/libc.so.6
#8  0x00007fd8ba87bc09 in Legion::Internal::IssueCopy::execute (this=0x7fd824146010, events=..., user_events=..., operations=..., recurrent_replay=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/legion/legion_trace.cc:10301
#9  0x00007fd8ba8610a6 in Legion::Internal::PhysicalTemplate::execute_slice (this=0x7fd8259474e0, slice_idx=0, recurrent_replay=false)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/legion/legion_trace.cc:4703
#10 0x00007fd8ba870e40 in Legion::Internal::PhysicalTemplate::handle_replay_slice (args=0x7fd8183ab600) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/legion/legion_trace.cc:7998
#11 0x00007fd8baca3134 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fd8183ab600, arglen=20, userdata=0x447bd10, userlen=8, p=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/legion/runtime.cc:32556
#12 0x00007fd8bb72bd44 in Realm::LocalTaskProcessor::execute_task (this=0x445db60, func_id=4, task_args=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/proc_impl.cc:1176
#13 0x00007fd8bb569526 in Realm::Task::execute_on_processor (this=0x7fd8183ab480, p=...) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:326
#14 0x00007fd8bb56d43e in Realm::KernelThreadTaskScheduler::execute_task (this=0x445ded0, task=0x7fd8183ab480) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:1421
#15 0x00007fd8bb56c2bc in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x445ded0) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:1160
#16 0x00007fd8bb56c8d2 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x445ded0) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/tasks.cc:1272
#17 0x00007fd8bb573b4a in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x445ded0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/threads.inl:97
#18 0x00007fd8bb543e73 in Realm::KernelThread::pthread_entry (data=0x7fd81546df90) at /home/hpcc/gitlabci/psaap-ci/artifacts/6619663627/legion/runtime/realm/threads.cc:831
#19 0x00007fd8b77e6ea5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fd8b7d01b0d in clone () from /lib64/libc.so.6
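
For context, the failing assertion is a defensive check on a map lookup during physical template replay: a replayed copy instruction looks up the operation that owns it and expects the entry to be present. The sketch below shows the general shape of that check only; the type and variable names are stand-ins, not the actual Legion data structures.

```cpp
// Minimal sketch of the failing pattern, NOT the actual Legion code: all
// type and variable names here are hypothetical stand-ins.
#include <cassert>
#include <map>

struct MemoizableOp {};          // stand-in for the recorded operation
using TraceLocalID = unsigned;   // stand-in key type for this sketch

void execute_instruction(
    const std::map<TraceLocalID, MemoizableOp*> &operations,
    TraceLocalID owner)
{
  auto finder = operations.find(owner);
  // The reported crash corresponds to this check failing: the owning
  // operation was never registered for the replayed slice, so the
  // lookup returns operations.end().
  assert(finder != operations.end());
  (void)finder->second;          // only safe to dereference after the assert
}
```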

@lightsighter

note to self: this is channel flow 8x2x2

Make me a reproducer on Sapling. You guys had two months to test this. Why are you just reporting it now?

It doesn't reproduce on Sapling with that configuration (ChannelFlow, 8x2x2, 4 nodes, GPUs, debug mode), using HTR Develop branch commit fbaf5141 and Legion commit 12d5a56.

I also cannot reproduce it on Sapling or on Lassen. The cluster where I found the error is down this week, so I will need to check again once it's back up.

This error should be deterministic when it does occur. Are we sure we are running exactly the same configuration on both machines?

No longer able to reproduce