Realm: cancel_operation crashes with profiling

Question

eddy16112 opened this issue 7 months ago · 3 comments

I tried to replace the unreliable sleep in the test_profiling test with events.
The original code:

    cargs.sleep_useconds = 5000000;
    Event e4 = task_proc.spawn(CHILD_TASK, &cargs, sizeof(cargs), prs);
    sleep(2);
    int info = 111;
    e4.cancel_operation(&info, sizeof(info));
    bool poisoned = false;
    e4.wait_faultaware(poisoned);
    assert(poisoned);

The new one:

    cargs.sleep_useconds = 5000000;
    UserEvent u = UserEvent::create_user_event();
    cargs.wait_on = u;
    UserEvent trigger_event = UserEvent::create_user_event();
    cargs.trigger_event = trigger_event;
    Event e4 = task_proc.spawn(CHILD_TASK, &cargs, sizeof(cargs), prs);
    trigger_event.wait();
    int info = 111;
    // Cancel after CHILD_TASK has launched (ensured by trigger_event.wait())
    // but before it has finished (it cannot finish until u.trigger()).
    e4.cancel_operation(&info, sizeof(info));
    u.trigger();
    bool poisoned = false;
    e4.wait_faultaware(poisoned);
    assert(poisoned);

However, a new bug is triggered: https://gitlab.com/StanfordLegion/legion/-/jobs/5856518802

test_profiling: /builds/StanfordLegion/legion/runtime/realm/tasks.cc:1189: void Realm::ThreadedTaskScheduler::scheduler_loop(): Assertion `yield_to != Thread::self()' failed.
Signal 6 received by node 1, process 19525 (thread 7f7f6432ec00) - obtaining backtrace
Signal 6 received by process 19525 (thread 7f7f6432ec00) at: stack trace: 9 frames
  [0] = unknown symbol at unknown file:0 [00007f7f895b141f]
  [1] = raise at ../sysdeps/unix/sysv/linux/raise.c:51 [00007f7f88fa100b]
  [2] = abort at /build/glibc-wuryBv/glibc-2.31/stdlib/abort.c:79 [00007f7f88f80858]
  [3] = __assert_fail_base.cold at /build/glibc-wuryBv/glibc-2.31/assert/assert.c:92 [00007f7f88f80728]
  [4] = __assert_fail at /build/glibc-wuryBv/glibc-2.31/assert/assert.c:101 [00007f7f88f91fd5]
  [5] = Realm::ThreadedTaskScheduler::scheduler_loop() at /builds/StanfordLegion/legion/runtime/realm/tasks.cc:1189 [00005591947c59f0]
  [6] = void Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop>(void*) at /builds/StanfordLegion/legion/runtime/realm/threads.inl:97 [00005591947ce94d]
  [7] = Realm::UserThread::uthread_entry() at /builds/StanfordLegion/legion/runtime/realm/threads.cc:1355 [00005591947de1a7]
  [8] = unknown symbol at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 [00007f7f88fb94df]

Here is the PR to reproduce the bug: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1049
We decided to disable the cancel_operation test cases and use this issue to track the bug.
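
For readers unfamiliar with the test, the ordering that the new code relies on (cancel only after the child has started, and release the child only after the cancel has been issued) can be sketched in standard C++. This is only an analogy under stated assumptions: `ChildArgs`, `child_task`, and `run_cancel_handshake` are hypothetical names, the futures stand in for the two Realm user events, and an atomic flag stands in for `cancel_operation`. It does not use Realm or reproduce Realm's cancellation or poisoning semantics.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <thread>

// Hypothetical stand-in for the test's child task arguments (not Realm types).
struct ChildArgs {
  std::promise<void> started;          // plays the role of cargs.trigger_event
  std::shared_future<void> wait_on;    // plays the role of cargs.wait_on
  std::atomic<bool> cancelled{false};  // stands in for the cancel request
};

// Child: announce that it is running, then block until the parent releases it.
// Returns true if the cancel request was visible when it resumed.
bool child_task(ChildArgs &args) {
  args.started.set_value();
  args.wait_on.wait();
  return args.cancelled.load();
}

// Parent: issue the cancel strictly between "child started" and "child released".
bool run_cancel_handshake() {
  std::promise<void> release;
  ChildArgs args;
  args.wait_on = release.get_future().share();
  assert(args.wait_on.valid());  // the child must have something to block on
  std::future<void> started = args.started.get_future();

  bool poisoned = false;
  std::thread child([&] { poisoned = child_task(args); });

  started.wait();              // analogous to trigger_event.wait()
  args.cancelled.store(true);  // analogous to e4.cancel_operation(...)
  release.set_value();         // analogous to u.trigger()
  child.join();                // analogous to e4.wait_faultaware(poisoned)
  return poisoned;
}
```

Because the child cannot pass `wait_on` until the parent has set the flag, `run_cancel_handshake()` returns true on every run, which is the deterministic behavior the `sleep(2)` version could only approximate.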

Answer 1 · 2024-01-23T02:36:27.000Z

However, a new bug is triggered

This seems like a real bug, doesn't it? That assertion is in the task scheduler and is saying that we're not on the thread that we thought we were on.

Answer 2 · 2024-01-23T23:46:17.000Z

Yes, it is a real bug, provided nothing is wrong in my test code. We do not have stress tests for canceling events, so we did not catch the bug before. I just created this issue to remind us to pick the bug up later.

Answer 3 · 2024-01-24T06:02:56.000Z

Ok, sounds good.