NVIDIA/stdexec

`static_thread_pool` is not work-conserving

ot opened this issue · 12 comments

ot commented

The current implementation of `static_thread_pool` may use only a subset of the available threads even when there are tasks scheduled. The work-stealing mechanism only runs on threads that are running and would otherwise transition to idle; if a thread is already idle, nothing will wake it up even if other threads have non-empty queues (EDIT: sibling threads can also be notified in between tasks, but that depends on waiting for the current task to complete).

The result is that the scheduler is not work-conserving; this can cause both correctness issues and pathological performance.

For an example of incorrectness, the following code should terminate but it instead deadlocks:

  static_thread_pool pool(2);
  auto sched = pool.get_scheduler();

  auto parent = [&] {
    // parent is occupying one thread, but the pool should have more
    // available for this to complete.
    sync_wait(on(sched, just()));
  };
  sync_wait(on(sched, just() | then(parent)));
  std::printf("Done\n");

This happens because the child task is scheduled from a scheduler thread, so it goes directly into that thread's local queue without waking up any of the other threads. Note that the same code does not deadlock with libunifex, whose scheduler is more primitive (a single locked queue) but is work-conserving.
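For illustration only (this is neither the stdexec nor the libunifex code, just a sketch of the contrast): a work-conserving design makes every enqueue responsible for waking an idle worker, so a task scheduled from inside the pool cannot be stranded.

  #include <condition_variable>
  #include <deque>
  #include <functional>
  #include <mutex>

  // Sketch: a single locked queue where every push notifies a sleeping worker,
  // so a task enqueued from inside the pool can never be stranded.
  std::mutex mutex;
  std::condition_variable cv;
  std::deque<std::function<void()>> queue;

  void enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex);
      queue.push_back(std::move(task));
    }
    cv.notify_one();  // an idle worker always learns about the new task
  }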

For an example of pathological performance, the following code spawns N long-running tasks on a pool of N threads, so they should all run in parallel:

  const size_t num_threads = 8;
  static_thread_pool pool(num_threads);
  auto sched = pool.get_scheduler();

  async_scope scope;

  auto parent = [&] {
    std::printf("Parent\n");
    auto child = [] {
      std::printf("Child\n");
      // Simulate a long-running (possibly CPU-intensive) task with a
      // sleep.
      std::this_thread::sleep_for(std::chrono::seconds(1));
    };

    for (size_t i = 0; i < num_threads; ++i) {
      scope.spawn(on(sched, just() | then(child)));
    }
  };

  sync_wait(on(sched, just() | then(parent)));
  sync_wait(scope.on_empty());
  std::printf("Done\n");

Instead, they run sequentially, so the program takes 8 seconds instead of the expected 1 second (libunifex correctly takes 1 second).

Generally, work-stealing schedulers have some mechanism to wake up siblings when a thread queue is non-empty. However, it is tricky to do this efficiently, because this requires frequent cross-thread synchronization; if done naively, the local thread and a sibling will compete for each newly scheduled task. So it may not be easy to fix without affecting performance.
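One common compromise (a sketch of the general idea only, not a proposal tied to the current stdexec internals; pending_count_ and kWakeThreshold are invented names) is to notify a sibling only after some amount of local work has accumulated, so most pushes stay synchronization-free:

  // Sketch: throttled sibling wake-up. Most pushes are local-only; a sibling
  // is notified only every kWakeThreshold tasks, trading a bit of latency for
  // far less cross-thread traffic. A real implementation also has to flush
  // the residual count, e.g. when the worker finishes its current task.
  void push_local(task_base* task) {
    local_queue_.push_back(task);
    if (++pending_count_ >= kWakeThreshold) {
      pending_count_ = 0;
      notify_one_sleeping();  // let an idle sibling start stealing
    }
  }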

Full repro: https://gist.github.com/ot/64712433d33dab75559731f84dba5304

On an unrelated note, the remote queue management seems inefficient: if I am understanding correctly, each remote thread will create a queue for each thread in the scheduler. So if we have two schedulers with N threads scheduling tasks into each other, each will have N^2 remote queues, which seems expensive to poll when collecting remote tasks.

Generally, work-stealing schedulers have some mechanism to wake up siblings when a thread queue is non-empty. However, it is tricky to do this efficiently, because this requires frequent cross-thread synchronization; if done naively, the local thread and a sibling will compete for each newly scheduled task. So it may not be easy to fix without affecting performance.

From my understanding, other threads will be notified if the local bwos-queue has at least one stealable batch. You are correct that each bwos configuration has its own pathological setups. Is there a particular defect in the current implementation where siblings are not woken up when the algorithm says they should be?

On an unrelated note, the remote queue management seems inefficient: if I am understanding correctly, each remote thread will create a queue for each thread in the scheduler. So if we have two schedulers with N threads scheduling tasks into each other, each will have N^2 remote queues, which seems expensive to poll when collecting remote tasks.

Here I agree with you. It was a quick solution which I wanted to address in another pass, which never happened. My focus shifted away from the thread pool. I think in the original bwos paper the authors used a simple pool-wide queue to take work from.

The constructor of the pool should take bwos parameters as an optional argument. If you set the block size to 1 you should get a work-conserving scheduler.

ot commented

The constructor of the pool should take bwos parameters as an optional argument. If you set the block size to 1 you should get a work-conserving scheduler.

With this:

static_thread_pool pool(num_threads, bwos_params{.blockSize = 1});

the deadlock still repros. The other example is resolved, but I can recreate the pathological behavior by tweaking the example a bit and scheduling the tasks in a chain: note that the sleep happens after the scheduling, so all the tasks should start executing almost instantaneously; instead, they run sequentially.

  async_scope scope;

  std::function<void(int)> child = [&](int remaining) {
    std::printf("Child %d\n", remaining);
    if (remaining) {
      scope.spawn(on(sched, just(remaining - 1) | then(child)));
    }
    // Simulate a long-running (possibly CPU-intensive) task with a
    // sleep.
    std::this_thread::sleep_for(std::chrono::seconds(1));
  };

  scope.spawn(on(sched, just(num_threads - 1) | then(child)));
  sync_wait(scope.on_empty());

other threads will be notified if the local bwos-queue has at least one stealable batch

As far as I understand that happens in pop() (-> notify_one_sleeping()), right? So if the thread is already processing a task that enqueues more tasks (and all other threads are idle), those tasks cannot be stolen until the current task completes.

ot commented

But I was incorrect here

The work-stealing mechanism only runs on threads that are running and would otherwise transition to idle; if a thread is already idle, nothing will wake it up even if other threads have non-empty queues

since it also happens on pop(), I'll edit the issue.

I have to take a deeper look, but that sounds like a bug to me. I expect this configuration to work in your examples.
Scheduling a new task from within a worker thread should first try to push into the bwos queue and if this is full, push into some "other place":

  inline void static_thread_pool_::thread_state::push_local(task_base* task) {
    if (!local_queue_.push_back(task)) {
      pending_queue_.push_back(task);
    }
  }

Currently this "pending" queue is a non-stealable, worker-private queue. If you block one worker thread forever, this could again lead to work never being processed. So having a finite queue for stealable work and a worker-private overflow queue seems like a bad combination. I will reconsider this idea.
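One possible direction (purely an assumption at this point, not the current code; shared_overflow_ is an invented member) would be to spill into a pool-wide queue that any worker can drain, combined with a wake-up:

  // Hypothetical variant of push_local: overflow goes to a pool-wide queue
  // that every worker polls, so a blocked worker cannot trap pending tasks.
  inline void static_thread_pool_::thread_state::push_local(task_base* task) {
    if (!local_queue_.push_back(task)) {
      pool_->shared_overflow_.push_back(task);  // stealable by all workers
      notify_one_sleeping();                    // keep the pool work-conserving
    }
  }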

And there should always be a thread that is looking for a chance to steal a work item. IIRC, if a thread goes to sleep, it checks whether it's the last thief and wakes up a sleeping one, if necessary.

But again, I will take a look ASAP, probably tomorrow.

ot commented

Scheduling a new task from within a worker thread should first try to push into the bwos queue and if this is full, push into some "other place"

I think this is a red herring: the local queue in these cases does not get full.

The problem here is that "number of tasks" is not a good heuristic for running time, if individual tasks are unbounded, so there is no guarantee on how frequently the sibling threads can be awoken. I think that there is an implicit assumption that tasks are small, so both the inability to steal part of a block and the possibly infrequent opportunities to steal are not a large problem. But if we have arbitrarily large tasks, some pathologies can emerge.

IIRC, if a thread goes to sleep, it checks whether it's the last thief and wakes up a sleeping one, if necessary.

I might have missed this. So does the polling continue indefinitely, even if all the queues are empty?

I might have missed this. So does the polling continue indefinitely, even if all the queues are empty?

Yes, IIRC. The invariant was, if possible, there is always at least one thief.

See clear_stealing()

  inline void static_thread_pool_::thread_state::clear_stealing() {
    if (pool_->numThiefs_.fetch_sub(1, std::memory_order_relaxed) == 1) {
      notify_one_sleeping();
    }
  }

Scheduling a new task from within a worker thread should first try to push into the bwos queue and if this is full, push into some "other place"

I think this is a red herring: the local queue in these cases does not get full.

I understand this. But I think even if we fix the bug and your example works for block size = 1, the worker-private pending queue is still a source of problems.

The problem here is that "number of tasks" is not a good heuristic for running time, if individual tasks are unbounded, so there is no guarantee on how frequently the sibling threads can be awoken. I think that there is an implicit assumption that tasks are small, so both the inability to steal part of a block and the possibly infrequent opportunities to steal are not a large problem. But if we have arbitrarily large tasks, some pathologies can emerge.

Yes, I think the whole bwos idea optimizes for lots of small tasks. Optimal usage requires choosing the bwos parameters for the application.
Nevertheless, I expected blocks of size 1 to work.

IIRC, if a thread goes to sleep, it checks whether it's the last thief and wakes up a sleeping one, if necessary.

I might have missed this. So does the polling continue indefinitely, even if all the queues are empty?

I think the problem is that when I execute the task it has already been dequeued, so there is room for its continuation in the unstealable slot. I will change the order to first execute, then dequeue.
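In pseudocode, the intended change (sketched here, not the actual worker loop):

  // Assumed current order: dequeue first, then execute. The pop frees a slot
  // in the owner's (unstealable) part of the bwos queue, and a continuation
  // spawned by the task lands right back in it, invisible to thieves.
  //   task_base* task = local_queue_.pop_back();
  //   task->execute();
  //
  // Proposed order: execute first, then dequeue. While the task runs, its
  // slot is still occupied, so a continuation it spawns has to go somewhere
  // it can be stolen (or trigger a sibling wake-up).
  //   task_base* task = local_queue_.back();
  //   task->execute();
  //   local_queue_.pop_back();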

ot commented

The invariant was, if possible, there is always at least one thief

I see, but this means that we're constantly spinning through the threads even when there is no work to do. I just verified that a program that creates the pool and then sleeps without enqueuing any tasks runs at 100% CPU utilization (1 core worth of CPU). Is that desirable?
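The check was essentially something like this:

  // No tasks are ever enqueued, yet one core stays busy the whole time
  // because the pool always keeps at least one thread in the stealing loop.
  static_thread_pool pool(8);
  std::this_thread::sleep_for(std::chrono::seconds(30));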

The invariant was, if possible, there is always at least one thief

I see, but this means that we're constantly spinning through the threads even when there is no work to do. I just verified that a program that creates the pool and then sleeps without enqueuing any tasks runs at 100% CPU utilization (1 core worth of CPU). Is that desirable?

A totally empty pool should probably be put to sleep. :-)

I'm not so sure about the other cases, but I'm happy to revisit the spin loops in any case. They should at least emit pause instructions and yield.
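Something along these lines, as a rough sketch (not tied to the current loop structure):

  #include <thread>

  // Rough backoff sketch for the stealing loop: a few PAUSE-hinted spins,
  // then yields, and beyond that the thief should park via the pool's
  // existing sleep/wake mechanism instead of burning a core.
  inline void spin_backoff(unsigned attempt) {
    if (attempt < 64) {
  #if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
      __builtin_ia32_pause();      // x86 PAUSE hint; other targets just spin
  #endif
    } else if (attempt < 128) {
      std::this_thread::yield();   // give the OS a chance to schedule others
    } else {
      // at this point the thief should go to sleep and rely on being woken
    }
  }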

All suggestions, or even PRs, are highly appreciated.

ot commented

All suggestions, or even PRs, are highly appreciated.

I wish I had suggestions 😄 I noticed this because I'm looking at a few work-stealing implementations to see how they deal with sibling wake-ups, and all I've seen so far is different variations of polling, possibly throttled to trade off some latency for overhead.