improve performance at scale of `exec::static_thread_pool`
ericniebler opened this issue · 6 comments
The design of stdexec's `static_thread_pool` is similar to the one in libunifex, which apparently suffers at scale. It's safe to assume that our `static_thread_pool` has similar issues.
Investigate the issues discussed here and implement a solution that performs better.
See this reddit comment from someone who believes that techniques used by the Go scheduler would be applicable here.
Why doesn't every worker thread have its own queue that can never be accessed by other threads? If `push` is called from a worker thread, then the work gets put onto that thread's queue with no locking needed. (If `push` is called from a non-worker thread we do what we're doing now.) When `pop` is called, we first try to dequeue an item from the thread-local queue, and if it's empty, we fall back to pulling from one of the shared queues.
I suppose this risks starvation of work items placed in the shared queues, which are essentially treated as if they have lower priority. We could probably tune that by having the worker threads occasionally pull from the shared queue even if their local queue has work to do. Or else have one or more threads only pulling from the shared queues.
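A minimal sketch of what that per-worker queue could look like, assuming a single shared queue protected by a mutex. The names `worker`, `local_`, `shared_`, and the `poll_interval` knob are made up for illustration and are not part of stdexec's actual pool:

```cpp
// Hypothetical sketch: per-worker local queue plus a periodic shared-queue
// poll so remotely-enqueued work is not starved.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>

using task = std::function<void()>;

class worker {
  std::deque<task> local_;       // only ever touched by this worker thread
  std::deque<task>& shared_;     // shared with producers; needs the lock
  std::mutex& shared_mutex_;
  unsigned pop_count_ = 0;
  static constexpr unsigned poll_interval = 61;  // tuning knob: how often to
                                                 // check the shared queue even
                                                 // when local work is available
 public:
  worker(std::deque<task>& shared, std::mutex& m)
    : shared_(shared), shared_mutex_(m) {}

  // Called from this worker thread only: no locking needed.
  void push_local(task t) { local_.push_back(std::move(t)); }

  std::optional<task> pop() {
    // Occasionally service the shared queue first so work pushed from
    // non-worker threads cannot be starved behind locally-produced tasks.
    if (++pop_count_ % poll_interval == 0) {
      if (auto t = pop_shared()) return t;
    }
    if (!local_.empty()) {
      task t = std::move(local_.front());
      local_.pop_front();
      return t;
    }
    return pop_shared();  // local queue empty: fall back to the shared queue
  }

 private:
  std::optional<task> pop_shared() {
    std::lock_guard lk{shared_mutex_};
    if (shared_.empty()) return std::nullopt;
    task t = std::move(shared_.front());
    shared_.pop_front();
    return t;
  }
};
```

The `poll_interval` counter is one way to express the "occasionally pull from the shared queue" idea: every Nth `pop()` checks the shared queue before the local one.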
> We could probably tune that by having the worker threads occasionally pull from the shared queue even if their local queue has work to do. ... Or else have one or more threads only pulling from the shared queues.
This could introduce performance regressions on systems with NUMA. If we consider `bulk` in `repeat_n`, having threads pulling from non-local queues would introduce NUMA traffic.
@ericniebler I don't think it is necessary to invent a scheduler from scratch. I would like to bring to your attention the following work: https://github.com/taskflow/taskflow. It has a highly efficient task-stealing scheduler and is distributed under the MIT license. If a similarly efficient scheduling algorithm were available in stdexec, that would be great. The publication about it is here: https://tsung-wei-huang.github.io/papers/tpds21-taskflow.pdf. When it comes to task stealing on machines with multiple NUMA domains, one can prioritize stealing from queues in the same NUMA domain and, only if that is unsuccessful, go on to steal from queues in another NUMA domain. This would effectively minimize the NUMA traffic that @senior-zero mentioned.
EDIT: Modern CPUs effectively exhibit NUMA-like behavior even within a single NUMA domain. Take AMD's latest CPUs: their cores come in groups of eight on a single chiplet. This idea of stealing from close neighbors first can be applied hierarchically at those scales as well.
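A rough sketch of what NUMA-aware victim selection could look like. The `steal()` member on the per-worker queue and the `numa_node_of` mapping are hypothetical placeholders, not an existing stdexec or Taskflow API:

```cpp
// Hypothetical sketch: try to steal from victims on the thief's own NUMA
// node first, and only fall back to remote nodes if that fails.
#include <cstddef>
#include <vector>

template <class Queue>
auto steal_numa_aware(
    std::size_t thief,
    std::vector<Queue>& queues,             // one work-stealing queue per worker
    std::vector<int> const& numa_node_of)   // worker index -> NUMA node id
    -> decltype(queues[0].steal())          // assumes steal() returns an
{                                           // optional-like result
  // Pass 0: victims on the thief's own NUMA node; pass 1: everyone else.
  for (int pass = 0; pass < 2; ++pass) {
    for (std::size_t victim = 0; victim < queues.size(); ++victim) {
      if (victim == thief) continue;
      bool same_node = numa_node_of[victim] == numa_node_of[thief];
      if ((pass == 0) != same_node) continue;  // wrong pass for this victim
      if (auto t = queues[victim].steal())
        return t;
    }
  }
  return {};  // nothing to steal anywhere
}
```

A hierarchical variant would simply add more passes ordered by distance (same chiplet, then same NUMA node, then remote nodes), along the lines of the EDIT above.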
> I don't think it is necessary to invent a scheduler from scratch. I would like to bring to your attention the following work: https://github.com/taskflow/taskflow. It has a highly efficient task-stealing scheduler.
That looks very cool. Mos def we should take good ideas (and good OSS code) from wherever we can get them. There's also value in having a simple, efficient-ish thread pool scheduler as an example. Maybe something based on the taskflow scheduler design could serve both needs?
Just wanna ping @tsung-wei-huang, the developer of TaskFlow. Earlier this summer I talked to Tsung-Wei about exploring senders + TaskFlow. Let me know if there is any collaboration opportunity : )