Standardised API for sharing thread pools
Wodann opened this issue · 15 comments
In working group meeting #67, @kabergstrom mentioned that several crates that use thread pools (e.g. Rayon, Tokio) leave time slicing to the OS and as such are at risk of falling outside of the Rust game ecosystem. More concretely, a solution would let the user have control over multiplexing executor work onto OS threads.
To resolve this issue, they proposed designing a standardised API for sharing thread pools in the spirit of raw-window-handle.
There is a Reddit discussion in which we are gauging interest.
@repi If the information was relayed correctly, having crates implement a trait like this would solve your issues with crates that implement their own executor. Do you have any concerns that are not covered by the proposed trait?
As the gamedev WG, it would be great to also discuss the needs specific to engines and games. In particular, how to handle priorities of certain tasks and potential pinning to specific threads, which would not be covered by the current SpawnExt trait imo.
I have a rough proposal for an API. The idea is to provide an API that lets the user be in control of when each executor crate runs work, to provide some control around time budgets and to let the user be in control of how executors are multiplexed on OS threads.
use std::task::Waker;
use std::time::Duration;

/// This is implemented by tokio/rayon/async-std for their executors.
/// The user builds as many workers as desired and places them onto threads as desired.
trait WorkerBuilder {
    /// Builds a worker that can notify the user of available work through the provided Waker.
    fn build_worker(waker: Waker) -> Box<dyn Worker>;
}

/// Implemented by tokio/rayon/async-std. This is the executor itself, which polls futures or runs queued tasks.
trait Worker {
    /// Polls the worker, doing work if available.
    /// `time_budget` indicates the caller's desire for the executor to finish within that duration.
    /// May return a Duration indicating the worker's desire to be polled again once it expires.
    fn poll(&mut self, time_budget: Option<Duration>) -> Option<Duration>;
}
Usage (after creating the Workers and spawning the worker threads):
// `Parker` is assumed here to be something like crossbeam's `sync::Parker`.
fn worker_thread(parker: Parker, mut workers: Vec<Box<dyn Worker>>, worker_time_budget: Duration) {
    loop {
        // In more complex cases, you may want to prioritize work as Wodann said,
        // and only poll the Workers that are most important, for example frame job workers.
        for worker in workers.iter_mut() {
            // Should keep track of each worker's wakeup timeout desire and wake as appropriate.
            worker.poll(Some(worker_time_budget));
        }
        // The Wakers provided to Workers would unpark this Parker.
        parker.park();
    }
}
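To make the polling contract a bit more concrete, here is a minimal sketch using only the standard library. `QueueWorker`, `ThreadWaker` and `waker_for_current_thread` are hypothetical names made up for illustration, and `std::thread::park`/`unpark` stands in for the `Parker`; none of this is an API that tokio, rayon or async-std actually provide.

```rust
use std::collections::VecDeque;
use std::sync::Arc;
use std::task::{Wake, Waker};
use std::thread::{self, Thread};
use std::time::{Duration, Instant};

/// The `Worker` trait from the proposal.
trait Worker {
    fn poll(&mut self, time_budget: Option<Duration>) -> Option<Duration>;
}

/// Hypothetical toy executor: a FIFO queue of boxed closures.
struct QueueWorker {
    jobs: VecDeque<Box<dyn FnOnce() + Send>>,
    waker: Waker,
}

impl QueueWorker {
    fn new(waker: Waker) -> Self {
        QueueWorker { jobs: VecDeque::new(), waker }
    }

    /// Queues a job and notifies whichever thread is driving this worker.
    fn push(&mut self, job: impl FnOnce() + Send + 'static) {
        self.jobs.push_back(Box::new(job));
        self.waker.wake_by_ref();
    }
}

impl Worker for QueueWorker {
    fn poll(&mut self, time_budget: Option<Duration>) -> Option<Duration> {
        let start = Instant::now();
        while let Some(job) = self.jobs.pop_front() {
            job();
            // The budget is a request, not a guarantee: a running job is never interrupted.
            if time_budget.map_or(false, |budget| start.elapsed() >= budget) {
                break;
            }
        }
        // Ask to be polled again right away if work is left over.
        if self.jobs.is_empty() { None } else { Some(Duration::ZERO) }
    }
}

/// Waker that unparks one specific OS thread.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn waker_for_current_thread() -> Waker {
    Waker::from(Arc::new(ThreadWaker(thread::current())))
}
```

A worker thread would build its workers with `waker_for_current_thread()`, poll each of them with a budget, and then call `thread::park()` until one of the wakers fires. Note that this sketch only checks the budget between jobs, so a single long job can still overrun it.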
After lots of good discussion in Discord and thinking about it a bit, I think this is a much more difficult problem than what raw-window-handle addresses. I do also perceive that there is some risk of ecosystem split, so I'd like to see something done, but I don't think a solution will come easily. It seems like even defining the problem in a way that everyone completely agrees with is difficult.
Case in point, I was thinking of the problem differently than @kabergstrom. As I understand it, his proposal inserts extensibility at a different layer of the stack than what I had in mind. I think both approaches could be useful and are fairly orthogonal.
The problem as I had it in mind was that many crates send their work directly to a thread pool implementation. So for example:
[Specs/Shred] -> [Rayon] -> [Hardware Threads]
In this example, specs/shred is strongly coupled to rayon. AFAIK there isn't a way to have the work sent to tokio or some other executor.
@bitshifter mentioned PhysX has a solution for this:
https://gameworksdocs.nvidia.com/PhysX/4.1/documentation/physxguide/Manual/Threading.html#cpudispatcher
At first I was thinking we could recommend crates offer an API like this, but this could end up being quite a lot of work for people maintaining them. Crates like rayon are really pleasant and easy to use, allowing code like this: (0..100).into_par_iter().for_each(|x| println!("{:?}", x))
I don't think we would be successful asking people to change from that to rolling their own task delegation layer.
I also think there is potentially a lot of diversity in what kinds of tasks a crate can produce. Tasks could be long/short-running, low/high priority, IO/CPU bound. Sometimes an end-user will want the work generated by an upstream crate to be pinned to a particular thread. Sometimes it's important to allow tasks to stack up to create back pressure and slow down the amount of work an upstream crate is producing. Some tasks are fire-and-forget, and others block code that needs to run immediately after the work is done, possibly using a result from the tasks. Different games might even need to handle work coming from the same upstream crates differently.
So even if upstream crates had a task delegation layer like PhysX, they'd probably have their own small differences, for good reason.
While a utility crate could probably be created to help upstream crates add a task delegation layer, I think it would be difficult to come up with a single interface that expresses every possible usage an upstream crate might need. The communication is actually bidirectional - the crate generating the work has to express what to do, and also be able to listen for a result.
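As a rough illustration of that bidirectional shape, here is a minimal sketch in which the upstream crate describes the task and gets back a handle to listen on. `TaskHints`, `Priority`, `Kind` and `submit` are all hypothetical names invented for this example, and the "scheduler" is just a thread per task; a real one would route on the hints.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Priority { Low, High }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind { CpuBound, IoBound }

/// Hypothetical description of a task, filled in by the upstream crate.
#[derive(Debug, Clone, Copy)]
struct TaskHints {
    priority: Priority,
    kind: Kind,
    /// Index of the thread the task should be pinned to, if any.
    pinned_thread: Option<usize>,
}

/// Hypothetical delegation point. The crate generating the work says what to run
/// and how it should be treated; the returned receiver is how it listens for the result.
fn submit<T, F>(hints: TaskHints, work: F) -> Receiver<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = channel();
    // This naive sketch ignores the hints and spawns one OS thread per task.
    let _ = hints;
    thread::spawn(move || {
        // For fire-and-forget tasks the receiver may already be gone, so ignore send errors.
        let _ = tx.send(work());
    });
    rx
}
```

A caller that needs the result blocks on `recv()`, while a fire-and-forget caller simply drops the receiver. Back pressure, pinning and priorities are exactly the parts this sketch leaves out, which is where the per-crate differences would show up.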
As I mentioned before, this is different from @kabergstrom's approach. I don't think one is better than the other, and I could see both approaches being used at the same time.
Whatever we do, I think it will need to be prototyped and experimented with, and the process won't be as quick and easy as it was for raw-window-handle.
@aclysma I would see this as an internal detail that would not change the user level API of any crate. For example, PhysX doesn't require you to implement their CPU dispatcher API, they provide a default implementation and it doesn't change the high level use of the library. I wouldn't expect this kind of interface to change rayon any more than their current ThreadPool interface, which is used behind the scenes. It is possible, however, that each library has differing requirements, which would make supporting a common interface difficult. To determine that, though, someone would need to audit the crates that have their own thread pools and the kinds of features those thread pools use.
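A minimal sketch of what such an internal dispatcher hook could look like in Rust, loosely modelled on the PhysX idea. `CpuDispatcher`, `ThreadPerTask`, `Inline` and `Physics` are hypothetical names for illustration only, not the PhysX or rayon API.

```rust
use std::sync::mpsc::channel;
use std::sync::Arc;
use std::thread;

type Task = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical dispatcher hook: the library hands tasks to this
/// instead of to a hard-wired thread pool.
trait CpuDispatcher: Send + Sync {
    fn submit(&self, task: Task);
}

/// Default shipped by the library, so ordinary users never see the trait.
struct ThreadPerTask;

impl CpuDispatcher for ThreadPerTask {
    fn submit(&self, task: Task) {
        thread::spawn(task);
    }
}

/// User-supplied alternative: run on the calling thread,
/// e.g. from inside an engine's own job system.
struct Inline;

impl CpuDispatcher for Inline {
    fn submit(&self, task: Task) {
        task()
    }
}

struct Physics {
    dispatcher: Arc<dyn CpuDispatcher>,
}

impl Physics {
    fn new() -> Self {
        Self::with_dispatcher(Arc::new(ThreadPerTask))
    }

    fn with_dispatcher(dispatcher: Arc<dyn CpuDispatcher>) -> Self {
        Physics { dispatcher }
    }

    /// The high-level API is the same whichever dispatcher is plugged in.
    fn step(&self, bodies: u64) -> u64 {
        let (tx, rx) = channel();
        for i in 0..bodies {
            let tx = tx.clone();
            self.dispatcher.submit(Box::new(move || {
                let _ = tx.send(i * 2);
            }));
        }
        // Drop the original sender so the iterator ends once every task has finished.
        drop(tx);
        rx.iter().sum()
    }
}
```

`Physics::new().step(4)` and `Physics::with_dispatcher(Arc::new(Inline)).step(4)` go through the same public method; only where the tasks run differs.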
Proposal
This proposes a first approach regarding pushing context information from the call site over to libraries. Therefore, this proposal focuses on the library interface only based on the following assumption:
For the caller of a lib function it is sufficient to provide task relevant data at this level of abstraction (e.g. a high-level library task won't spawn low-level library tasks).
This allows splitting the issue of providing an API into two parts:
- Define a guideline on how to define library APIs to pass context specific data
- Provide a common trait for specific tasks
Practical Part
IMO the issue of defining a task API is similar to passing custom allocators down to libraries. As a result, point 1 is the same for both issues (task & allocator), while the 2nd is specific to the problem.
To tackle the 1st point, the proposal would be to have the caller create suballocators and subexecutors (let's call them contexts) and pass these to the library.
Example
let low_task_executor = main_executor.low_priority();
entities.par_iter(&low_task_executor).for_each(|x| { .. });
let linear_allocator = main_allocator.get_linear_allocator(..);
renderer.set_allocator(linear_allocator);
The low_task_executor would implement a common Executor trait and linear_allocator a common Alloc trait.
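A minimal sketch of the executor half of this, assuming a hypothetical `Executor` trait and a `MainExecutor` that hands out priority-tagged sub-executors. All names here are invented for illustration, the jobs run inline rather than on a pool, and `for_each_with` stands in for something like the `par_iter(&low_task_executor)` call in the example.

```rust
use std::cell::RefCell;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Priority { Low, Normal }

/// Hypothetical common trait that libraries would accept.
trait Executor {
    /// Runs `job`. A real implementation would queue it on its thread pool.
    fn execute(&self, job: &mut dyn FnMut());
}

/// Owned by the caller; records which priority each job was submitted with.
struct MainExecutor {
    log: RefCell<Vec<Priority>>,
}

/// Sub-executor (a "context") that tags everything it runs with one priority.
struct SubExecutor<'a> {
    parent: &'a MainExecutor,
    priority: Priority,
}

impl MainExecutor {
    fn new() -> Self {
        MainExecutor { log: RefCell::new(Vec::new()) }
    }

    fn low_priority<'a>(&'a self) -> SubExecutor<'a> {
        SubExecutor { parent: self, priority: Priority::Low }
    }
}

impl<'a> Executor for SubExecutor<'a> {
    fn execute(&self, job: &mut dyn FnMut()) {
        self.parent.log.borrow_mut().push(self.priority);
        job();
    }
}

/// Library side: the context is passed in explicitly instead of using a global pool.
fn for_each_with<T>(items: &[T], executor: &impl Executor, mut f: impl FnMut(&T)) {
    for item in items {
        executor.execute(&mut || f(item));
    }
}
```

The library only ever sees `&impl Executor`, so the caller decides the priority (or, with a different sub-executor, the thread placement) without the library needing to know about it.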
Pros/Cons
Pros:
- Doesn't require a #[global_executor] or further language support
- Quite flexible, but also allows 'simple' interfaces for libraries
Cons:
- Possible limitations for library creators
- Can be quite verbose to pass these around and complicates the API (there are similar issues when designing UI APIs..)
Hi! I was pointed to this discussion and was wondering how the async-std team could help there, potentially working against an ecosystem split. We have also been in touch with other groups around special execution needs, e.g. media streaming.
A little known fact about async-std is that it comes in 3 pieces:
- async-task, a general purpose task allocator, shipped as a library.
- The main API for IO handling
- The runtime
If you compile async-std without the "runtime" flag, you basically get a hollow interface. That allows you to ship your own variant of it, better tuned to your use-cases. This runtime could have specialised spawning interfaces, fulfilling your needs better.
async-std is built with the idea that you may need to choose your own execution model and also gives you ready-made tools to build your own executor. Its default implementation hides all that and gives you no access to the internal runtime, but that also gives you the ability to move to something more special and better geared towards your environment, while not breaking depending libraries.
We'd be very interested in talking about the problem of libraries not abstracting over executors and not being prepared for the presence of multiple executors and want to spend time designing there.
We already have several proposals that we want to prototype with, but as discussed in the wg meeting it'd be good to know the use cases that the prototype API should test:
- Don't leave time slicing to the OS (@kabergstrom)
If any use cases are missing, please list them.
There's a new Repo for the prototypes to be collected into:
https://github.com/rust-gamedev/thread-pool-api-prototypes
Job systems in the wild with focus on the executor part (excludes data dependencies, high level scheduling over multiple frames etc) with a short description:
Name | Reference | Description |
---|---|---|
Parallelizing the Naughty Dog engine using fibers | Slides | - Using fiber based system (rough scale: OS threads (1-10) -> pool of fibers (10-100) -> jobs (100-1000)). - Requires knowledge of the executor or rather execution context by the jobs due to sync primitives (also an issue with futures in general!). - I/O handled in OS threads. - Should allow to spawn jobs inside of jobs and yield to it. Jobs separated into 3 queues based on priorities |
Multithreading the Entire Destiny Engine | Video | System specific thread pool layout (PS3 <-> PS4 <-> XBOX1 <-> ..) |
Marvel's Spider-Man: A Technical Postmortem | Video at ~2min | 2 locked/pinned (?) threads (main and rendering), 4 workers each 3 threads each with different priority (pinned to one core), I/O thread and further ones for audio, physics etc. |
(Ideally, the API should not hinder integration of profiling/debugging middleware like RAD Telemetry)
I found another API example of what a thread pool API might look like in C++ land. Another piece of physics middleware, this time the FEMFX library from AMD - https://gpuopen.com/gaming-product/femfx/
The interface appears to be a bunch of function pointers - https://github.com/GPUOpen-Effects/FEMFX/blob/master/amd_femfx/inc/FEMFXTaskSystemInterface.h
You can see an implementation that has compile time support for UE4's task scheduler, Intel TBB and TLTaskSystem which appears to be FEMFX's own implementation of a task system (see https://github.com/GPUOpen-Effects/FEMFX/blob/master/samples/sample_task_system/TLTaskSystem.cpp)
I thought this was another good example demonstrating usage in a major AAA game engine, in addition to the PhysX interface I mentioned earlier.
https://async.rs/blog/stop-worrying-about-blocking-the-new-async-std-runtime/
This might do too much stuff automatically for it to be considered acceptable by everyone, but it's interesting as a point of reference at least.
The following blogpost highlights a crate that might cover most of this use case:
Executor trait interface: https://github.com/bastion-rs/agnostik