genaray/ZeroAllocJobScheduler

The dependencies thread-stall problem: how to fix?

LilithSilver opened this issue · 5 comments

Consider the following:

  • A scheduler has 2 threads (for simplicity), thread A and thread B
  • Four jobs X, Y, Z, W are scheduled and flushed:
    • Job X has a runtime of 100ms
    • Job Y has a dependency on Job X, and a runtime of 10ms
    • Jobs Z and W have a runtime of 2ms each, and have no dependencies.

In the current system, here's what will happen:

  • Thread A starts working on job X
  • Thread B starts working on job Y
  • After 100ms, thread A completes job X and moves on to job Z
  • Job Y is signaled to continue now that its dependency is resolved
  • Thread A completes job Z after 2ms and picks up job W
  • Thread A completes job W after 2ms
  • After 6ms more, Thread B completes job Y, and we're done.

Clearly, there's a problem here! Thread B was stuck on Job Y, even though it could have moved on to jobs Z and W first with no penalty.

How big of a problem is this? Do you have any ideas for fixing it?

A few of my thoughts:

  • I don't think a full Work Stealing algorithm is relevant here. From what I've read, the advantages of Work Stealing are mostly seen in a fork/join context, but we explicitly cannot fork from within jobs (you can only schedule on the main thread, with good reason).
  • We may be able to draw concepts from Work Stealing. For example, the idea of a Deque instead of a Queue, when done properly, can let heavier jobs go to the top and lighter ones go to the bottom. But I don't know at all how that would be applied.
  • It might help some to insertion-sort the unflushed job queue by dependency count. That would ensure all jobs in a given flush would have mostly-ideal sorting for concurrency. But if the user flushes a lot it would become moot.
  • A priority queue implementation probably wouldn't work, because there would be no way to update the dependency count once the job is added, not without removing and re-adding jobs. That sounds like a concurrency nightmare.
  • One option could be that, when a thread picks up a job, it checks whether its dependencies are complete; if not, it puts the job at the back of the queue and tries another (sketched just below). I don't know if this would be efficient, though, or at what point the thread should give up and just await a dependency handle (to avoid busy-waiting/spinning). And by that point the queue might be in an even worse state.
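
To make that last bullet concrete, here's a minimal sketch of the requeue-and-retry idea. Everything in it is hypothetical: QueuedJob stands in for whatever the queue actually holds (JobMeta in the real scheduler), its members are made up for illustration, and maxAttempts is just a knob to cap the spinning.

```csharp
using System;
using System.Collections.Concurrent;

// Stand-in for whatever the ready queue holds (JobMeta in the real scheduler);
// these members are purely illustrative, not the library's.
record QueuedJob(Func<bool> DependenciesComplete, Action Execute);

static class RequeueIdea
{
    // Dequeue a job; if its dependencies are unmet, push it to the back and
    // try the next one, giving up after maxAttempts to avoid busy-spinning.
    public static bool TryRunNextJob(ConcurrentQueue<QueuedJob> jobs, int maxAttempts)
    {
        for (var attempt = 0; attempt < maxAttempts; attempt++)
        {
            if (!jobs.TryDequeue(out var job))
                return false;                  // queue is empty

            if (job.DependenciesComplete())
            {
                job.Execute();
                return true;
            }

            jobs.Enqueue(job);                 // not ready: requeue it, look at the next one
        }

        // Everything we looked at was blocked; at this point the thread would
        // have to fall back to waiting on a dependency handle instead of spinning.
        return false;
    }
}
```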

Any ideas, here? Nothing particularly good is coming to mind; just half-solutions.

This is really tricky. Logically, we would have to work the "tree" from the bottom up to avoid locks. The endpoints would have to be at the top of the queue (or stack?) and be processed. Maybe it would even make sense if each thread gets its own queue and we make sure that all dependencies are processed by the same thread? But that would not be so clean either... I would be interested in how Unity solves this problem :o

Here's one paper that may be of use:

https://tsung-wei-huang.github.io/papers/icpads20.pdf

I haven't dug through it in detail but the first bit sounds very promising, and similar to what we have already.

Of note:

When a worker completes a task, it automatically decrements the dependency of successive tasks and pushes new tasks to its queue whenever dependency is met.

This means that, in the paper's implementation, tasks are only added to queues once their dependencies are definitely met. For us, this might turn into an O(n) overhead, where n is the maximum number of concurrent jobs (so ideally not very big), incurred each time a job completes and its successors are checked to see whether they're ready to go. I'm not sure how bad that would be; we'd have to profile an implementation, I believe. The alternative is to associate each Job in the pool with an array of dependents which is updated when more tasks are flushed. We'd be able to keep that non-alloc by recycling the arrays, but it would create more memory pressure.
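
To make the first option concrete, here's roughly what that scan-on-completion step could look like. This is a minimal sketch only: PendingJob and its fields are placeholders I made up for illustration, not the scheduler's real types.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical pool entry for illustration only; not the library's actual type.
class PendingJob
{
    public bool IsComplete;
    public bool IsQueued;
    public List<PendingJob> Dependencies = new();   // jobs this one waits on
}

static class ScanOnComplete
{
    // The first option above: with no per-job dependents list, every completion
    // walks all n in-flight jobs and enqueues the ones whose dependencies are
    // now satisfied. That walk is the O(n) cost per completed job.
    public static void OnJobCompleted(PendingJob completed,
                                      IEnumerable<PendingJob> pool,
                                      ConcurrentQueue<PendingJob> readyQueue)
    {
        completed.IsComplete = true;

        foreach (var job in pool)
        {
            if (job.IsComplete || job.IsQueued)
                continue;

            if (job.Dependencies.All(d => d.IsComplete))
            {
                job.IsQueued = true;
                readyQueue.Enqueue(job);
            }
        }
    }
}
```

The dependents-array alternative avoids this full scan, since a completed job only has to touch the jobs that actually depend on it.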

I thought about it for a bit, read the paper, and I believe the latter approach is correct. It will cause exactly the same memory pressure as #12, and it would fix #12 in the process. So I think the way to go is as follows (a rough C# sketch follows the list):

  • Each Job in the pool has:
    • A list of size n of JobID (n being the max concurrent jobs). It tracks which jobs depend on it (i.e., its children in the dependency graph). Call it Job.Dependents.
    • An integer counting how many jobs it is waiting on before it can begin (i.e. a count of its direct parents in a dependency graph). Call it Job.DependencyCount
  • On Schedule() with some List<JobHandle> dependencies:
    • Lock the pool
    • Add the new job to the pool as job j, reusing the old j.Dependents and j.WaitHandle
    • For each dependency in dependencies...
      • Look up the dependency in the pool, d.
      • If d is already completed, discard the dependency.
      • If it is awaiting completion, increase j.DependencyCount, and add j.ID to d.Dependents.
    • Unlock the pool
  • On Flush():
    • Lock the pool
    • Add any newly scheduled jobs with j.DependencyCount == 0 to ConcurrentQueue<JobMeta> Jobs
    • Unlock the pool
    • Notify the threads with CheckQueueEvent to look at the main queue
  • Threads run the jobs as normal, removing the items from Jobs until no more items remain.
    • Threads Reset() the CheckQueueEvent when they retrieve the last item in the queue (possible concurrency issue here, be very careful! If we reset the event too early, it might leave only one thread active while there's still work in the queue!) [1]
  • On a thread's MarkComplete() of a job j:
    • Lock the pool
    • For each dependent in j.Dependents...
      • Look up the dependent in the pool, d
      • Decrement d.DependencyCount
      • If d.DependencyCount == 0...
        • Enqueue d into Jobs
    • Return j to the pool, using the current algorithm to ensure any subscribers are notified before it's recycled
    • Unlock the pool
    • If we added anything to Jobs, notify CheckQueueEvent
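
To sanity-check the shape of all this, here's a rough C# sketch of the whole flow. Every name is a placeholder (raw int IDs instead of JobHandles, a Dictionary instead of the real pool, new allocations instead of recycled slots), the locking is deliberately coarse, and I've added an IsFlushed flag so a dependent can't be enqueued before its flush, which the list above doesn't spell out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Placeholder pool entry; the field names mirror the list above, not the
// library's real internals.
class Job
{
    public int ID;
    public bool IsComplete;
    public bool IsFlushed;                          // my addition, see the note above
    public int DependencyCount;                     // direct parents still pending
    public readonly List<int> Dependents = new();   // IDs of jobs waiting on this one
    public readonly ManualResetEventSlim WaitHandle = new(false);
    public Action Work = () => { };                 // the user's job body
}

class SchedulerSketch
{
    private readonly object _poolLock = new();
    private readonly Dictionary<int, Job> _pool = new();        // jobID -> pooled slot
    private readonly List<Job> _unflushed = new();              // scheduled, not yet flushed
    private readonly ConcurrentQueue<Job> _jobs = new();        // the ready queue ("Jobs")
    private readonly ManualResetEventSlim _checkQueueEvent = new(false);
    private int _nextId;

    // Schedule(): add the job to the pool and wire up its dependencies.
    public int Schedule(Action work, List<int> dependencies = null)
    {
        lock (_poolLock)
        {
            var j = new Job { ID = _nextId++, Work = work };    // real impl: recycle a slot
            _pool[j.ID] = j;
            _unflushed.Add(j);

            if (dependencies != null)
            {
                foreach (var depId in dependencies)
                {
                    if (!_pool.TryGetValue(depId, out var d) || d.IsComplete)
                        continue;                               // already finished: discard it
                    j.DependencyCount++;                        // one more parent to wait on
                    d.Dependents.Add(j.ID);                     // so d can release us later
                }
            }
            return j.ID;
        }
    }

    // Flush(): only jobs with no outstanding dependencies enter the ready queue.
    public void Flush()
    {
        lock (_poolLock)
        {
            foreach (var j in _unflushed)
            {
                j.IsFlushed = true;
                if (j.DependencyCount == 0)
                    _jobs.Enqueue(j);
            }
            _unflushed.Clear();
        }
        _checkQueueEvent.Set();                                 // wake the worker threads
    }

    // Called by a worker thread after it has run j.Work().
    private void MarkComplete(Job j)
    {
        var enqueuedAny = false;
        lock (_poolLock)
        {
            j.IsComplete = true;
            foreach (var depId in j.Dependents)
            {
                var d = _pool[depId];
                if (--d.DependencyCount == 0 && d.IsFlushed)    // last parent just finished
                {
                    _jobs.Enqueue(d);
                    enqueuedAny = true;
                }
            }
            j.WaitHandle.Set();                 // notify Complete() subscribers
            // Real impl: return j to the pool for recycling here.
        }
        if (enqueuedAny)
            _checkQueueEvent.Set();             // wakes every idle thread: the over-subscription issue below
    }
}
```

The worker loop and the CheckQueueEvent Reset() race flagged at [1] aren't sketched here; that's where a second lock, or an atomic count of queued items, would probably be needed.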

I believe this would fix both #16 and #12. It avoids under-subscription as long as [1] is solved (possibly with a second lock). However, it is only the first step; the second step is to implement the work-stealing algorithm from the paper I linked above. As written, my algorithm creates an over-subscription problem: every sleeping thread is pinged for every single job added to the queue. So threads wake up even when they have nothing to do, contend with each other, and then go back to sleep, adding overhead and hurting power efficiency every single time a job is completed!

So, to summarize, I think the process is...

  • Solve #16 and #12 with this algorithm
  • Once that's done, open a new issue/PR for work-stealing using the paper's algorithm, which is an extension of my main-queue-focused algorithm

Does this all make sense? Would you mind reading through that for a sanity-check?

Sorry, the last few days were busy. I could not answer earlier ^^

First of all, are you also in the Arch Discord? If so, what is your name there? Because of your many contributions I would like to promote you to maintainer, so you also have some permissions here in this repo :)

I will read through the paper "briefly". Your summary sounds logical, though: the dependencies are tracked and processed from the bottom up, and as soon as a "next node" has reached a dependency count of 0, it lands in the queue to be processed.

I even suspect that this adds almost no overhead, or at most very little. I'll read the paper! :)

I'll join the discord, thanks for letting me know about it!