Memory leak / large overhead vs scoped_threadpool
Closed this issue · 6 comments
We had a brief exchange on reddit regarding scoped_threadpool vs jobsteal. I saw your reply to my question about jobsteal vs scoped_threadpool rather late:
https://www.reddit.com/r/rust/comments/47oetc/help_with_simple_array_parallelism/
I gave jobsteal another shot and quickly converted my software rasterizer to use it (it's just three places where a few function names have to be swapped etc.):
https://github.com/blitzcode/rust-exp
https://github.com/blitzcode/rust-exp/blob/master/rs-src/rasterizer.rs
Performance goes down quite a bit, but there's something else happening that's more confusing. While the actual CPU utilization jumps up a lot on my quad-core machine (vs scoped_threadpool), the rendering time also becomes incredibly uneven. It's basically stuttering massively, and the threading code appears to stall/spin for long periods every once in a while. Looking at the profiler, most time is spent in `Worker::run_next()`.
Also, I can see in the profiler that memory usage is constantly increasing while the program runs; with scoped_threadpool it stays constant.
Thanks for reporting this. I really appreciate you field-testing jobsteal, since it's often hard to come up with tests that accurately reflect real-world usage.
I'm not completely sure what's causing the stuttering, but I'm almost certain the memory leak comes from the arenas used to allocate jobs. When an arena is full, it allocates a new one with double the size and copies over all the data. But it doesn't deallocate the old buffer -- instead, it holds onto it in a linked list of old buffers until the pool can prove that all jobs have been run. For now, that only happens on a call to `synchronize`, but I have been planning to have workers auto-detect when all work is done so that part of the API can be removed.
The benefit of such an allocation scheme is that all job data is allocated in contiguous memory, which is usually a Good Thing. Unfortunately, the space used by creating N jobs (ever!) converges to 2N until a call to `synchronize`. For practical purposes, that's probably not acceptable.
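To illustrate the scheme described above, here is a minimal, hypothetical sketch (not jobsteal's actual code) of an arena that doubles when full, copies the live data across, and stashes the old buffer until a `synchronize`-style call proves no job still points into it:

```rust
// Hypothetical sketch of the growth strategy described above.
struct Arena {
    current: Vec<u8>,
    len: usize,
    // Old buffers are retained, not freed, because queued jobs may
    // still reference data inside them.
    old_buffers: Vec<Vec<u8>>,
}

impl Arena {
    fn new(capacity: usize) -> Self {
        Arena { current: vec![0; capacity], len: 0, old_buffers: Vec::new() }
    }

    fn alloc(&mut self, size: usize) -> &mut [u8] {
        if self.len + size > self.current.len() {
            // Grow by doubling and copy the live data across; the old
            // buffer is stashed rather than deallocated.
            let mut bigger = vec![0u8; (self.current.len() * 2).max(self.len + size)];
            bigger[..self.len].copy_from_slice(&self.current[..self.len]);
            let old = std::mem::replace(&mut self.current, bigger);
            self.old_buffers.push(old);
        }
        let start = self.len;
        self.len += size;
        &mut self.current[start..start + size]
    }

    // Once the pool can prove all jobs have run, the retained
    // buffers can finally be dropped.
    fn synchronize(&mut self) {
        self.old_buffers.clear();
    }
}
```

Without the `synchronize` call, the retained buffers sum to roughly the size of the current one, which is where the ~2N space figure comes from.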
I'm fairly certain I've fixed the memory leak on my local branch, but the performance is still worse than scoped_threadpool -- by more than half on my machine.
Honestly, I think work-stealing is just not that great for distributing these kinds of loops across cores, at least when they're written in a linear fashion. This is the result of a few factors:
- idle workers choose targets to steal from randomly
- all the jobs are on a single worker's queue
- steal operations are expensive (requiring a few atomic operations and a full memory barrier)
Recursive work splitting means any stolen job creates new jobs on the thief's queue, so jobs are spread out better across cores and less time is wasted on steal operations. Rayon uses recursive splitting under the hood in its parallel iterators, and I've been toying with the idea of a similar abstraction for jobsteal.
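The recursive-splitting idea can be sketched with plain `std::thread::scope` (this is illustrative only, not jobsteal's API): each level halves the range and forks one half, so a thief that steals a half continues splitting on its own queue instead of contending on the original worker's:

```rust
use std::thread;

// Illustrative fork-join splitting: recursively halve the slice and
// process one half on a spawned thread. In a real work-stealing pool
// the spawn would enqueue a job that a thief could steal and split
// further on its own queue.
fn parallel_apply(data: &mut [u32], grain: usize) {
    if data.len() <= grain {
        // Small enough: do the leaf work sequentially.
        for x in data.iter_mut() {
            *x *= 2;
        }
        return;
    }
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    thread::scope(|s| {
        s.spawn(|| parallel_apply(left, grain));
        parallel_apply(right, grain);
    });
}
```

Contrast this with pushing N independent loop-iteration jobs onto one worker's queue, where every other core must pay a steal for each piece of work it obtains.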
The whole `scope` API is vaguely misleading in this sense, since it makes it extremely easy to use the pool sub-optimally. It might be best to remove it altogether and replace it with something equally ergonomic but more sensible.
Thanks for the quick response ;-) My 'repro case' is a bit unwieldy because it's a program written in two different languages. I forgot to mention it in the original bug report, but there is a branch called jobsteal with the changes relative to scoped_threadpool. I think you're probably right and my use case is too simplistic for a work-stealing system to make sense. I just need some basic OpenMP-style, statically scheduled data parallelism.
I was testing against your jobsteal branch when I wrote my earlier posts. Also, the memory leak fix I mentioned above has landed in 0.4.2.
Your workload is probably fine for work-stealing, although static scheduling would probably beat anything else -- this is more of a jobsteal API issue than anything else; it just hasn't surfaced until now. Doing parallel-for naively, which the jobsteal API encourages, really hurts work-stealing's efficiency. You could probably rewrite all your multithreaded code to operate in a fork-join manner, but it would take a lot more effort, especially once you throw in iterators like `zip`. Rayon has the right idea here: emulate the iterator API and handle splitting at the bottom level.
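To make "splitting at the bottom level" concrete, here is a tiny sketch in the spirit of Rayon's parallel iterators (all names here are hypothetical, not Rayon's or jobsteal's API): the user writes iterator-style code against a splittable range, and the driver decides where to split and when to fall back to sequential execution:

```rust
// A minimal "splittable range": the library, not the user, decides
// how the iteration space is divided.
struct SplitRange {
    start: usize,
    end: usize,
}

impl SplitRange {
    fn len(&self) -> usize {
        self.end - self.start
    }

    // Halve the range; a real scheduler would hand one half to a
    // forked job that thieves can steal.
    fn split(self) -> (SplitRange, SplitRange) {
        let mid = self.start + self.len() / 2;
        (
            SplitRange { start: self.start, end: mid },
            SplitRange { start: mid, end: self.end },
        )
    }
}

// Drive the splitting: recurse until pieces reach the grain size,
// then run the user's closure sequentially over each piece.
fn drive(range: SplitRange, grain: usize, f: &mut impl FnMut(usize)) {
    if range.len() <= grain {
        for i in range.start..range.end {
            f(i);
        }
    } else {
        let (a, b) = range.split();
        drive(a, grain, &mut *f); // a real pool would fork here
        drive(b, grain, &mut *f);
    }
}
```

The point is that the parallel-for surface stays ergonomic while the recursion underneath produces the fork-join shape that work-stealing schedules well.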
Part of me wants to just copy parallel iterators, since they'd work great with jobsteal. On the other hand, I'd probably feel bad about copying them outright, so I might instead come up with a similar but distinct approach.
Ok, good news then ;-)
The rasterization code itself might benefit from dynamic scheduling to some degree, as the workload per tile is hard to predict. I had experimented with making it hierarchical, which would result in fork-join style parallelism.
I'll probably give Rayon a shot as well. I'd also like to dive into writing my own concurrency primitives, but there are still a few things I need to learn about Rust before opening that can of worms...
I think the primary issue here (the memory leak) has been fixed. I've opened a new issue to discuss making jobsteal's API more friendly to the backing work-stealing scheduler.