
Serial computation bias vs. a scan-line parallelized approach that just has a different kind of bias.

tugrul512bit opened this issue

Since particles can be anywhere on screen and the updates traverse the particle array in the same order every frame, the serial update has a bias that produces subtly skewed results, even if they are not easily visible. So, would it matter if we instead had multiple threads, each with its own bias, that never touch each other's borders?

Is it possible to divide the screen into scan-lines and give each thread its own 2-3 rows of pixels, with a 2-3 pixel gap from the other threads? Would this stop each thread from writing onto other threads' private pixels? I mean: first compute all the scan-line bands, then compute all the gaps, so there are separator bands between all threads at every moment. Is there any side effect that could cause a visible simulation bias compared to the whole-screen bias?
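Something like this minimal sketch is what I have in mind (`BAND`, `GAP`, `updateRow()` and the resolution are just placeholders, not actual TPT code):

```cpp
// Two-phase scan-line sketch. Phase 1 updates every stripe's private band
// in parallel; phase 2 updates the separator gaps. This only avoids write
// conflicts if no element reads or writes further than min(BAND, GAP)
// rows away -- which, per the reply below, TPT's extreme elements violate.
#include <algorithm>
#include <thread>
#include <vector>

constexpr int YRES   = 384;        // assumed simulation height
constexpr int BAND   = 3;          // private rows per stripe (phase 1)
constexpr int GAP    = 3;          // separator rows (phase 2)
constexpr int STRIDE = BAND + GAP;

void updateRow(int y) { /* hypothetical: update all particles on row y */ }

// Update rows [offset, offset + height) of every stripe, with stripes
// spread round-robin over the worker threads.
void updatePhase(int offset, int height)
{
    const int n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (int t = 0; t < n; ++t)
        pool.emplace_back([=]
        {
            // Concurrently updated regions are always separated by rows
            // that no thread touches during this phase.
            for (int y0 = t * STRIDE + offset; y0 < YRES; y0 += n * STRIDE)
                for (int y = y0; y < std::min(y0 + height, YRES); ++y)
                    updateRow(y);
        });
    for (auto &th : pool)
        th.join();
}

int main()
{
    updatePhase(0, BAND);   // phase 1: all private bands, in parallel
    updatePhase(BAND, GAP); // phase 2: all gaps, also in parallel
}
```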

In any part of the simulation step, do far-away particles ever require each other's data within the same update logic? What about the pressure & heat simulations? Would they require extra care when multiple particles are computed at once?

If a 2-3 line gap is not enough to separate the thread-private pixels (the ones that contain particles), how wide would the gap have to be to totally separate the dependencies of the threads?

If scan-lines work, then what about a checkerboard approach that computes the black squares first, then the white squares (or maybe just their interiors, if touching corners are an issue)?

Running 4-16 instances of TPT is fine on many computers. Even if the L2 caches of the other cores only store some temporary variables, that would be faster than relying on the big L3 cache alone, right? Is there a way to harness this extra cache space without too much code rewriting? Maybe by changing the current core on the fly (affinity) depending on a particle's id, the row number of its pixel, or something else? Maybe even separating the pressure & heat data into different caches would give some speedup (by reducing cache thrashing)?
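For the affinity part, I imagine something like this hypothetical sketch (Linux-only via `pthread_setaffinity_np`; Windows would need `SetThreadAffinityMask`, and `workOnStripe()` is a placeholder):

```cpp
// Pin each worker thread to a fixed core so that core's private L2 keeps
// "its" rows of the simulation warm instead of everything bouncing
// through the shared L3.
#include <algorithm>
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

void workOnStripe(int stripe) { /* hypothetical: always the same rows */ }

void pinToCore(std::thread &th, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(th.native_handle(), sizeof(set), &set);
}

int main()
{
    const int n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (int t = 0; t < n; ++t)
    {
        // The same thread always gets the same stripe, and the same core
        // always runs the same thread, so each stripe's data tends to
        // stay resident in one core's L2.
        workers.emplace_back(workOnStripe, t);
        pinToCore(workers.back(), t);
    }
    for (auto &th : workers)
        th.join();
}
```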

Far-away particles can often require or change the data of other particles a long way away, even across the entire screen. I'm not entirely sure what this is about though, so I'll just say that much and leave the rest to others.

jacob1 commented

Noita does something like this; there is a developers' conference talk they gave about it. Essentially it is your checkerboard idea, but you have to split it into 4 phases instead of 2, because the corners touch. The size of each checkerboard tile would need some thought: the update functions of most elements scan at most two pixels away, but the exceptions to this are extreme (as mentioned above). Some can literally write across the entire screen, so to accommodate that there needs to be a separation between "close" update functions and, probably, a single-threaded loop for the exceptions at the end of the simulation update.
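As a rough illustration of that four-phase order (`TILE`, `updateTile()` and the resolution constants here are placeholders, not actual TPT code):

```cpp
// Four-phase checkerboard sketch. Tiles in the same phase share neither
// an edge nor a corner, so they can be updated concurrently, provided
// nothing reaches beyond an adjacent tile.
#include <thread>
#include <vector>

constexpr int XRES = 612, YRES = 384; // assumed simulation size
constexpr int TILE = 64;              // must exceed the max local range
constexpr int TX = (XRES + TILE - 1) / TILE;
constexpr int TY = (YRES + TILE - 1) / TILE;

void updateTile(int tx, int ty) { /* hypothetical: update one tile */ }

void updateFrame()
{
    // Phase p picks tiles with (tx % 2, ty % 2) == (p % 2, p / 2); any
    // two tiles in the same phase are at least one full tile apart, even
    // diagonally -- which a 2-phase black/white split cannot guarantee.
    for (int p = 0; p < 4; ++p)
    {
        const int px = p % 2, py = p / 2;
        std::vector<std::thread> pool;
        for (int ty = py; ty < TY; ty += 2)   // one thread per tile row
            pool.emplace_back([=]
            {
                for (int tx = px; tx < TX; tx += 2)
                    updateTile(tx, ty);
            });
        for (auto &th : pool)
            th.join();
    }
}

int main() { updateFrame(); }
```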
Then you get into movement, which is just as crazy. In the normal case powders can move just one pixel per frame, but once you consider the air simulation and velocity, the velocity cap is currently 1000. That would need to be decreased to half the size of a checkerboard tile, or else high-velocity particles would have to be simulated separately. Then you have liquids ... which can move 30 pixels per frame even with no velocity. It might be a good idea to revisit that.

So ultimately, this idea can't be implemented without compromises. I prefer to compromise by preserving existing behavior in separate loops that run after the main simulation pass. I think this would still give huge performance benefits, as most particles have standard behavior and movement most of the time. Alternatively, you could compromise by reducing some of these extreme movement and update options, splitting elements into separate loops only where absolutely necessary (just a few particles like *RAY, and only when they shoot too far). The impact on existing simulations might be small, and it would ensure the best performance in all situations.
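A rough sketch of that split might look like this (every name here is hypothetical, and the half-tile velocity criterion is just the cap constraint mentioned above):

```cpp
// Particles whose behavior stays tile-local run in the parallel pass;
// anything that would reach beyond its tile (a *RAY-style element, or
// velocity above half a tile) is deferred to one serial pass that keeps
// the original full-screen semantics.
#include <cmath>
#include <cstddef>
#include <mutex>
#include <vector>

constexpr float TILE = 64.0f;   // assumed checkerboard tile size

struct Particle { int type; float vx, vy; };

bool isLongRange(const Particle &p)
{
    // Assumed criterion: fast enough to cross into a non-adjacent tile
    // this frame. A real version would also flag elements like *RAY.
    return std::hypot(p.vx, p.vy) > TILE / 2;
}

void updateLocal(Particle &)  { /* safe inside one tile */ }
void updateSerial(Particle &) { /* full-range, original behavior */ }

std::vector<Particle> parts;
std::vector<std::size_t> deferred;  // filled during the parallel pass
std::mutex deferredLock;

// Called for each particle from the checkerboard worker threads.
void parallelUpdate(std::size_t i)
{
    if (isLongRange(parts[i]))
    {
        std::lock_guard<std::mutex> g(deferredLock);
        deferred.push_back(i);      // punt to the serial pass
    }
    else
        updateLocal(parts[i]);
}

// Runs on one thread after all workers have joined, so the exceptional
// particles still see the whole screen exactly as before.
void serialPass()
{
    for (std::size_t i : deferred)
        updateSerial(parts[i]);
    deferred.clear();
}

int main() {}
```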

Can't comment much on the cache stuff, as that's not my area of expertise. We recently got a PR that gave an 8% performance boost while barely changing any code. So yes, you can definitely improve performance with small changes; I just couldn't say what those changes are.

Just some of my thoughts ... I'd love it if someone tried to implement some of these ideas, as they are not new; I've been thinking about them for years. I just don't have the time to look into the necessary restructuring. But if there's anything I've learned these past few years, it's that with just a bit of effort you can accomplish even things thought impossible. Given enough time, I will eventually try to multithread TPT.