
Possible performance optimization by precomputing + caching the effects of neighboring particle types or sorting the particles by their positions/types/neighbors.

tugrul512bit opened this issue · 2 comments

In the function Simulation::UpdateParticles(int start, int end), the code scans through the particles and, for each particle, scans its neighbor cells with a lot of if/else logic running in there.

Since only a few neighbors (about 8 of them) can affect the current particle, there are only something like 8 x number_of_particle_types different cases. Also, in the interior of chunks on screen, all neighbors are the same type, and around the edges where different elements meet there are only 2 element types. Neighborhoods with 8 different types per particle would be so scarce that they could easily be cached.

With caching for the extreme cases and simplifying for the 1-2 type cases, many of those if/else blocks could be moved out of the loop (if not eliminated).

For example,

    // if RAM can hold all possible behaviors
    behavior_cache[generate_behavior_index_from_neighbor_types[center_particle_x_y]].execute(some_parameters);

    // if RAM is not enough for all combinations but enough for the actual scene
    behavior_cache.get(generate_behavior_index(...)).execute(prm); // should still be faster with the limited number of scenarios in the scene

this should be able to get rid of many of the particle-type-related in-loop if/else conditions and directly execute their code bodies. behavior_cache could be something like an array of function pointers, or even dynamically generated DLLs per behavior that are loaded later (which may not be good if there are too many behaviors?).
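
Purely as an illustration of the function-pointer flavor of this idea, here is a minimal sketch. It assumes a pairwise (center type, neighbor type) key rather than a full 8-neighbor key, and every name in it (BehaviorFn, behavior_cache, the placeholder type ids) is invented for the example, not taken from the actual codebase:

    #include <array>

    struct Particle { int type; float temp; };

    constexpr int NUM_TYPES = 256; // placeholder, not the real element count

    using BehaviorFn = void (*)(Particle& self, Particle& neighbor);

    void sand_vs_water(Particle& self, Particle& neighbor)    { /* displacement logic would go here */ }
    void default_behavior(Particle& self, Particle& neighbor) { /* generic fallback */ }

    // One entry per (center type, neighbor type) pair, filled once at startup.
    std::array<BehaviorFn, NUM_TYPES * NUM_TYPES> behavior_cache;

    inline int behavior_index(int center_type, int neighbor_type)
    {
        return center_type * NUM_TYPES + neighbor_type;
    }

    void init_behavior_cache()
    {
        behavior_cache.fill(&default_behavior);
        behavior_cache[behavior_index(1, 2)] = &sand_vs_water; // 1 and 2 are made-up type ids
    }

    void update_pair(Particle& self, Particle& neighbor)
    {
        // one indirect call replaces a cascade of per-type if/else tests
        behavior_cache[behavior_index(self.type, neighbor.type)](self, neighbor);
    }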

But, I'm not sure if this gives enough performance to give people enough "inspiration" to start re-writing a lot of things.

(A bigger) but: maybe someone could isolate just a demo case like "sand vs 8 waters" in a dense scene with a lot of sand and water, put in just an if/else for that one pairing, benchmark it, and extrapolate the total gain from caching all element types from that.

Another example,

    sort(particle_indices_on(their_x_y_positions))
    process(particles_in_that_order);

if the particles' processing order is sorted by position, the system should automatically benefit from the CPU's L1 cache, because each particle tends to read the data of already-computed nearby particles instead of a randomly placed address in RAM.
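
As a minimal sketch of that kind of positional ordering (Particle and process() here are stand-ins, not the real parts[] layout; a Morton/Z-order key would group 2D neighbors even more tightly than the simple y-major key below):

    #include <algorithm>
    #include <vector>

    struct Particle { int x, y, type; };

    void process(Particle& p) { /* per-particle update would go here */ }

    // Reorder the particle indices so that particles close together on screen
    // are also close together in processing order.
    void process_in_positional_order(std::vector<Particle>& parts, int width)
    {
        std::vector<int> order(parts.size());
        for (int i = 0; i < (int)order.size(); ++i)
            order[i] = i;

        std::sort(order.begin(), order.end(), [&](int a, int b) {
            return parts[a].y * width + parts[a].x < parts[b].y * width + parts[b].x;
        });

        for (int id : order)
            process(parts[id]);
    }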

So, if I understand the simulation correctly, the serial processing order is the "initialization order" of the particles, and since particles can end up anywhere because of chemical reactions or explosions, the actual "positional" processing order of pixels is effectively random each frame. Why not make this per-position instead?

Applying a grid of cells and putting particle id values into their own grid cell (O(N) complexity) should make it easy to scan the whole screen in a cache-friendly way, so that even without any behavior caching the CPU's branch predictors would find some similarity between neighboring processing iterations.
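
A rough sketch of that O(N) binning pass, assuming coarse cells that each hold a list of particle ids (the cell size and all names are made up for illustration):

    #include <vector>

    struct Particle { int x, y, type; };

    void process(Particle& p) { /* per-particle update would go here */ }

    constexpr int CELL = 16; // hypothetical cell size in pixels

    // One O(N) pass bins particle ids by coarse grid cell, then a row-by-row scan
    // over the cells visits nearby particles back to back.
    void process_by_grid(std::vector<Particle>& parts, int width, int height)
    {
        const int cw = (width + CELL - 1) / CELL;
        const int ch = (height + CELL - 1) / CELL;
        std::vector<std::vector<int>> cells(cw * ch);

        for (int i = 0; i < (int)parts.size(); ++i)
            cells[(parts[i].y / CELL) * cw + parts[i].x / CELL].push_back(i);

        for (int cy = 0; cy < ch; ++cy)
            for (int cx = 0; cx < cw; ++cx)
                for (int id : cells[cy * cw + cx])
                    process(parts[id]);
    }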

Yet another example,

    binning(particle_indices_on(their_types))
    process_sand(particle_indices_of_sand)
    process_water(particle_indices_of_water)
    ...

binning particles into type buckets would get rid of a lot of branching, like the if ((elements[t].Properties ... checks found in the nested loop?
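
To make that concrete, a hedged sketch of what such type binning could look like (the type ids and the per-type functions mirror the pseudocode above and are placeholders, not the real element update code):

    #include <vector>

    struct Particle { int x, y, type; };

    enum { TYPE_SAND = 0, TYPE_WATER = 1, NUM_TYPES = 2 }; // made-up type ids

    void process_sand(std::vector<Particle>& parts, const std::vector<int>& ids)  { /* sand-only logic */ }
    void process_water(std::vector<Particle>& parts, const std::vector<int>& ids) { /* water-only logic */ }

    void process_by_type(std::vector<Particle>& parts)
    {
        std::vector<std::vector<int>> buckets(NUM_TYPES);
        for (int i = 0; i < (int)parts.size(); ++i) // one O(N) pass to bin ids by type
            buckets[parts[i].type].push_back(i);

        process_sand(parts, buckets[TYPE_SAND]);    // the type is known up front,
        process_water(parts, buckets[TYPE_WATER]);  // so there is no per-particle type branch
    }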

Extreme example,

    sort(particle_indices_on(their_neighbor_combination))
    iterate over all cases found in scene
        compute_this_case(...)
        compute_that_case(...)

imo, this should get rid of all the if/else altogether, but the code would be a lot less readable.
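
One possible way to derive such a neighbor-combination key, purely as a sketch (a flat type map and 7 bits per type are assumed; bounds checks at the screen edges are omitted):

    #include <cstdint>

    constexpr int W = 640, H = 400; // illustrative grid size
    int type_map[H][W];             // 0 = empty, otherwise an element type id

    // Pack the center type and its 8 neighbors into one 64-bit key (9 x 7 bits,
    // assuming fewer than 128 types), so particles with identical neighborhoods
    // can be sorted/grouped and handled case by case.
    uint64_t neighborhood_key(int x, int y)
    {
        uint64_t key = type_map[y][x] & 0x7F;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                if (dx || dy)
                    key = (key << 7) | (uint64_t)(type_map[y + dy][x + dx] & 0x7F);
        return key;
    }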

Since the project is made of a lot of ifs and elses, SIMD optimization is already too hard to benefit from, and the number of cores also doesn't matter much, except maybe for the heat & pressure simulations (whose latency is not that much bigger than the particle update's). So, helping the CPU's branch predictors by caching should give it a boost, at the cost of readability & coding time.

What do you think?

Your first suggestion is something I tend to call "reaction kernels", and yes,

> [...] I'm not sure if this gives enough performance to give people enough "inspiration" to start re-writing a lot of things.

this is the main reason why nobody has tried to apply the idea. The current codebase is too much of an evolutionary cul-de-sac for something like this to be easily evaluated.


Your second suggestion is to exploit spatial locality better by

> Applying a grid of cells and putting particle id values into their own grid cell [...]

The most prominent reason why this is not what the current implementation does is that multiple particles can in fact occupy the same grid cell. This is what we call "stacking", which we tend to say is "not supported", but really what we mean is that the weirdness caused by it is not considered buggy behaviour. Energy particles, such as PHOT, stack naturally. Years ago, this tipped the scale in the current simpler solution's favour.

If we wanted to evaluate the performance benefit of such a solution, we could keep the current by-id-indexed array, use it as overflow space for particles stacked under other particles, and keep the topmost layer in a by-position-indexed array. I've tried doing this once but gave up due to time constraints and, again, the general state of the codebase.
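
Sketched out, such a hybrid layout might look roughly like this (field and type names are invented for illustration, not taken from the codebase):

    #include <vector>

    struct Particle { int x, y, type; };

    // Hypothetical hybrid storage: the topmost particle of each pixel lives in a
    // by-position index, stacked particles underneath stay only in the by-id array.
    struct HybridStore
    {
        int width = 0, height = 0;
        std::vector<int> top_by_pos;       // width * height entries: id of the topmost particle, or -1
        std::vector<Particle> parts_by_id; // canonical by-id storage, doubles as overflow space

        int topmost_at(int x, int y) const { return top_by_pos[y * width + x]; }
    };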

I also vaguely remember there being some design issue that complicated matters further than I'd expected. I think it had to do with particles being updated multiple times in a single frame if they were moved from a lower index to a higher index by a "middle index" particle. This doesn't happen if traversal is by-id because then update order is not dependent on position.


Your third suggestion is binning, but seeing this:

> binning particles into type buckets would get rid of a lot of branching, like the if ((elements[t].Properties ... checks found in the nested loop?

it's still unclear to me how exactly you'd derive the bucket indices, but I realize that your comment was not meant to provide a concrete solution. In any case, I'd expect data cache friendliness to have a greater effect than code cache friendliness (which is what binning of this sort is supposed to improve, as far as I understand), among other things because the update code sprawls, so it's a sort of lost cause. UpdateParticles is not at all the entire story; individual elements also do a lot of these types of checks. I'm not sure how one would bin those.

By the binning example, I meant grouping id values together in their own arrays, such as getting all water particles together and computing just the water-related logic for them (not for their neighbors). It's more of a "gather" type of process. Maybe a "scatter" type would work too, like getting all waters but computing the neighbors instead. Not sure if this makes the "bias" from serial processing any better, though.

For the individual elements' updates, could a queue system help? Maybe a queue of effects per particle, appended to by each neighbor and then processed separately.
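
One possible shape for such an effect queue, purely speculative (EffectKind and the payload fields are invented):

    #include <vector>

    struct Particle { int type; float temp, vx, vy; };

    // Hypothetical per-particle effect record: neighbors queue effects during the
    // scan, and the queue is applied in a separate pass afterwards.
    enum class EffectKind { AddHeat, Push, ChangeType };

    struct Effect
    {
        int target;     // particle id the effect applies to
        EffectKind kind;
        float a, b;     // payload, meaning depends on kind
    };

    void apply_effects(std::vector<Particle>& parts, std::vector<Effect>& queue)
    {
        for (const Effect& e : queue)
        {
            Particle& p = parts[e.target];
            switch (e.kind)
            {
            case EffectKind::AddHeat:    p.temp += e.a;            break;
            case EffectKind::Push:       p.vx += e.a; p.vy += e.b; break;
            case EffectKind::ChangeType: p.type = (int)e.a;        break;
            }
        }
        queue.clear();
    }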