LLNL/ExaCA

Replace deep_copy in steering vector creation subroutine with parallel_reduce

Closed this issue · 5 comments


deep_copy is needed here to avoid memory access errors

If I remember correctly, the idea here was to remove the count View entirely. Instead, the reduction can return a scalar that is then used as the bound for the next loop.

If we did this, we would need to split the work into separate kernels: the count by itself, then the steering vector fill, then the cell capture. So we wouldn't actually reduce the number of kernel launches, but it may still be faster without the atomic and with each kernel being simpler.
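To make the two options concrete, here is a rough sketch in plain standard C++ rather than Kokkos. It is not ExaCA code: `is_active`, `fill_with_atomic`, `count_active`, and `fill_after_count` are hypothetical names, the loops are serial stand-ins for kernels, and the "active cell" test is an assumed placeholder. Variant A reserves steering vector slots through an atomic counter in a single pass (the count would then be copied back to the host). Variant B first reduces to a scalar count and fills in a second pass.

```cpp
#include <atomic>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the per-cell check that decides whether a cell
// belongs in the steering vector (assumption: type 1 means "active").
inline bool is_active(int cell_type) { return cell_type == 1; }

// Variant A: one pass. Each active cell reserves a slot with an atomic
// fetch-add; the final counter value is the number of entries written.
// `steering` must be at least as long as `cell_type`.
std::size_t fill_with_atomic(const std::vector<int>& cell_type,
                             std::vector<int>& steering) {
    std::atomic<std::size_t> count{0};
    for (std::size_t i = 0; i < cell_type.size(); ++i) {
        if (is_active(cell_type[i])) {
            steering[count.fetch_add(1)] = static_cast<int>(i);
        }
    }
    return count.load();
}

// Variant B, pass 1: a pure reduction that returns the count as a scalar.
std::size_t count_active(const std::vector<int>& cell_type) {
    return std::accumulate(cell_type.begin(), cell_type.end(), std::size_t{0},
                           [](std::size_t acc, int t) {
                               return acc + (is_active(t) ? 1 : 0);
                           });
}

// Variant B, pass 2: fill a steering vector sized exactly by the count.
// The running index `next` is only safe because this loop is serial; a
// parallel version would still need an atomic or a prefix scan here.
std::vector<int> fill_after_count(const std::vector<int>& cell_type,
                                  std::size_t count) {
    std::vector<int> steering(count);
    std::size_t next = 0;
    for (std::size_t i = 0; i < cell_type.size(); ++i) {
        if (is_active(cell_type[i])) {
            steering[next++] = static_cast<int>(i);
        }
    }
    return steering;
}
```

Note that variant B removes the atomic from the count only. Assigning slots during the fill still needs some form of synchronization once the loop runs in parallel.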

The atomic operation in the parallel_reduce is needed regardless, and the overhead of the thread-safe addition in the parallel_reduce appears to be larger than the time required for the deep_copy at the end of the parallel_for.

Closing as won't do