parallel-runtimes/lomp

Support tree reductions

The runtime currently supports atomic reductions and reductions that use a critical section, but both are effectively serial, since every thread contends for access to the same reduction target buffer. The compiler already generates code that should allow a reduction to be performed up the tree at a tree barrier. That lets reduction operations happen concurrently in separate sub-trees, and should therefore perform better for large reductions.

We should add code to support this.

The main complexity here is likely understanding the compiler interface!
This will probably also need some small changes in the barrier implementation(s), since the reduction needs to happen in non-leaf threads at the point where they see that a child thread has checked in, but before they pass the "we're all here" message up the tree. (Ideally, each thread's contribution can be accumulated as it arrives.)
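
To make the intended ordering concrete, here is a minimal sketch of the check-in half of a binary tree barrier with the reduction folded in. Everything in it is hypothetical and for illustration only: `TreeNode`, `checkInAndReduce`, and `treeReduceSum` are not names from the runtime, the reduction is a hard-coded `long` sum rather than a call through the compiler-generated reduction function, and the release/wake-up half of the barrier is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-thread barrier node; not the runtime's actual data structure.
struct TreeNode {
  std::atomic<bool> arrived{false}; // "checked in" flag, polled by the parent
  long partial = 0;                 // accumulated value for this sub-tree
};

// Check-in phase for thread `me` in an implicit binary tree
// (children of node i are 2*i+1 and 2*i+2).
void checkInAndReduce(std::vector<TreeNode> &nodes, std::size_t me,
                      long contribution) {
  assert(me < nodes.size());
  TreeNode &self = nodes[me];
  self.partial = contribution;
  for (std::size_t child = 2 * me + 1;
       child <= 2 * me + 2 && child < nodes.size(); ++child) {
    // Wait until the child has checked in. The acquire load pairs with the
    // child's release store, so the child's partial is visible afterwards.
    while (!nodes[child].arrived.load(std::memory_order_acquire))
      std::this_thread::yield();
    // Combine the child's sub-tree result before signalling our own parent.
    self.partial += nodes[child].partial;
  }
  // Only now pass the "we're all here" message up the tree.
  self.arrived.store(true, std::memory_order_release);
}

// Test driver: one thread per contribution; the root (node 0) ends up
// holding the full reduction.
long treeReduceSum(const std::vector<long> &contributions) {
  std::vector<TreeNode> nodes(contributions.size());
  std::vector<std::thread> team;
  for (std::size_t i = 0; i < nodes.size(); ++i)
    team.emplace_back(checkInAndReduce, std::ref(nodes), i, contributions[i]);
  for (auto &t : team)
    t.join();
  return nodes.empty() ? 0 : nodes[0].partial;
}
```

Note that this sketch waits for the children in a fixed order, which keeps the combination order deterministic but does not accumulate each contribution as soon as it arrives. Doing that would need the parent to poll all of its children, or the children to push their values into the parent.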