microsoft/nnfusion

[Question] How to implement the barrier-rTask in generated code

zqj2333 opened this issue · 2 comments

I have generated code for the six models mentioned in the paper (RAMMER, Figure 11) with nnfusion on the "osdi20_artifact" branch. I took a look at the generated code and found that there seems to be no code implementing the barrier-rTask mentioned in the paper,
such as:
"step array",
"each rTask use its first thread to increase step array",
"barrier-rTask use its first N thread to poll on the corresponding elements in the step array".

So I want to know how the barrier-rTask is reflected in the generated code.

Thanks for your response!

Hi, the block-level barrier-rTask can be enabled by setting -fblockfusion_level=2. It is implemented for CUDA and ROCm here and here. Because there is still a TODO item to automatically detect the active thread blocks and so avoid deadlock, the block-level barrier-rTask is not enabled by default. You may therefore need to keep track of the active thread blocks manually to avoid deadlock. The current -fblockfusion_level=1 implementation relies on global kernel launches for barriers.
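
For readers who want a feel for the step-array mechanism quoted in the question, here is a minimal CPU-side analogy in C++. It is only a sketch and is not the code NNFusion generates: each `std::thread` stands in for one rTask, and `run_step_barrier`, `step`, and `partial` are hypothetical names chosen for illustration. Every worker finishes its work and then increments its own element of the step array; the barrier task polls the first N elements until each has advanced, and only then reads the workers' results.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical CPU analogy of a step-array barrier (not NNFusion's code).
// Worker t computes partial[t] = t + 1, then bumps step[t].
// The barrier task polls step[0..num_tasks) before summing the partials.
int run_step_barrier(std::size_t num_tasks) {
    std::vector<std::atomic<int>> step(num_tasks);  // the "step array"
    for (auto& s : step) s.store(0);                // explicit init (pre-C++20 safe)
    std::vector<int> partial(num_tasks, 0);
    int total = 0;

    // The barrier task: poll each step element until its producer has signalled.
    std::thread barrier([&] {
        for (std::size_t i = 0; i < num_tasks; ++i) {
            while (step[i].load(std::memory_order_acquire) < 1) {
                std::this_thread::yield();          // spin politely
            }
        }
        // All producers have published; their results are now safe to read.
        for (std::size_t i = 0; i < num_tasks; ++i) total += partial[i];
    });

    // The producer tasks: do the work, then increment the step element.
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_tasks; ++t) {
        workers.emplace_back([&, t] {
            partial[t] = static_cast<int>(t) + 1;             // the task's work
            step[t].fetch_add(1, std::memory_order_release);  // signal completion
        });
    }

    for (auto& w : workers) w.join();
    barrier.join();
    return total;
}
```

Calling `run_step_barrier(4)` returns 10, the sum of the four partial results. On a CPU the operating system preempts the polling thread, so the producers always get to run. On a GPU a polling thread block holds its resources until it exits, so if the blocks it waits on are not resident at the same time, the poll never ends; that is the deadlock the reply above warns about when using -fblockfusion_level=2.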

Hi,
Thank you for your response! I have understood it!