Unnecessary NOPs between DPP instructions [HCC, HIP, LLVM]
misos1 opened this issue · 1 comments
According to https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ and "Vega" Instruction Set Architecture Reference Guide:
v_add_f32 v1, v0, v0 row_shr:1 bound_ctrl:0 // Instruction 1
v_add_f32 v1, v0, v1 row_shr:2 bound_ctrl:0 // Instruction 2
v_add_f32 v1, v0, v1 row_shr:3 bound_ctrl:0 // Instruction 3
If a previous VALU instruction modifies a VGPR read by DPP, two wait states are required. Note that this hazard affects only the operand that DPP reads. Consider instructions 2 and 3 in the example above; they consume the output from the previous VALU instruction by reading v1 . But DPP applies to v0 , and because v0 is unmodified, wait states are unnecessary.
But when I tried to implement this using hc::__amdgcn_move_dpp
I got two NOPs between each two of these instructions:
#include <hc.hpp>
int main()
{
hc::array_view<int> data(1);
parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
{
asm("s_nop 0");
int v0 = data[0], v1;
v1 = hc::__amdgcn_move_dpp(v0, 0x111, 0xF, 0xF, true) + v0;
v1 = hc::__amdgcn_move_dpp(v0, 0x112, 0xF, 0xF, true) + v1;
v1 = hc::__amdgcn_move_dpp(v0, 0x113, 0xF, 0xF, true) + v1;
data[i[0]] = v1;
});
return 0;
}
KMDUMPISA=1 hcc -hc main.cpp
dump-gfx900.isa:
v_add_u32_dpp v3, v2, v2 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
s_nop 1
v_add_u32_dpp v3, v2, v3 row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
s_nop 1
v_add_u32_dpp v2, v2, v3 row_shr:3 row_mask:0xf bank_mask:0xf bound_ctrl:0
These NOPs does not must be here because next instruction does not have "DPP dependency" from previous. DPP modifier here affects first source operand.
Ideally this should generate 3 instructions as shown above without NOPs.
Also it is interesting how generated code changes with small modifications to source:
Removing asm("s_nop 0");
:
parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
{
int v0 = data[0], v1;
v1 = hc::__amdgcn_move_dpp(v0, 0x111, 0xF, 0xF, true) + v0;
v1 = hc::__amdgcn_move_dpp(v0, 0x112, 0xF, 0xF, true) + v1;
v1 = hc::__amdgcn_move_dpp(v0, 0x113, 0xF, 0xF, true) + v1;
data[i[0]] = v1;
});
v_mov_b32_dpp v2, v2 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
v_add_u32_e32 v2, s6, v2
v_mov_b32_e32 v3, s1
s_nop 0
v_add_u32_dpp v2, v1, v2 row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
s_nop 1
v_add_u32_dpp v2, v1, v2 row_shr:3 row_mask:0xf bank_mask:0xf bound_ctrl:0
Changing 0 to i[0]:
parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
{
int v0 = data[i[0]], v1;
v1 = hc::__amdgcn_move_dpp(v0, 0x111, 0xF, 0xF, true) + v0;
v1 = hc::__amdgcn_move_dpp(v0, 0x112, 0xF, 0xF, true) + v1;
v1 = hc::__amdgcn_move_dpp(v0, 0x113, 0xF, 0xF, true) + v1;
data[i[0]] = v1;
});
v_mov_b32_e32 v4, v2
s_nop 0
v_add_u32_dpp v3, v2, v2 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
v_mov_b32_dpp v4, v4 row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
v_mov_b32_dpp v2, v2 row_shr:3 row_mask:0xf bank_mask:0xf bound_ctrl:0
v_add3_u32 v2, v3, v4, v2
This last has correctly only one NOP for v4 dependency so together with v_add_u32_dpp v3, v2, v2...
there are 2 independent instructions between modification to v4 v_mov_b32_e32 v4, v2
and DPP read v_mov_b32_dpp v4, v4...
. And uses add3
but this still looks worse than 3 v_add_dpp
.
Thank you for reporting this problem! We have opened an internal ticket to get it fixed.