ROCm/hcc

Unnecessary NOPs between DPP instructions [HCC, HIP, LLVM]

misos1 opened this issue · 1 comments

According to https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ and "Vega" Instruction Set Architecture Reference Guide:

v_add_f32 v1, v0, v0 row_shr:1 bound_ctrl:0 // Instruction 1
v_add_f32 v1, v0, v1 row_shr:2 bound_ctrl:0 // Instruction 2
v_add_f32 v1, v0, v1 row_shr:3 bound_ctrl:0 // Instruction 3

If a previous VALU instruction modifies a VGPR read by DPP, two wait states are required. Note that this hazard affects only the operand that DPP reads. Consider instructions 2 and 3 in the example above; they consume the output from the previous VALU instruction by reading v1 . But DPP applies to v0 , and because v0 is unmodified, wait states are unnecessary.

But when I tried to implement this using hc::__amdgcn_move_dpp I got two NOPs between each two of these instructions:

#include <hc.hpp>
int main()
{
	hc::array_view<int> data(1);
	parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
	{
		asm("s_nop 0");
		int v0 = data[0], v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x111, 0xF, 0xF, true) + v0;
		v1 = hc::__amdgcn_move_dpp(v0, 0x112, 0xF, 0xF, true) + v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x113, 0xF, 0xF, true) + v1;
		data[i[0]] = v1;
	});
	return 0;
}
KMDUMPISA=1 hcc -hc main.cpp

dump-gfx900.isa:

	v_add_u32_dpp v3, v2, v2  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
	s_nop 1
	v_add_u32_dpp v3, v2, v3  row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
	s_nop 1
	v_add_u32_dpp v2, v2, v3  row_shr:3 row_mask:0xf bank_mask:0xf bound_ctrl:0

These NOPs does not must be here because next instruction does not have "DPP dependency" from previous. DPP modifier here affects first source operand.

Ideally this should generate 3 instructions as shown above without NOPs.

Also it is interesting how generated code changes with small modifications to source:

Removing asm("s_nop 0");:

	parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
	{
		int v0 = data[0], v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x111, 0xF, 0xF, true) + v0;
		v1 = hc::__amdgcn_move_dpp(v0, 0x112, 0xF, 0xF, true) + v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x113, 0xF, 0xF, true) + v1;
		data[i[0]] = v1;
	});
	v_mov_b32_dpp v2, v2  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
	v_add_u32_e32 v2, s6, v2
	v_mov_b32_e32 v3, s1
	s_nop 0
	v_add_u32_dpp v2, v1, v2  row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
	s_nop 1
	v_add_u32_dpp v2, v1, v2  row_shr:3 row_mask:0xf bank_mask:0xf bound_ctrl:0

Changing 0 to i[0]:

	parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
	{
		int v0 = data[i[0]], v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x111, 0xF, 0xF, true) + v0;
		v1 = hc::__amdgcn_move_dpp(v0, 0x112, 0xF, 0xF, true) + v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x113, 0xF, 0xF, true) + v1;
		data[i[0]] = v1;
	});
	v_mov_b32_e32 v4, v2
	s_nop 0
	v_add_u32_dpp v3, v2, v2  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
	v_mov_b32_dpp v4, v4  row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
	v_mov_b32_dpp v2, v2  row_shr:3 row_mask:0xf bank_mask:0xf bound_ctrl:0
	v_add3_u32 v2, v3, v4, v2

This last has correctly only one NOP for v4 dependency so together with v_add_u32_dpp v3, v2, v2... there are 2 independent instructions between modification to v4 v_mov_b32_e32 v4, v2 and DPP read v_mov_b32_dpp v4, v4.... And uses add3 but this still looks worse than 3 v_add_dpp.

Thank you for reporting this problem! We have opened an internal ticket to get it fixed.