[Requirement]: The RISC-V RVV vector intrinsic must include support for vector groups in the __riscv_vfredosum function
Popebl opened this issue · 4 comments
Dear all:
I aim to achieve high performance on RISC-V RVV hardware by using vector groups, e.g. setting LMUL to 4.
Unfortunately, I see a significant performance loss because the __riscv_vfredosum intrinsic does not support vector groups.
For instance, consider the following source code and its corresponding assembly code.
vfloat32m4_t x_vec;
vfloat32m4_t x_forward_vec;
vfloat32m4_t temp_vec;
/**
 * I have to use m1 here, which complicates the intrinsic code.
 */
vfloat32m1_t dst_vec;
vfloat32m1_t src_vec;
float result = 0.0f;
float shift_prev = 0.0f;
size_t n = 64;
for (size_t vl; n > 0; n -= vl) {
    vl = __riscv_vsetvl_e32m4(n); // LMUL=4
    x_vec = __riscv_vle32_v_f32m4(&x[0], vl);
    x_forward_vec = __riscv_vle32_v_f32m4(&x[0], vl);
    temp_vec = __riscv_vfmul_vv_f32m4(x_vec, x_forward_vec, vl);
    /**
     * I have to use m1 here, which complicates the intrinsic code.
     */
    dst_vec = __riscv_vfmv_s_f_f32m1_tu(dst_vec, 0.0f, vl); // clear vd for vfredosum
    // vfloat32m1_t __riscv_vfredosum_tu(vfloat32m1_t vd, vfloat32m4_t vs2, vfloat32m1_t vs1, size_t vl);
    dst_vec = __riscv_vfredosum_tu(dst_vec, temp_vec, src_vec, vl);
    r[0] = __riscv_vfmv_f_s_f32m1_f32(dst_vec);
}
00000000800000f0 <foo_vec.constprop.0>:
800000f0: 04000713 li a4,64
800000f4: 82018693 add a3,gp,-2016 # 8000c020 <foo_x>
800000f8: 0c0777d7 vsetvli a5,a4,e8,m1,ta,ma
800000fc: cd087057 vsetivli zero,16,e32,m1,ta,ma
80000100: 5e003157 vmv.v.i v2,0
80000104: 0907f057 vsetvli zero,a5,e32,m1,tu,ma
80000108: 420060d7 vmv.s.x v1,zero
8000010c: 0927f057 vsetvli zero,a5,e32,m4,tu,ma
80000110: 0206e207 vle32.v v4,(a3)
80000114: 92421257 vfmul.vv v4,v4,v4
80000118: 0e4110d7 vfredosum.vs v1,v4,v2
8000011c: 421017d7 vfmv.f.s fa5,v1
80000120: 40f6a027 fsw fa5,1024(a3)
80000124: 8f1d sub a4,a4,a5
80000126: fb69 bnez a4,800000f8 <foo_vec.constprop.0+0x8>
80000128: 8082 ret
The compiler adds two additional vsetvli/vsetivli instructions just to set up the __riscv_vfredosum_tu call. Based on our research, these extra vsetvli instructions significantly reduce performance.
[Suggestion]: If the intrinsic supported 'vfloat32m4_t __riscv_vfredosum(vfloat32m4_t vs2, vfloat32m4_t vs1, size_t vl)', there would be no need to insert extra vsetvli instructions in the code generated by the compiler. This change could lead to higher performance in RISC-V RVV implementations.
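For illustration, here is a sketch of what the loop body could look like under this proposal. The all-m4 __riscv_vfredosum overload below is hypothetical and does not exist in the current intrinsics; __riscv_vfmv_v_f_f32m4 and __riscv_vfmv_f_s_f32m4_f32 are existing intrinsics:
// Hypothetical: every operand stays at LMUL=4, so the compiler never
// needs to toggle vtype between m1 and m4 inside the loop.
vfloat32m4_t src_m4 = __riscv_vfmv_v_f_f32m4(0.0f, vl);        // identity element
vfloat32m4_t dst_m4 = __riscv_vfredosum(temp_vec, src_m4, vl); // hypothetical overload
r[0] = __riscv_vfmv_f_s_f32m4_f32(dst_m4);                     // vfmv.f.s reads element 0 at any LMUL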
Thanks very much!
Is the initialization of src_vec missing from your C code? It seems to be initialized with a vmv.v.i in the assembly. It could be initialized with vmv.s.x instead, since only element 0 is used.
The compiler's vsetvli insertion pass could be taught that vmv.s.x doesn't care about LMUL and can share the same vsetvli as the vfredosum. I believe this is being worked on in the LLVM compiler.
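For example, a minimal sketch of such an initialization, hoisted above the loop (a vl of 1 suffices because only element 0 is written):
// Lowers to a single scalar move (vfmv.s.f, or vmv.s.x for a zero constant);
// vfredosum only reads element 0 of vs1.
vfloat32m1_t src_vec = __riscv_vfmv_s_f_f32m1(0.0f, 1);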
After adding the initialization of src_vec using __riscv_vfmv_s_tu, there are still three vsetvli instructions remaining.
__attribute__((noinline)) static void foo_vec(float *r, const float *x)
{
    vfloat32m4_t x_vec;
    vfloat32m4_t x_forward_vec;
    vfloat32m4_t temp_vec;
    /**
     * I have to use m1 here, which complicates the intrinsic code.
     */
    vfloat32m1_t dst_vec;
    vfloat32m1_t src_vec;
    float result = 0.0f;
    float shift_prev = 0.0f;
    size_t n = 64;
    for (size_t vl; n > 0; n -= vl) {
        vl = __riscv_vsetvl_e32m4(n); // LMUL=4
        x_vec = __riscv_vle32_v_f32m4(&x[0], vl);
        x_forward_vec = __riscv_vle32_v_f32m4(&x[0], vl);
        temp_vec = __riscv_vfmul_vv_f32m4(x_vec, x_forward_vec, vl);
        /**
         * I have to use m1 here, which complicates the intrinsic code.
         */
        // vfloat32m1_t __riscv_vfmv_s_tu(vfloat32m1_t vd, float rs1, size_t vl);
        src_vec = __riscv_vfmv_s_tu(src_vec, 0.0f, vl); // initialize src_vec
        // dst_vec = __riscv_vfmv_s_f_f32m1_tu(dst_vec, 0.0f, vl); // previous non-overloaded form
        dst_vec = __riscv_vfmv_s_tu(dst_vec, 0.0f, vl); // clear vd for vfredosum
        dst_vec = __riscv_vfredosum_tu(dst_vec, temp_vec, src_vec, vl);
        r[0] = __riscv_vfmv_f_s_f32m1_f32(dst_vec);
    }
}
00000000800000f0 <foo_vec.constprop.0>:
800000f0: 04000713 li a4,64
800000f4: 82018693 add a3,gp,-2016 # 8000c020 <foo_x>
800000f8: 0c0777d7 vsetvli a5,a4,e8,m1,ta,ma
800000fc: 0907f057 vsetvli zero,a5,e32,m1,tu,ma
80000100: 42006157 vmv.s.x v2,zero
80000104: 420060d7 vmv.s.x v1,zero
80000108: 0927f057 vsetvli zero,a5,e32,m4,tu,ma
8000010c: 0206e207 vle32.v v4,(a3)
80000110: 92421257 vfmul.vv v4,v4,v4
80000114: 0e4110d7 vfredosum.vs v1,v4,v2
80000118: 421017d7 vfmv.f.s fa5,v1
8000011c: 40f6a027 fsw fa5,1024(a3)
80000120: 8f1d sub a4,a4,a5
80000122: fb79 bnez a4,800000f8 <foo_vec.constprop.0+0x8>
80000124: 8082 ret
No, we don't need such an intrinsic.
This is a compiler issue.
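In the meantime, a possible source-level workaround is to hoist the loop-invariant m1 initializations out of the loop, so that only the e32,m4 vsetvli remains inside it. A sketch using the variables from the example above (vfredosum.vs and vfmv.f.s execute fine under the m4 vtype, since their vd/vs1 operands are always a single register):
vfloat32m1_t src_vec = __riscv_vfmv_s_f_f32m1(0.0f, 1); // identity element, set once
vfloat32m1_t dst_vec = src_vec;
for (size_t vl; n > 0; n -= vl) {
    vl = __riscv_vsetvl_e32m4(n); // the only vsetvli left in the loop body
    vfloat32m4_t x_vec = __riscv_vle32_v_f32m4(&x[0], vl);
    vfloat32m4_t temp_vec = __riscv_vfmul_vv_f32m4(x_vec, x_vec, vl);
    dst_vec = __riscv_vfredosum_tu(dst_vec, temp_vec, src_vec, vl);
    r[0] = __riscv_vfmv_f_s_f32m1_f32(dst_vec);
}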
I have confirmed there is a regression in GCC:
https://godbolt.org/z/vocK8cee4
GCC 13 is able to generate an optimal vsetvli sequence, whereas GCC trunk doesn't.
I have filed a PR for it to recover the performance:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112776
Bug fixed by https://gcc.gnu.org/pipermail/gcc-patches/2023-December/638850.html
Thanks very much!