riscv-non-isa/rvv-intrinsic-doc

[Requirement]: The RISC-V RVV vector intrinsic must include support for vector groups in the __riscv_vfredosum function

Popebl opened this issue · 4 comments

Popebl commented

Dear all:
I aim to achieve high performance on RISC-V RVV hardware by utilizing vector groups, such as setting LMUL to 4.
Unfortunately, I experienced a significant loss in performance because the intrinsic does not support vector groups for __riscv_vfredosum.
For instance, consider the following source code and its corresponding assembly code.

vfloat32m4_t x_vec;
vfloat32m4_t x_forward_vec;
vfloat32m4_t temp_vec;

/**
 * I have to use m1 to complicat intrisic
 */
vfloat32m1_t dst_vec;
vfloat32m1_t src_vec;

float result = 0.0f;
float shift_prev = 0.0f;

size_t n = 64;
for(size_t vl; n>0; n -=vl){
    vl = __riscv_vsetvl_e32m4(n); //LMUL=4
    x_vec = __riscv_vle32_v_f32m4(&x[0], vl);
    x_forward_vec = __riscv_vle32_v_f32m4(&x[0], vl);
    temp_vec = __riscv_vfmul_vv_f32m4(x_vec, x_forward_vec, vl);

    /**
     * I have to use m1 to complicat intrisic
     */
    dst_vec = __riscv_vfmv_s_f_f32m1_tu(dst_vec, 0.0f, vl); //clean for vfredosum
    //vfloat32m1_t __riscv_vfredosum(vfloat32m4_t vs2, vfloat32m1_t vs1, size_t vl);
    dst_vec = __riscv_vfredosum_tu(dst_vec, temp_vec, src_vec, vl);

    r[0] = __riscv_vfmv_f_s_f32m1_f32(dst_vec);
}
00000000800000f0 <foo_vec.constprop.0>:
    800000f0:   04000713                li      a4,64
    800000f4:   82018693                add     a3,gp,-2016 # 8000c020 <foo_x>
    800000f8:   0c0777d7                vsetvli a5,a4,e8,m1,ta,ma
    800000fc:   cd087057                vsetivli        zero,16,e32,m1,ta,ma
    80000100:   5e003157                vmv.v.i v2,0
    80000104:   0907f057                vsetvli zero,a5,e32,m1,tu,ma
    80000108:   420060d7                vmv.s.x v1,zero
    8000010c:   0927f057                vsetvli zero,a5,e32,m4,tu,ma
    80000110:   0206e207                vle32.v v4,(a3)
    80000114:   92421257                vfmul.vv        v4,v4,v4
    80000118:   0e4110d7                vfredosum.vs    v1,v4,v2
    8000011c:   421017d7                vfmv.f.s        fa5,v1
    80000120:   40f6a027                fsw     fa5,1024(a3)
    80000124:   8f1d                    sub     a4,a4,a5
    80000126:   fb69                    bnez    a4,800000f8 <foo_vec.constprop.0+0x8>
    80000128:   8082                    ret

The compiler adds two additional vsetivli instructions to support the __riscv_vfredosum_tu function. Based on our research, the vsetvli instruction significantly reduces performance.

[Suggestion]: If the intrinsic supports 'vfloat32m4_t __riscv_vfredosum(vfloat32m4_t vs2, vfloat32m4_t vs1, size_t vl)', there would be no need to insert a vsetli instruction in the final code generated by the compiler. This change could lead to higher performance in RISC-V RVV implementations.

Thanks very much!

Is the initialization of src_vec missing from your C code. It seems to be initialized with a vmv.v.i in the assembly. It could be initialized with vmv.s.x since only element 0 is used.

The compiler's vsetvli insertion pass could be taught that vmv.s.x doesn't care about lmul and can share the same vsetvli as the vredosum. I believe this is being worked on in the LLVM compiler.

Popebl commented

After adding the initial src_vec using __riscv_vfmv_s_tu, there are still three vsetvli instructions remaining.

__attribute__((noinline)) static void foo_vec(float *r, const float *x)
{
int i, k;

vfloat32m4_t x_vec;
vfloat32m4_t x_forward_vec;
vfloat32m4_t temp_vec;

/**
 * I have to use m1 to complicat intrisic
 */
vfloat32m1_t dst_vec;
vfloat32m1_t src_vec;

float result = 0.0f;
float shift_prev = 0.0f;

size_t n = 64;
for(size_t vl; n>0; n -=vl){
    vl = __riscv_vsetvl_e32m4(n); //LMUL=4
    x_vec = __riscv_vle32_v_f32m4(&x[0], vl);
    x_forward_vec = __riscv_vle32_v_f32m4(&x[0], vl);
    temp_vec = __riscv_vfmul_vv_f32m4(x_vec, x_forward_vec, vl);

    /**
     * I have to use m1 to complicat intrisic
     */
    //vfloat32m1_t __riscv_vfmv_s_tu(vfloat32m1_t vd, float rs1, size_t vl);
    src_vec = __riscv_vfmv_s_tu(src_vec, 0.0f, vl);                 //initial src_vec
    //dst_vec = __riscv_vfmv_s_f_f32m1_tu(dst_vec, 0.0f, vl);         //clean for vfredosum
    dst_vec = __riscv_vfmv_s_tu(dst_vec, 0.0f, vl);         //clean for vfredosum
    dst_vec = __riscv_vfredosum_tu(dst_vec, temp_vec, src_vec, vl);

    r[0] = __riscv_vfmv_f_s_f32m1_f32(dst_vec);

   }
}
00000000800000f0 <foo_vec.constprop.0>:
    800000f0:   04000713                li      a4,64
    800000f4:   82018693                add     a3,gp,-2016 # 8000c020 <foo_x>
    800000f8:   0c0777d7                vsetvli a5,a4,e8,m1,ta,ma
    800000fc:   0907f057                vsetvli zero,a5,e32,m1,tu,ma
    80000100:   42006157                vmv.s.x v2,zero
    80000104:   420060d7                vmv.s.x v1,zero
    80000108:   0927f057                vsetvli zero,a5,e32,m4,tu,ma
    8000010c:   0206e207                vle32.v v4,(a3)
    80000110:   92421257                vfmul.vv        v4,v4,v4
    80000114:   0e4110d7                vfredosum.vs    v1,v4,v2
    80000118:   421017d7                vfmv.f.s        fa5,v1
    8000011c:   40f6a027                fsw     fa5,1024(a3)
    80000120:   8f1d                    sub     a4,a4,a5
    80000122:   fb79                    bnez    a4,800000f8 <foo_vec.constprop.0+0x8>
    80000124:   8082                    ret

No, we don't need such intrinsic.
This is compiler issue.

I have confirm there is a regression in GCC:
https://godbolt.org/z/vocK8cee4

GCC-13 is able to generate optimal vsetvls, wheras GCC trunk doesn't.
I have file a PR for it to recover back the performance:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112776