riscv-non-isa/rvv-intrinsic-doc

vget for fractional register doesn't exist

howjmay opened this issue · 10 comments

There seem to be no intrinsics vget that take a whole register and return a fractional register. Namely, intrinsics like vint8mf2 __riscv_vget_v_i8m1_i8mf2(vint8m1_t src, size_t index) don't exist. The same behavior can be achieved by trunc, and slidedown intrinsics, but it seems to be good to have them.

The intentions to have these intrinsics are

  1. It is weird that the integer LMUL has vget intrinsics, and the fractional ones don't have.
  2. The extra overhead caused by the slidedown here is strange.
  3. It is not instinct, so the RVV beginners may take some time to find the combination to achieve the behavior

The 2nd reason should not exist now, since according to @topperc

A vget for fractional would have to do the same slidedown in order to the elements into the lower elements of a register to meet up with later intrinsics.

I stand by my comments in the other ticket
riscv-non-isa/riscv-c-api-doc#54 (comment)
but let me address the intentions here directly.

  1. It is weird that the integer LMUL has vget intrinsics, and the fractional ones don't have.

This is a subjective comment. It is not weird to me. There is a fundamental difference between the behavior of the current (integer LMUL) and proposed (fractional LMUL) intrinsics: the current ones do not require any instructions to be generated, while the proposed ones will require data movement instructions (e.g., vslidedown).

  1. The extra overhead caused by the slidedown here is strange.

This is a subjective comment. It is not strange to me. This overhead is necessary due to the aforementioned fundamental difference.

  1. It is not instinct, so the RVV beginners may take some time to find the combination to achieve the behavior

This is a subjective comment. As someone with some RVV experience, the current approach makes more intuitive sense. It would be counterintuitive for me if some vget intrinsics had no performance overhead, while others did.

I don't feel terribly strongly about any of this. But I don't think this is sufficiently important to address in v1.0 of the API.

I understand (from the other issue) that this is arising from your Neon-to-RVV translation project. Instead of proposing a solution (new API), perhaps it would be more productive to explain the particular translation challenge you are facing.

dzaima commented

Also, note that for a NEON int8x16_t upper half to int8x8_t function you would not want __riscv_vget_v_i8m1_i8mf2 even if it existed - such an intrinsic would slide down by VLEN/16 elements, but you want to slide down by exactly 8 elements here; so it'd work correctly if VLEN==128, and not work anywhere else. Rather, you should use e.g. __riscv_vslidedown_vx_i8m1(x, 8, 16).

@nick-knight I agree with your points, so I prefer to close this issue.
No big challenges I am facing right now. I understand that I can use the combination of vslidedown to achieve the behavior (at the beginning I wasn't able to achieve it, since I was familiar to RVV enough). The overhead was the only practical issue, but you have explained that it is unavoidable.

I think I wrongly assume the behavior of fractional LMUL should be the same as integer ones

But I am curious where can I find this information. I am sorry that I didn't see it. I was trying to figure out understanding the difference between the usage and overhead of integer and fractional LMUL.

There is a fundamental difference between the behavior of the current (integer LMUL) and proposed (fractional LMUL) intrinsics: the current ones do not require any instructions to be generated, while the proposed ones will require data movement instructions (e.g., vslidedown).

Also, note that for a NEON int8x16_t upper half to int8x8_t function you would not want __riscv_vget_v_i8m1_i8mf2 even if it existed - such an intrinsic would slide down by VLEN/16 elements, but you want to slide down by exactly 8 elements here; so it'd work correctly if VLEN==128, and not work anywhere else. Rather, you should use e.g. __riscv_vslidedown_vx_i8m1(x, 8, 16).

Thank you for the suggestion!

dzaima commented

The current vget intrinsics work by just extracting a register group part - e.g. for __riscv_vget_v_i8m2_i8m1 the input is two whole registers, and the result is just the second register - so the operation is entirely free.

Whereas, for a hypothetical __riscv_vget_v_i8m1_i8mf2, the input is already a single register, and the result is "half" a register. In reality, though, it's a full register (there are no fractional registers in RVV) but with the upper half ignored. Thus, a __riscv_vget_v_i8m1_i8mf2(x, 1) wouldn't have any whole register to return.

But I am curious where can I find this information. I am sorry that I didn't see it. I was trying to figure out understanding the difference between the usage and overhead of integer and fractional LMUL.

I agree it's not clear in the documentation of this repo, and I support your effort to keep digging into this until you understand it. In general, to obtain a deeper understanding of the rationale behind certain RVV intrinsics API decisions, you'll need to consult the ISA spec:

https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc , which will eventually be merged into the ISA manual (riscv/riscv-isa-manual#1088)

The discussion most relevant to fractional LMUL appears in Sections 3.4.2 and 4.

The current vget intrinsics work by just extracting a register group part - e.g. for __riscv_vget_v_i8m2_i8m1 the input is two whole registers, and the result is just the second register - so the operation is entirely free.

Whereas, for a hypothetical __riscv_vget_v_i8m1_i8mf2, the input is already a single register, and the result is "half" a register. In reality, though, it's a full register (there are no fractional registers in RVV) but with the upper half ignored. Thus, a __riscv_vget_v_i8m1_i8mf2(x, 1) wouldn't have any whole register to return.

Thank you for explaining! this explain a lot

But I am curious where can I find this information. I am sorry that I didn't see it. I was trying to figure out understanding the difference between the usage and overhead of integer and fractional LMUL.

I agree it's not clear in the documentation of this repo, and I support your effort to keep digging into this until you understand it. In general, to obtain a deeper understanding of the rationale behind certain RVV intrinsics API decisions, you'll need to consult the ISA spec:

https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc , which will eventually be merged into the ISA manual (riscv/riscv-isa-manual#1088)

The discussion most relevant to fractional LMUL appears in Sections 3.4.2 and 4.

Thanks! I have referred this doc in my current implementation.