riscv-non-isa/riscv-toolchain-conventions

RV64 conventions for asm templates with 32-bit operands

cliffordwolf opened this issue · 15 comments

On RV64 32-bit values are canonically stored sign-extended in a 64-bit register. This poses a challenge for code utilizing assembler templates, because the compiler may end up adding unnecessary sext.w instructions to the generated code.

The current behavior in GCC seems to be to not bother with keeping this invariant for input operands to assembler templates, and to assume that the assembler template doesn't output 32-bit values in sign extended form.

For example, consider the following experiment:

int test(long a, long b) {
        int x, y = a^b;
        __asm__ ("foobar %0,%0" : "=r"(x) : "r"(y));
        return x;
}

This compiles to the following asm code (with riscv64-unknown-elf-gcc -Os -S -o - asm.c):

test:
	xor	a0,a0,a1
	foobar a0,a0
	sext.w	a0,a0
	ret

I would argue that the opposite semantic would actually make more sense:

  • 32-bit input operands to asm templates must be in canonical sign-extended form
  • 32-bit output operands of asm templates may be assumed to hold sign-extended values

This would allow "chaining" 32-bit asm templates, and interleaving 32-bit asm templates with regular 32-bit arithmetic, without the need for inserting sext.w instructions.

Asm templates that output 32-bit values that aren't in in canonical sign-extended form must be used with a 64-bit variable as output operand. Casting this 64-bit value to a 32-bit type would then add sext.w as needed.

Asm templates that accept any 32-bit values in non-canonical form may be used with a 64-bit expression as input operand.

Furthermore, I think it would make sense to define two builtin functions with the following semantic:

int __builtin_reinterpret_l2i(long x): Assumes x is in the range INT_MIN..INT_MAX and returns the value of x as 32-bit int type.

long __builtin_reinterpret_i2l(int x): Returns x in the lower 32 bits of the result, with an unspecified value in the upper 32 bits.

These functions can then be used to suppress the generation of unnecessary sext.w instructions. They are like C++ reinterpret_cast<>, except for register contents instead of memory contents.

Initially supporting this semantic in the existing tools should be easy, considering that the compiler is always free to add extra sext.w instructions. Thus, the two builtin functions proposed above may simply be aliases for type casts in an initial implementation. For GCC it would be necessary to add additional sext.w for 32-bit input operands in order to stay within the semantic proposed above.

PS: I've proposed this behavior for asm templates on sw-dev back in June (link). I don't think I've previously proposed the above builtin functions in a public forum.

32-bit output operands of asm templates may be assumed to hold sign-extended values

It's not true when the operation isn't sign-extended, such as addwu, subwu and addiwu in B-extension, and there might have some other non-standard extension don't sign-extend. .

I guess one of your goal is eliminate the redundant sign-extension when using functions in rvintrin.h, this problem could be resolve by made a real intrinsic function, then compiler will know the semantic.

Changing the definition and implementation of asm is problematic. A better solution as Kito mentioned is to avoid using asms in the first place, and use RISC-V specific built-in functions that can give the right result.

For __builtin_reinterpret_l2i and i2l, int is not 32-bits for all targets. If this is supposed to be a RISC-V specific builtin, then it needs to be in the RISC-V namespace, e.g. __builtin_riscv_reinterpret_l2i. If this is supposed to be a generic GCC builtin, then it needs to be discussed on the FSF GCC mailing list. But I think this is a hard sell, as the reasons for it are primarily gcc RISC-V implementation issues, and there are better ways to solve those problems.

Changing the definition and implementation of asm is problematic.

I don't think it's really changing the definition because afaict there currently is no definition and from experiments I can tell that sometimes (but not always!) gcc drops the sext.w on input operands. But just because I don't see that happen in my experiments with gcc on output operands it doesn't mean I have a guarantee that the compiler (gcc or any other) will always inject it.

I think it would already be an improvement over the current state do specify what behavior the compiler guarantees in this regard, and what behavior the compiler expects the template to adhere to.

Right now, as a compiler user, the only sane thing for me would be to never use a 32 bit operand on an asm template, and add sext.w to the asm template itself where I need it. But this means I can't mix asm templates with 32-bit arithmetic without ending up with many unnecessary sext.w.

So I think better than specifying whatever behavior gcc currently has, it would make sense to specify a behavior that actually makes sense.

I guess one of your goal is eliminate the redundant sign-extension when using functions in rvintrin.h, this problem could be resolve by made a real intrinsic function, then compiler will know the semantic.

Actually no. It's a more general concern. rvintrin.h will move to intrinsic functions. No question about it. But there are still many use-cases for writing inline asm, and compiler intrinsics for individual instructions doesn't help much when you for example want control over instruction scheduling for a whole block. I think there is great value to providing users with a way of integrating inline asm with C code efficiently.

If this is supposed to be a RISC-V specific builtin, then it needs to be in the RISC-V namespace, e.g. __builtin_riscv_reinterpret_l2i.

Yes, this is meant as a riscv-specific builtin. So __builtin_riscv_reinterpret_l2i. I don't particularly like the reinterpret-i2l/l2i naming scheme. It was just the first thing that came to mind.

Note that just the i2l builtin and the asm semantic that I'm proposing covers all cases of 32-bit asm operands:

(1) 32-bit input, expected sign-extended: Simply pass as-is.

(2) 32-bit input, ignoring upper 32 bits: Pass __builtin_riscv_reinterpret_i2l(x).

(3) 32-bit output, already sign-extended: Simply output as 32-bit type.

(4) 32-bit output, but not yet sign-extended: Output as 64-bit type. Then cast to 32-bit int.

With the presumed current asm template semantic a l2i function could be used to implement (3). But I don't see how (1) could be accomplished.

Keeping the presumed current behavior of the output and changing the behavior of inputs to my proposal would be unproblematic in terms of compatibility, because the new semantic would be a subset of the current behavior.

It would also be less error prone, because currently a user must be aware that they need to sext.w an input in their asm code if they require it to be sign extended.

And by adding both l2i and i2l reinterpret functions we could then also cover all four cases listed above.

asms aren't a RISC-V feature. They are a generic feature. So you are proposing changes that potentially affect all targets, which has a high bar for acceptance. Perhaps we can implement this via a hook, which only RISC-V defines, but then there is a high cost of maintaining something that only RISC-V uses. On the plus side, RISC-V is not the only target affected. MIPS has the exact same problems. But on the other hand, the MIPS folks have lived with this problem for decades and never needed to fix it, so why do we need to fix it?

Personally, if you want to stay sane, I would strongly recommend that you never use asms. They allow you to scribble unchecked on gcc internals. This makes them very powerful, but also very dangerous. The exact meaning of asms can change from one gcc version to the next as gcc internals change. They really should be avoided except where they are unavoidable, like maybe an OS kernel. For application code, there is a much better solution, intrinsics aka built-in functions.

There is also the issue that you can ask for gcc changes, but there is no guarantee that anyone will ever implement them. You shouldn't make plans that require someone else to do work for you. But intrinsics for bitmanip instructions is a useful thing, and a tractable problem, so it is likely that someone like Kito and/or me will be willing to write them.

Keeping the presumed current behavior of the output and changing the behavior of inputs to my proposal would be unproblematic in terms of compatibility, because the new semantic would be a subset of the current behavior.

It would also be less error prone, because currently a user must be aware that they need to sext.w an input in their asm code if they require it to be sign extended.

That would also match the RISC-V calling convention, which would make it the most intuitive semantic imo.

asms aren't a RISC-V feature. They are a generic feature. So you are proposing changes that potentially affect all targets, which has a high bar for acceptance.

How asm operands are passed in registers certainly is architecture specific. I'm not proposing a change that effects, as you write, all targets. Please explain to me in what way my proposed change would effect the semantic of x86_64 asm templates.

MIPS has the exact same problems.

No. It has not. mips64-linux-gnuabi64-gcc -o - -S -Os asm.c on the above file produces the following implementation:

test:
	xor	$2,$4,$5
	sll	$2,$2,0
	foobar $2,$2
	jr	$31

The sll $2,$2,0 instruction zero extends the 32-bit input operand to the asm template, whereas the 32-bit output operand seems to be expected to be already in zero-extended form. This is exactly the behavior I am proposing in my second post. (Except that MIPS is using zero extension and RISC-V uses sign extension of course.)

(This is with gcc version 7.4.0-1ubuntu1~18.04.1, the default mips cross gcc on Ubuntu 18.04.3 LTS.)

And here is aarch64-linux-gnu-gcc -o - -S -Os asm.c:

test:
	eor	w0, w0, w1
	foobar x0,x0
	ret

This is also the behavior I am proposing in my second post. (eor on w register names produces a zero-extended 32-bit result.)

Personally, if you want to stay sane, I would strongly recommend that you never use asms.

I think this is a valid personal opinion, but it is in no way a good argument for crippling asm templates in RISC-V. The use of asm templates in low-level C code is a reality in the industry and I think we should try to provide reasonable support for asm templates to users of RISC-V.

But intrinsics for bitmanip instructions is a useful thing, and a tractable problem, so it is likely that someone like Kito and/or me will be willing to write them.

As I wrote above, this is not about intrinsics for bitmanip instructions.

As I wrote above, I fully agree that for those the way to go is GCC builtin functions.

As I wrote above, "there are still many use-cases for writing inline asm, and compiler intrinsics for individual instructions [don't] help much when you for example want control over instruction scheduling for a whole block".

Also, note that (1) I just propose a semantic for what guarantees can be expected from a RISC-V compiler and what the compiler can expect from a template. I do not propose that we have compiler implementations that produce perfect code for those templates right away. I'm totally fine with the compiler producing sub-optimal code for now. Bit if we define the semantic in a way that prevents us from ever producing optimal code then this is a problem imo.

And (2) AFAICT there is no current formal definition of a semantic for 32-bit asm operands on RV64. This is a big problem imo. The semantic needs to be documented. Even if there's no chance that we can change the semantic to something more reasonable, I don't see how this could possibly justify leaving the semantic, whatever it is, undocumented.

There are no guarantees when using an asm. GCC reserves the right to change the semantics at any time for any reason. And has done so a number of times. I think it would be dangerous to document any behavior, as then people might expect that later gcc versions have the same behavior, and we can't guarantee that. If you want well defined behavior, don't use an asm.

See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85185
for an example that is broken on MIPS same as RISC-V.

I think it would be dangerous to document any behavior, as then people might expect that later gcc versions have the same behavior, and we can't guarantee that.

Then the compiler should output a warning whenever a variable smaller than xlen is passed as asm operand.

Edit: Or, actually, the compiler should issue an error.

Alternatively, instead of issuing a compiler error, the definition could be the following:

  • 32-bit input operands may not be properly sign extended (by the compiler)
  • 32-bit output operands must be properly sign extended (by the template)

That would be compatible with all sensible behaviors of any compiler or compiler version, but would allow an optimizing compiler to get rid of any sext.w on asm template operands.

Remember: I said I'd be totally fine with the compiler generating suboptimal code, I just want a definition that will allow us to write correct asm templates.

In that case an asm template that wants an input operand sign-extended would simply need to cast that value to a 32-bit data type first.

Similarly, if an output operand isn't already sign-extended, the template would simply be required to output it into a 64-bit variable.

This is also the exact definition I originally proposed in June.
https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/9MNUGQwxojY/aYIeT3KzAwAJ

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85185 for an example that is broken on MIPS same as RISC-V.

That behavior would be perfectly within the above definition.

I think it would be dangerous to document any behavior

Please explain to me in what way defining the above behavior, the one I first proposed in June, would be dangerous. What possible change could there in a future gcc version that would cause troubles with this definition?

If you want well defined behavior, don't use an asm.

You are arguing against a strawman. I'm not asking for well defined behavior. I'm asking for a definition what is defined and what isn't. So that programs can written in ways that avoid the undefined things, and compilers (at least in theory) can be written in a way so that they don't insert instructions in order to provide well defined behavior where none is guaranteed.

You are wasting your time and my time. Nothing will ever happen because of this. If you want something to change in gcc, you need to start a discussion on an FSF GCC mailing list, or file a bug into FSF GCC bugzilla.

If you want something to change in gcc

I think I wrote now a couple of times that I don't need anything in GCC to change. Please just read what I wrote! This is so Kafkaesque! WTF is going on here!!!?!?!?!

If you have a problem with what I wrote then reply with an actual argument against what I wrote. But constantly behaving as if I had written something else is just completely unacceptable! Why do you feel you have to behave in this way!?

Since there's obviously not even a willingness to at least read what I'm proposing I am now closing this issue. My June proposal would not have required any GCC changes, would be compatible with all future compiler behaviors one could expect, and would have provided the information necessary to compiler writers and compiler users alike. But when you are not even reading it then it doesn't matter. I can't deal with this Kafkaesque bullshit.

Is there not a simpler solution to this than these platform-specific intrinsics or redefining the semantics of all inline assembly?

How about a new risc-v specific asm register constraint? "r" currently means "pass this in an xlen register", which is reasonable and should not change. That doesn't mean to say we cannot have another constraint letter to denote "32-bit variable in an xlen register", which you can then use in input/output constraints without the extra intrinsics. You would even be able to use this in code that targets both riscv32 and riscv64, I think.

Is there not a simpler solution to this than these platform-specific intrinsics or redefining the semantics of all inline assembly?

Please read the "June proposal":

  • 32-bit input operands may be not properly sign extended (by the compiler)
  • 32-bit output operands must be properly sign extended (by the template)

That's the most conservative definition possible and doesn't require to change any actual compiler. But an optimizing compiler would be allowed by this calling convention to optimize away additional sext.w the compiler would otherwise have to add.

Note that casts to/from 64-bit values can be used to force gcc to perform the sign-extension outside the template when needed.

Afaict this addresses all issues raised in this thread and the original mailing list thread. Too bad nobody even wants to read those two sentences before categorically dismissing the proposal.

How about a new risc-v specific asm register constraint? "r" currently means "pass this in an xlen register", which is reasonable and should not change.

See the email thread I linked in the OP https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/9MNUGQwxojY/aYIeT3KzAwAJ. Using those letters was my original proposal there. It turn's out that's really difficult to do in the current code base.