Not filling branch delay slot by moving CHERI instructions or other instructions across CHERI instructions
jonwoodruff opened this issue · 10 comments
When this is built:
int partition( int a[], int l, int r) {
    int pivot, i, j, t;
    pivot = a[l];
    i = l; j = r+1;
    while( 1)
    {
        do ++i; while( a[i] <= pivot && i <= r );
        do --j; while( a[j] > pivot );
        if( i >= j ) break;
        t = a[i]; a[i] = a[j]; a[j] = t;
    }
    t = a[l]; a[l] = a[j]; a[j] = t;
    return j;
}
void quickSort( int a[], int l, int r)
{
    int j;
    if( l < r ) {
        j = partition( a, l, r);
        quickSort( a, l, j-1);
        quickSort( a, j+1, r);
    }
}
The instructions around the "JALR" (the only one) when built for purecap are:
sll $5, $1, 0
cmove $c13, $c19
cgetpccsetoffset $c12, $3
cjalr $c12, $c17
nop
The cmove or the sll could be in the branch delay slot.
When built for MIPS the branch delay slot is filled:
move $4, $17
move $5, $2
sw $1, 0($8)
addiu $1, $21, -1
jalr $25
sll $6, $1, 0
Slightly simplified test case:
__attribute__((always_inline))
static int partition( int a[], int l, int r) {
    int pivot, i, j, t;
    pivot = a[l];
    i = l; j = r+1;
    while(i >= j)
    {
        do --j; while( a[j] > pivot );
        a[i] = a[j]; a[j] = t;
    }
    return j;
}
void quickSort( int a[], int l, int r)
{
    int j;
    j = partition( a, l, r);
    quickSort( a, l, j-1);
    quickSort( a, j+1, r);
}
n64 version:
addiu $1, $20, -1
ld $25, %call16(quickSort)($gp)
sll $6, $1, 0
move $4, $17
jalr $25
move $5, $2
addiu $5, $20, 1
Pure-cap version:
addiu $1, $18, -1
ld $3, %call16(quickSort)($gp)
sll $5, $1, 0
cgetpccsetoffset $c12, $3
cmove $c3, $c18
move $4, $2
cmove $c13, $c19
cjalr $c12, $c17
nop
addiu $4, $18, 1
It looks as if cmove isn't being put in the delay slot because it's marked as having unmodelled side effects. This is also preventing any instructions from being reordered across it.
Actually, it looks as if this is set for pretty much all capability instructions, which is probably impeding a lot of potential optimisations.
And this is required because of the implicit C0 behaviour. The real fix probably involves adding an implicit use of C0 to all of the MIPS loads and stores.
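To make the diagnosis above concrete, here is a minimal toy model of a delay-slot filler, written as a sketch rather than as LLVM's actual implementation. The `Insn` struct, the register numbering, and the `find_filler` and `demo_find_filler` names are all hypothetical, and the operand roles assumed for `cjalr` (jump through `$c12`, link in `$c17`) are my reading of the listing. It scans backwards from the jump and treats any instruction flagged as having unmodelled side effects as a barrier, which is enough to reproduce the `nop` in the first purecap sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register numbering, for this sketch only. */
enum { R1, R3, R5, C12, C13, C17, C19 };
#define BIT(r) (UINT64_C(1) << (r))

typedef struct {
    const char *text;  /* assembly text, for readability only */
    uint64_t defs;     /* registers written */
    uint64_t uses;     /* registers read */
    bool is_cap;       /* capability instruction */
} Insn;

/*
 * Scan backwards from the jump for an instruction that can move into the
 * delay slot. Returns its index, or -1 if the slot must hold a nop.
 * If cap_is_barrier is set, capability instructions are treated as having
 * unmodelled side effects: they cannot move, and nothing moves across them.
 */
static int find_filler(const Insn *block, int jump_idx, bool cap_is_barrier) {
    assert(jump_idx >= 0);
    uint64_t later_defs = block[jump_idx].defs;
    uint64_t later_uses = block[jump_idx].uses;
    for (int i = jump_idx - 1; i >= 0; --i) {
        const Insn *c = &block[i];
        if (cap_is_barrier && c->is_cap)
            return -1;
        /* RAW, WAR and WAW checks against everything the candidate would pass. */
        bool conflict = (c->defs & later_uses) || (c->uses & later_defs) ||
                        (c->defs & later_defs);
        if (!conflict)
            return i;
        later_defs |= c->defs;
        later_uses |= c->uses;
    }
    return -1;
}

/* The first purecap sequence from this issue, without the trailing nop. */
static int demo_find_filler(bool cap_is_barrier) {
    const Insn block[] = {
        { "sll $5, $1, 0",             BIT(R5),  BIT(R1),  false },
        { "cmove $c13, $c19",          BIT(C13), BIT(C19), true  },
        { "cgetpccsetoffset $c12, $3", BIT(C12), BIT(R3),  true  },
        { "cjalr $c12, $c17",          BIT(C17), BIT(C12), true  },
    };
    return find_filler(block, 3, cap_is_barrier);
}
```

With the barrier behaviour enabled, `demo_find_filler(true)` finds no candidate, so a `nop` is needed. With only register dependencies considered, `demo_find_filler(false)` skips `cgetpccsetoffset` (it defines `$c12`, which the jump reads) and selects the `cmove` at index 1.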
I wonder how crazy it would be to simply disable all modifications to special capability registers except through dedicated instructions. How much overhead would we incur by losing the ability to directly read/write C0 in all capability instructions? By "disable" I mean at the hardware level, so implicit C0 modifications would do nothing or trigger an exception. We already have similar plans in our document. Of course, this would be at least a flag week...
It would be nice to have an experimental run and see. I don't think that we generate stores to $c0 from anything other than an intrinsic in the compiler. We do rely on being able to read $c0 for ctoptr, but I don't think we ever insert modifications.
Ideally, I'd like to make $c0 a capability version of $zero, make $ddc a special register, and have special cases for ctoptr that use $ddc and $pcc.
Modifying the MIPS loads and stores to implicitly use $c0 is complicated by the fact that $c0 is not present on MIPS...
I guess we should wait until we have CWriteHwr and then only treat that as a hazard?
Ideally, yes, though I believe it's possible to teach the LLVM back end that MIPS always has C0, but doesn't have any instructions to write to it...
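As a rough illustration of the proposal discussed above, the sketch below models register effects only, and is not how the LLVM back end represents them. Legacy MIPS loads and stores carry an implicit use of C0, ordinary capability instructions do not define it, and only a dedicated write defines it. The `Effects` struct, the register numbering, and the helper names are hypothetical; `write_c0` stands in for whatever dedicated instruction ends up writing C0, and memory dependencies are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register numbering, for this sketch only. */
enum { REG_C0, REG_C13, REG_C19, REG_R1, REG_R8 };
#define BIT(r) (UINT64_C(1) << (r))

typedef struct {
    uint64_t defs;  /* registers written, including implicit ones */
    uint64_t uses;  /* registers read, including implicit ones */
} Effects;

/* Two instructions may swap if they share no RAW, WAR or WAW dependency. */
static bool can_reorder(Effects a, Effects b) {
    return (a.defs & b.uses) == 0 && (a.uses & b.defs) == 0 &&
           (a.defs & b.defs) == 0;
}

/* Legacy MIPS store, e.g. sw $1, 0($8): implicitly reads C0. */
static Effects legacy_store(int value_reg, int base_reg) {
    Effects e = { 0, BIT(value_reg) | BIT(base_reg) | BIT(REG_C0) };
    return e;
}

/* Ordinary capability move, e.g. cmove $c13, $c19: does not touch C0. */
static Effects cap_move(int dst, int src) {
    Effects e = { BIT(dst), BIT(src) };
    return e;
}

/* Hypothetical dedicated write to C0: the only instruction that defines it. */
static Effects write_c0(int src) {
    Effects e = { BIT(REG_C0), BIT(src) };
    return e;
}
```

Under this model, `can_reorder(cap_move(REG_C13, REG_C19), legacy_store(REG_R1, REG_R8))` is true, while `can_reorder(write_c0(REG_C19), legacy_store(REG_R1, REG_R8))` is false, so the write to C0 is the only hazard the scheduler has to respect.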
This is now fixed in the multicapsize branch. It can probably be cherry-picked to master, or we can wait until it's time to merge that branch.