firelzrd/bore-scheduler

burst/smoothness calculation

dim-geo opened this issue · 10 comments

Right now the burst score is calculated like this:

return (new + old * ((1 << smoothness) - 1)) >> smoothness;

which is equivalent to:

(new + old * (2^smoothness - 1)) / 2^smoothness

If we rearrange terms, it is equal to:

(new - old) / 2^smoothness + old

which helps the user understand the role of the smoothness factor better (maybe document this in the README?).

For new > old we can write it as:

((new - old) >> smoothness) + old

and for new < old we can use:

old - ((old - new) >> smoothness)

Maybe this calculation is faster than the multiplication...
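
For illustration, here is a small standalone sketch of the two forms (not the kernel code; the function names and test values are made up):

#include <assert.h>
#include <stdint.h>

/* Original form: (new + old * (2^smoothness - 1)) / 2^smoothness */
static uint32_t mix_mul(uint32_t new, uint32_t old, unsigned smoothness)
{
	return (new + old * ((1u << smoothness) - 1)) >> smoothness;
}

/* Rearranged form: move 1/2^smoothness of the way from old toward new.
 * Note: when new < old and (old - new) is not a multiple of 2^smoothness,
 * this rounds toward old instead of down, so it can differ from mix_mul() by 1. */
static uint32_t mix_shift(uint32_t new, uint32_t old, unsigned smoothness)
{
	if (new >= old)
		return old + ((new - old) >> smoothness);
	return old - ((old - new) >> smoothness);
}

int main(void)
{
	/* smoothness = 3: the score moves 1/8 of the way toward the new value */
	assert(mix_mul(1600, 800, 3) == 900  && mix_shift(1600, 800, 3) == 900);
	assert(mix_mul(800, 1600, 3) == 1500 && mix_shift(800, 1600, 3) == 1500);
	return 0;
}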

Thank you, as always, for your support.
I read your idea, carefully analyzed every aspect of it, and decided to adopt it.
The new implementation is coming in the next release.

Thank you for being open!
On another note:
I'm still struggling to understand what log2plus1_u64_u32f8 does.

It is a shortened form of:

#define BITS_PER_OCTET 8
#define BITS_PER_U64 (sizeof(u64) * BITS_PER_OCTET)
result.u8[0] = v << (BITS_PER_U64 - msb) >> (BITS_PER_U64 - (BITS_PER_OCTET + 1));

But I cannot write a single shift by (9 - msb) when msb may be greater than 9, because:

ISO/IEC 9899:201x, 6.5.7 Bitwise shift operators:

The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
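
In other words, the return value approximates log2(v) + 1 in fixed point with 8 fractional bits: fls64(v) goes into the second byte, and the 8 bits just below the most significant bit of v go into the lowest byte. A worked example with a made-up input:

v            = 300 = 0b1_0010_1100
msb          = fls64(300) = 9, so excess_bits = 9 - 9 = 0
result.u8[0] = (u8)(300 >> 0) = 0b0010_1100 = 44    (the bits just below the MSB)
result.u8[1] = 9
result.u32   = (9 << 8) | 44 = 2348, i.e. 9.171875 in fixed point,
               which approximates log2(300) + 1 ≈ 9.23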

Maybe you can ditch the x32 union, like this:

static inline u32 log2plus1_u64_u32f8(u64 v) {
	u32 result = 0;
	u16 msb = fls64(v);
	int excess_bits = msb - 9;
	u8 temp = (0 <= excess_bits) ? v >> excess_bits : v << -excess_bits;
	result = temp;
	result += msb << 8;
	return result;
}

and

static inline u32 u8h_u32(u8 v) {
	u32 result = v;
	result <<= 8;
	return result;
}

The intention behind this implementation is to use byte-grain memory access to replace/eliminate the traditional bit-shift calculations, simplify the code, and reduce the number of instructions.
(As for u8h_u32, both versions are compiled to the same code by the latest compilers.)
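
(For reference, x32 is just a small union that lets a u32 be written and read byte by byte; on a little-endian target it is along these lines:)

typedef union {
	u32 u32;
	u8  u8[4];
} x32;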

Current implementation:

u32 log2plus1_u64_u32f8(u64 v) {
	x32 result;
	int msb = fls64(v);
	int excess_bits = msb - 9;
	result.u8[0] = (0 <= excess_bits)? v >> excess_bits: v << -excess_bits;
	result.u8[1] = msb;
	return result.u32;
}

15-17 instruction cycles, depending on the branch path

log2plus1_u64_u32f8:
        mov     eax, -1
        bsrq rdi,rax
        mov     ecx, eax
        lea     edx, [rax+1]
        sub     ecx, 8
        js      .L2
        shr     rdi, cl
        movzx   eax, dil
        mov     ah, dl
        ret
.L2:
        mov     ecx, 8
        sub     ecx, eax
        sal     rdi, cl
        movzx   eax, dil
        mov     ah, dl
        ret

Proposed implementation:

u32 log2plus1_u64_u32f8_2(u64 v) {
	u32 result = 0;
	u16 msb = fls64(v);
	int excess_bits = msb - 9;
	u8 temp = (0 <= excess_bits) ? v >> excess_bits : v << -excess_bits;
	result = temp;
	result += msb << 8;
	return result;
}

17-19 instruction cycles, depending on the branch path

log2plus1_u64_u32f8_2:
        mov     eax, -1
        bsrq rdi,rax
        add     eax, 1
        movzx   edx, ax
        mov     ecx, edx
        sub     ecx, 9
        js      .L6
        shr     rdi, cl
        sal     eax, 8
        movzx   edi, dil
        and     eax, 16776960
        add     eax, edi
        ret
.L6:
        mov     ecx, 9
        sal     eax, 8
        sub     ecx, edx
        and     eax, 16776960
        sal     rdi, cl
        movzx   edi, dil
        add     eax, edi
        ret

Sorry, I was partially wrong about the current and proposed u8h_u32 being compiled into the same instructions.
Clang 16 outputs identical code, but GCC 13 outputs clearly faster code for your proposed version.

Clang:

u8h_u32:
        mov     eax, edi
        shl     eax, 8
        ret
u8h_u32_2:
        mov     eax, edi
        shl     eax, 8
        ret

GCC:

u8h_u32:
        xor     eax, eax
        mov     edx, edi
        mov     ah, dl
        ret
u8h_u32_2:
        movzx   eax, dil
        sal     eax, 8
        ret
u8h_u32: 0.518853 sec
u8h_u32_2: 0.242295 sec
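
(The exact benchmark harness isn't shown here; as an assumed sketch of how such per-call timings can be reproduced, a loop like the following, built with optimizations, would do:)

#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef uint32_t u32;
typedef uint8_t  u8;

/* Variant under test -- swap in the union-based u8h_u32 to compare. */
static u32 u8h_u32_2(u8 v) { u32 result = v; result <<= 8; return result; }

int main(void)
{
	volatile u32 sink = 0;   /* keeps the calls from being optimized away */
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (uint64_t i = 0; i < 1000000000ull; i++)
		sink += u8h_u32_2((u8)i);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("u8h_u32_2: %f sec\n", (double)(t1.tv_sec - t0.tv_sec)
	                              + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}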

So I'll revert the last change for this part and use the bit shift.
Thank you!

Can you check in the assembly whether tolerance is calculated at every call of calc_burst_penalty?

static inline u32 calc_burst_penalty(struct sched_entity *se) {
	u32 greed, tolerance, penalty, scaled_penalty;
	
	greed = log2plus1_u64_u32f8(se->burst_time);
	tolerance = sched_burst_penalty_offset << 8;
	penalty = max(0, (s32)greed - (s32)tolerance);
	scaled_penalty = penalty * sched_burst_penalty_scale >> 10;

	return min(MAX_BURST_PENALTY, scaled_penalty);
}

If it is, it makes sense to ditch sched_burst_penalty_offset and expose tolerance directly as a sysctl.
The tolerance variable can have a range from 0 to 16384, with a default of 4608 in CFS.

and transform the code to:

static inline u32 calc_burst_penalty(struct sched_entity *se) {
	u32 greed, scaled_penalty;

	greed = log2plus1_u64_u32f8(se->burst_time);
	if (greed <= tolerance)   /* tolerance: the new sysctl, replacing offset << 8 */
		return 0;

	scaled_penalty = ((s32)greed - (s32)tolerance) * sched_burst_penalty_scale >> 10;
	return min(MAX_BURST_PENALTY, scaled_penalty);
}
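
Concretely, exposing tolerance directly as a sysctl would mean something like the following ctl_table entry (the variable names, table name, default, and limits are only illustrative):

static int tolerance     = 4608;    /* 18 << 8 */
static int tolerance_max = 16384;

static struct ctl_table sched_bore_sysctls[] = {
	{
		.procname     = "sched_burst_penalty_tolerance",
		.data         = &tolerance,
		.maxlen       = sizeof(int),
		.mode         = 0644,
		.proc_handler = proc_dointvec_minmax,
		.extra1       = SYSCTL_ZERO,
		.extra2       = &tolerance_max,
	},
	{}
};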

Great, I understand what you mean. I also considered something like that before.
Back then the offset was as low as 12 (the penalty starts counting up at a 4 µs burst), and branching didn't have much merit because few tasks could have a score of zero.
But today it's 18 (262 µs), and the majority of tasks usually have a score of zero, so it may be useful to make this change now.
Please let me consult the compilers and consider the change.
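
For reference, assuming burst_time is counted in nanoseconds (as delta_exec is in CFS), those figures are simply 2^offset:

2^12 ns = 4096 ns   ≈ 4 µs
2^18 ns = 262144 ns ≈ 262 µs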

On a Zen 2 CPU I ran two different benchmarks:

  1. one that invokes the function(s) with a random bit pattern, filled up to a random bit position 0-64, as the argument.
  2. one that invokes the function(s) with a simple iterator as the argument. No costly generation of random bit patterns, and a 100% probability of greed <= tolerance, so the observed performance gain should be maximized.

Contrary to what we expected, not much difference was observed in either benchmark. Even with a 100% probability of greed <= tolerance, no more than a 3% (near function-level) performance gain was observed.
This result was surprising to me.
On a simpler architecture we may see a different tendency, though.

Thanks for all the contributions!
I'm closing the issue now.