burst/smoothness calculation
dim-geo opened this issue · 10 comments
Right now burst score is calculated like:
return (new + old * ((1 << smoothness) - 1)) >> smoothness;
which is like:
(new + old * (2^smoothness - 1)) / 2^smoothness
If we rearrange terms (expanding old * (2^smoothness - 1) to old * 2^smoothness - old), it is equal to:
(new - old) / 2^smoothness + old
which helps the user better understand the role of the smoothness factor (maybe document this in the README?).
To keep the shifted difference non-negative, we can write it, if new > old, like:
((new - old) >> smoothness) + old
and for new < old, we can use this:
old - ((old - new) >> smoothness)
Maybe this calculation is faster than the multiplication...
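For illustration, a minimal standalone sketch of this branch-based update (the function name, typedef, and standalone framing are mine, not from the scheduler source):

#include <stdint.h>

typedef uint64_t u64;

/* Move old toward new by 1/2^smoothness of their difference,
 * using only a subtraction and a shift instead of the multiplication. */
static inline u64 burst_update(u64 new, u64 old, unsigned int smoothness)
{
	if (new > old)
		return old + ((new - old) >> smoothness);
	return old - ((old - new) >> smoothness);
}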
Thank you, as always, for your support.
I read your idea, carefully analyzed every aspect of it, and decided to adopt it.
The new implementation is coming in the next release.
Thank you for being open!
On another issue:
Still struggling to understand what log2plus1_u64_u32f8 does.
It is a shortened form of:
#define BITS_PER_OCTET 8
#define BITS_PER_U64 (sizeof(u64) * BITS_PER_OCTET)
result.u8[0] = v << (BITS_PER_U64 - msb) >> (BITS_PER_U64 - (BITS_PER_OCTET + 1));
But I cannot write (9 - msb) as the shift count if msb may be greater than 9, because:
ISO 9899:201x 6.5.7 Bit-wise shift operators:
The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
Maybe you can ditch the x32 union like this:
static inline u32 log2plus1_u64_u32f8(u64 v) {
	u32 result = 0;
	u16 msb = fls64(v);
	int excess_bits = msb - 9;
	u8 temp = (0 <= excess_bits) ? v >> excess_bits : v << -excess_bits;
	result = temp;
	result += msb << 8;
	return result;
}
and
static inline u32 u8h_u32(u8 v) {
	u32 result = v;
	result <<= 8;
	return result;
}
The intention behind this implementation is to use byte-grain memory access to replace or eliminate the traditional bit-shift calculations, simplify the code, and reduce the instruction count.
(The other one, u8h_u32, is compiled to the same code by the latest compilers.)
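(For reference, the x32 union isn't shown in this thread; judging from the byte accesses below, it is presumably a little-endian byte-view type along these lines:)

typedef union {
	u32 u32;
	u8  u8[4];
} x32;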
Current implementation:
u32 log2plus1_u64_u32f8(u64 v) {
	x32 result;
	int msb = fls64(v);
	int excess_bits = msb - 9;
	result.u8[0] = (0 <= excess_bits) ? v >> excess_bits : v << -excess_bits;
	result.u8[1] = msb;
	return result.u32;
}
15-17 instruction cycles, depending on the branch path:
log2plus1_u64_u32f8:
mov eax, -1
bsrq rdi,rax
mov ecx, eax
lea edx, [rax+1]
sub ecx, 8
js .L2
shr rdi, cl
movzx eax, dil
mov ah, dl
ret
.L2:
mov ecx, 8
sub ecx, eax
sal rdi, cl
movzx eax, dil
mov ah, dl
ret
Proposed implementation:
u32 log2plus1_u64_u32f8_2(u64 v) {
	u32 result = 0;
	u16 msb = fls64(v);
	int excess_bits = msb - 9;
	u8 temp = (0 <= excess_bits) ? v >> excess_bits : v << -excess_bits;
	result = temp;
	result += msb << 8;
	return result;
}
17-19 instruction cycles, depending on the branch path:
log2plus1_u64_u32f8_2:
mov eax, -1
bsrq rdi,rax
add eax, 1
movzx edx, ax
mov ecx, edx
sub ecx, 9
js .L6
shr rdi, cl
sal eax, 8
movzx edi, dil
and eax, 16776960
add eax, edi
ret
.L6:
mov ecx, 9
sal eax, 8
sub ecx, edx
and eax, 16776960
sal rdi, cl
movzx edi, dil
add eax, edi
ret
Sorry, I was partially wrong about the current and proposed u8h_u32 being compiled into the same instructions.
Clang 16 outputs identical code for both, but GCC 13 outputs obviously faster code for your proposed version.
Clang:
u8h_u32:
mov eax, edi
shl eax, 8
ret
u8h_u32_2:
mov eax, edi
shl eax, 8
ret
GCC:
u8h_u32:
xor eax, eax
mov edx, edi
mov ah, dl
ret
u8h_u32_2:
movzx eax, dil
sal eax, 8
ret
u8h_u32: 0.518853 sec
u8h_u32_2: 0.242295 sec
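(For context, a hypothetical reconstruction of the kind of harness that could produce such numbers; the iteration count, union shape, and overall framing are my assumptions, not the actual benchmark used:)

#include <stdio.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t u32;
typedef uint8_t u8;

/* Byte-view version (what GCC 13 compiles to the mov ah, dl sequence). */
static u32 u8h_u32(u8 v) {
	union { u32 u32; u8 u8[4]; } r = { 0 };
	r.u8[1] = v;
	return r.u32;
}

/* Plain-shift version. */
static u32 u8h_u32_2(u8 v) {
	u32 r = v;
	r <<= 8;
	return r;
}

int main(void) {
	volatile u32 sink = 0;	/* keeps the loops from being optimized away */
	clock_t t0 = clock();
	for (uint64_t i = 0; i < 1000000000ULL; i++)
		sink += u8h_u32((u8)i);
	printf("u8h_u32:   %f sec\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
	t0 = clock();
	for (uint64_t i = 0; i < 1000000000ULL; i++)
		sink += u8h_u32_2((u8)i);
	printf("u8h_u32_2: %f sec\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
	return 0;
}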
So I'll revert the last change for this part and go back to the bit shift.
Thank you!
Can you check in the assembly whether tolerance is recalculated at every call of calc_burst_penalty?
static inline u32 calc_burst_penalty(struct sched_entity *se) {
	u32 greed, tolerance, penalty, scaled_penalty;
	greed = log2plus1_u64_u32f8(se->burst_time);
	tolerance = sched_burst_penalty_offset << 8;
	penalty = max(0, (s32)greed - (s32)tolerance);
	scaled_penalty = penalty * sched_burst_penalty_scale >> 10;
	return min(MAX_BURST_PENALTY, scaled_penalty);
}
If so, it makes sense to ditch sched_burst_penalty_offset and expose tolerance directly as the sysctl.
The tolerance variable can have a range from 0 to 16384, with a default of 4608 (i.e. 18 << 8) in CFS,
and transform the code to:
static inline u32 calc_burst_penalty(struct sched_entity *se) {
	u32 greed, scaled_penalty;
	greed = log2plus1_u64_u32f8(se->burst_time);
	if (greed <= tolerance)
		return 0;
	scaled_penalty = ((s32)greed - (s32)tolerance) * sched_burst_penalty_scale >> 10;
	return min(MAX_BURST_PENALTY, scaled_penalty);
}
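If adopted, the knob could be wired up roughly like this (a hypothetical sketch: the procname, table name, and registration details are assumptions and vary by kernel version):

static u32 tolerance = 4608;		/* 18 << 8, matching the current default */
static const u32 tolerance_min = 0;
static const u32 tolerance_max = 16384;

static struct ctl_table sched_bore_sysctls[] = {
	{
		.procname	= "sched_burst_penalty_tolerance",
		.data		= &tolerance,
		.maxlen		= sizeof(u32),
		.mode		= 0644,
		.proc_handler	= proc_douintvec_minmax,
		.extra1		= (void *)&tolerance_min,
		.extra2		= (void *)&tolerance_max,
	},
	{}
};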
Great. I understand what you mean. I also considered something like that before.
Back then the offset was as low as 12 (counting starts at a 4us burst, since 2^12 ns ≈ 4us), and branching didn't have much merit because few tasks could have a score of zero.
But today it's 18 (2^18 ns ≈ 262us), and the majority of tasks usually have a score of zero. It may be useful to make this change now.
Please let me consult the compilers and consider the change.
On a Zen2 CPU I ran two different benchmarks:
- one that invokes the function(s) with a random bit pattern, filled up to a random bit position (0-64), as the argument (see the sketch after this list);
- one that invokes the function(s) with a simple iterator as the argument: no costly generation of random bit patterns, and a 100% probability of greed <= tolerance, so that the observed performance gain should be maximized.
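(A hypothetical reconstruction of the first benchmark's argument generator; the actual one was not posted, and rand() is weaker than a real run would want, but it shows the shape:)

#include <stdint.h>
#include <stdlib.h>

/* Random bit pattern truncated at a random bit position 0-64,
 * matching the first benchmark's description above. */
static uint64_t random_arg(void)
{
	uint64_t bits = ((uint64_t)rand() << 33) ^ ((uint64_t)rand() << 16) ^ (uint64_t)rand();
	int pos = rand() % 65;	/* how many low-order bits to keep */
	return pos ? bits >> (64 - pos) : 0;	/* pos == 0 avoids a 64-bit shift */
}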
Contrary to what we expected, not much difference was observed in either benchmark. Even with a 100% probability of greed <= tolerance, no more than a 3% (near function-level) performance gain was observed.
This result was surprising to me.
On a simpler architecture we may see a different tendency, though.
Thanks for all the contribution!
Now I'm closing the issue.