Strange performance drops with const literals in closures
idkravitz opened this issue · 25 comments
I asked about it on users.rust-lang.org and one user suggested that it may indeed be a bug in the optimizer. So I should state that the things below are reproducible with both the latest nightly and stable Rust on the x86_64-pc-windows-msvc triple and on x86_64-unknown-linux-gnu (tested inside WSL).
Our test subject is the following snippet, which solves the maximum subarray problem:
```rust
pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32) {
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > 0 {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    (left_idx, right_idx + 1, sum)
}
```
If we benchmark it with the criterion crate using this benchmark code:
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_linear(c: &mut Criterion) {
    const N: usize = 1000000;
    c.bench_function(&format!("max_subarray([..]) N = {}", N), |b| {
        b.iter(|| max_subarray::max_subarray_bad(black_box(&vec![0; N])))
    });
}

criterion_group!(benches, benchmark_linear);
criterion_main!(benches);
```
Then the output of `cargo bench` on my machine would be:
```
Benchmarking max_subarray([..]) N = 1000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.6s or reduce sample count to 30.
max_subarray([..]) N = 1000000
                        time:   [3.2324 ms 3.2700 ms 3.3141 ms]
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
     Running target\release\deps\scratch-b68a42551ab01289.exe
```
But the slight change of moving the `0` in the expression `s.1 > 0` into a `let` binding outside the closure makes a great difference. So the function is now this:
```rust
pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32) {
    let zro = 0;
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > zro {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    (left_idx, right_idx + 1, sum)
}
```
But the `cargo bench` output indicates an almost 20% performance gain!
```
Benchmarking max_subarray([..]) N = 1000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.9s or reduce sample count to 40.
max_subarray([..]) N = 1000000
                        time:   [2.5705 ms 2.5806 ms 2.5913 ms]
                        change: [-20.260% -19.668% -19.124%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
     Running target\release\deps\scratch-b68a42551ab01289.exe
```
You can check that switching the function back and forth between `0` and `zro` in that expression indeed results in a 20% performance change.
By the way, changing `let zro = 0` into `const zro: i32 = 0` results in the performance drop too.
It looks like an optimizer bug to me. Could someone verify it?
OT hint: if you open your code blocks with triple backticks followed by the word `rust`, GitHub will highlight the syntax 🙂
Thanks, I was wondering what's wrong with the syntax highlighting 🙂
I also just checked it inside WSL with the latest stable and nightly x86_64-unknown-linux-gnu, and the results showed a 33% performance difference on my machine.
It's not the optimizer: https://godbolt.org/z/LcLYaw
You can try disabling incremental compilation and reducing codegen-units to 1 to see if you can still reproduce it.
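For reference (my addition, not from the thread), those two knobs live in the Cargo profile; the `bench` profile inherits from `release`, so a sketch like this should pin them down:

```toml
# Sketch: pin down the settings suggested above for benchmarking.
[profile.bench]
incremental = false   # rule out incremental-compilation effects
codegen-units = 1     # let LLVM optimize the whole crate as one unit
```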
How weird indeed: if I disable incremental and set codegen-units to 1, both versions run at the slowest time, but if I leave the values at their defaults, the one with `let` runs 20% faster.
With `incremental=false` and `codegen-units=1`:
```
Benchmarking max_subarray([..]) N = 1000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.7s or reduce sample count to 30.
max_subarray([..]) N = 1000000
                        time:   [3.3078 ms 3.3099 ms 3.3130 ms]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

Benchmarking max_subarray([..]) N = 1000000 #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.6s or reduce sample count to 30.
max_subarray([..]) N = 1000000 #2
                        time:   [3.2961 ms 3.2976 ms 3.2992 ms]
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
```
If I comment out the profile settings in Cargo.toml, the output is:
```
Benchmarking max_subarray([..]) N = 1000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 17.1s or reduce sample count to 30.
max_subarray([..]) N = 1000000
                        time:   [3.3745 ms 3.3775 ms 3.3808 ms]
                        change: [+1.8443% +2.0076% +2.1392%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

Benchmarking max_subarray([..]) N = 1000000 #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 13.4s or reduce sample count to 40.
max_subarray([..]) N = 1000000 #2
                        time:   [2.6242 ms 2.6262 ms 2.6286 ms]
                        change: [-20.425% -20.281% -20.087%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
```
So the weird thing is that I found `codegen-units=16` actually produces more efficient code for the bench than `codegen-units=1` (for the version with the `let` binding).
BTW, `rm -rf ./target` doesn't change anything.
I made a setup crate for this issue; you can check it here.
P.S. Or it may be a bug in criterion; I didn't double-check the runtime with a clock :)
It also might not be a bug at all; benchmarking such things is incredibly tricky, and you might get hit by code alignment issues or just sheer bad luck. When I benchmark the first function that you posted multiple times in a row, I sometimes get around 10% differences (by running the SAME code). And as @mati865 posted, it seems to generate exactly the same assembly, as it should.
Really the only way to check what happens is for you to disassemble the two binaries produced on your system and check that the assembly is indeed the same. If it is, it's not a rustc issue :-)
Try compiling with debug symbols and then disassemble the binaries, for example with `objdump -S -l`, to see if the assembly instructions match between the two versions.
BTW, the exact same issue happens if I throw away criterion and use the built-in test::Bencher (by "same" I mean it responds to the codegen-units setting):
```
running 2 tests
test tests::benchmark_bad  ... bench:   3,293,805 ns/iter (+/- 17,375)
test tests::benchmark_good ... bench:   3,309,870 ns/iter (+/- 46,277)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
```
```
C:\Users\idkra\dev\rust\issue-68632>cargo bench
   Compiling issue-68632 v0.1.0 (C:\Users\idkra\dev\rust\issue-68632)
    Finished bench [optimized] target(s) in 0.51s
     Running target\release\deps\issue_68632-80a8ccd9df83eb66.exe

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target\release\deps\bench_builtin-30e90d201571b52f.exe

running 2 tests
test tests::benchmark_bad  ... bench:   3,375,950 ns/iter (+/- 32,356)
test tests::benchmark_good ... bench:   2,596,490 ns/iter (+/- 14,938)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
```
> Try compiling with debug symbols and then disassemble the binaries for example with objdump -S -l to see if the assembly instructions match between the two versions.
Well, it's past midnight in my TZ, so I'll dig into it tomorrow.
P.S.
I have tested this issue on two machines running Windows 10 and both reproduced it.
So as a last thing today, I made a simple binary crate with the following main.rs:
```rust
pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32) {
    let zro: i32 = 0; // substitute let with const for slow version and vice versa
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > zro {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    (left_idx, right_idx + 1, sum)
}

fn main() {
    const N: usize = 1000000;
    let v = vec![0; N];
    const LOOPS: usize = 1000;
    let mut total_sum = 0;
    for _ in 0..LOOPS {
        let (_, _, sum) = max_subarray_bad(&v);
        total_sum += sum;
    }
    println!("{}", total_sum);
}
```
Then I compiled two versions, with the `let` and without, and ran `time ./my_binary` a few times. Here are my results:
```
$ time ../target/release/stupid-way
0

real    0m2.198s
user    0m2.188s
sys     0m0.000s

$ time ../target/release/stupid-way
0

real    0m2.195s
user    0m2.203s
sys     0m0.016s

$ time ../target/release/stupid-way
0

real    0m2.192s
user    0m2.188s
sys     0m0.000s

$ time ../target/release/stupid-way
0

real    0m2.188s
user    0m2.172s
sys     0m0.031s

$ time ../target/release/stupid-way
0

real    0m2.191s
user    0m2.172s
sys     0m0.000s
```
And for the variant with `let`:
```
$ time ../target/release/stupid-way
0

real    0m1.409s
user    0m1.406s
sys     0m0.000s

$ time ../target/release/stupid-way
0

real    0m1.406s
user    0m1.391s
sys     0m0.016s

$ time ../target/release/stupid-way
0

real    0m1.404s
user    0m1.391s
sys     0m0.016s

$ time ../target/release/stupid-way
0

real    0m1.404s
user    0m1.391s
sys     0m0.016s

$ time ../target/release/stupid-way
0

real    0m1.405s
user    0m1.391s
sys     0m0.000s
```
As you can see, the difference is significant and reproducible every time.
Good news: I narrowed it down; with debug=true, opt-level=3, and overflow-checks=false the issue is reproducible in my last code. With overflow-checks=true the issue isn't reproducible. So I'm glad that I'll be able to disassemble with debug info tomorrow.
Edit: I was shocked to find that codegen-units affects it.
For `codegen-units=16`:

```
real    0m2.803s
user    0m2.797s
sys     0m0.000s
```

For `codegen-units=1`:

```
real    0m20.675s
user    0m20.656s
sys     0m0.016s
```

A 10x performance drop!
I tried it too on my Linux notebook. If I use `codegen-units=1`, both binaries are completely identical (incremental compilation doesn't seem to make a difference). With the default setting there is indeed a difference in the compiled assembly.
With the constant it tests directly against zero; with the variable it compares the value with a register which holds the variable (this is unlikely to cause the difference by itself, but it probably triggered a different codegen for something else).
However, since it produces equivalent code with `codegen-units=1`, I'm not sure if this is an actual issue.
@Kobzol try this snippet of main.rs:
```rust
pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32) {
    let zro: i32 = 0;
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > zro {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    (left_idx, right_idx + 1, sum)
}

fn main() {
    const N: usize = 1000000;
    let v = vec![0; N];
    const LOOPS: usize = 2000;
    let mut total_sum = 0;
    for _ in 0..LOOPS {
        let (_, _, sum) = max_subarray_bad(&v);
        total_sum += sum;
    }
    println!("{}", total_sum);
}
```
With:

```toml
[profile.dev]
opt-level = 3
debug = true
overflow-checks = false
codegen-units = 1
```

and then set `codegen-units = 16`.
Run the binary with the time command; you'll see how drastically the execution time changes without editing the source code. That's clearly a bug.
Well yes, when you change the number of codegen units, you are affecting the optimizer (more units = more parallelization, but possibly worse code). So if you increase the unit count (or use the default), the generated code can be slower; that is to be expected. For maximum performance you should use `codegen-units=1` anyway.
@Kobzol `codegen-units=16` results in code that runs in 2-3 seconds, while `codegen-units=1` results in code that runs in 20+ seconds. How can this be maximum performance?
On my system I get the following numbers (in `release`):

| Variant | Codegen units | Time  |
|---------|---------------|-------|
| let     | 1             | 0.9 s |
| let     | 16            | 0.9 s |
| const   | 1             | 0.9 s |
| const   | 16            | 1.4 s |
Ok, so what about

```toml
[profile.dev]
opt-level = 3
debug = true
overflow-checks = false
codegen-units = 16
```

versus

```toml
[profile.dev]
opt-level = 3
debug = true
overflow-checks = false
codegen-units = 1
```

Or can no guarantees be asserted for performance with debug=true?
`debug=true` AFAIK generates line numbers + runtime debug information (frame pointers etc.); that shouldn't have a large effect here. To bring `dev` to the same parameters as `release`, you'd have to also set `debug-assertions` to false and `incremental` to false (this seems to have the largest effect here for some reason). But then I don't see any point in using the `dev` profile.
If you just want to get debug information in release, use:

```toml
[profile.release]
debug = true
codegen-units = 1
```
Can you fill in the table I posted with results from your machine in `release`?
> Can you fill and send the table I sent on your computer in release?
I can, but it's not worth the effort; I just checked, and my results have the same ratios as yours, just a different magnitude.
So you're right, it's not a bug at all; it's just me as a newbie who didn't know about `codegen-units=1`. Well, I'm sorry I took your time investigating things that work as designed. Should I close this issue myself?
I would leave it open for now - I'm not from the rustc team, I just wanted to find out what causes the difference so that we have a better idea of the real problem. Maybe some of the team members might see it differently and indeed consider this a bug :)
Ha, in any case thank you; I learned some new things about Rust, which is good, at least for me :) Well, I'll agree with you and let the rustc team judge what to do with this issue.
Today I realized that I got sidetracked yesterday and missed a few important things. I went back to the setup repo for this issue, and some things still bother me.
`cargo bench` gives worse results with `codegen-units=1` than with `codegen-units=16`. Namely, for `codegen-units=1` on my machine:

```
test tests::benchmark_good ... bench:   3,291,420 ns/iter (+/- 19,735)
```

And for `codegen-units=16` on my machine:

```
test tests::benchmark_good ... bench:   2,616,990 ns/iter (+/- 47,830)
```

But things start to work as intended if I isolate things in a release binary. So the question for today is whether bench isn't as performance-reliable as intended, or whether these things can happen in a best-optimized release too.
P.S. @Kobzol, what do you think about the inverse performance dependency on codegen-units for bench?
So on my PC the codegen units don't really affect anything for the bench; however, the `good` case is consistently about 30% faster (I tried different code alignments and changed the order; it still stays the same). And indeed the assembly of the two functions is different.
What I think happens is that when the code was in the binary, with `codegen-units=1` the compiler was able to inline the function better and chose a different codegen, which was faster. With `codegen-units=16` it didn't do that. For the benchmark, it simply calls the functions (it's a library in this case, so it didn't fully inline them), and since they have different assembly, they have different performance.
What I don't understand, though, is why the code is the same on godbolt but different on my PC (using the same compiler version and hopefully the same flags). If I put `#[inline(never)]` before the closure in `scan`, both versions behave the same; maybe the fact that the closure references the variable somehow affects inlining and further codegen.
This is quite unexpected since more codegen units often lead to worse assembly: #47745
> What I don't understand though is why is the code the same on godbolt

Godbolt is using the `--emit=asm` option, which AFAIK forces 1 codegen unit.
Yes, I think you're right; I also saw a forum post a while ago where someone was surprised that with `--emit=asm` the code suddenly got faster. Yet even with `codegen-units=1` the code between the two versions differs locally when the code is compiled as a library (and also differs from godbolt).
I created a workspace setup where the binary depends on the library (not sure what kind of linkage actually happens there, but it's the default one for workspaces). The results are similar to `cargo bench`: `codegen-units=16` provides better performance for the "good" function than `codegen-units=1`. But, thanks to @mati865 and the issue he mentioned, the `lto=true` setting provides even better performance in the good + `codegen-units=1` case (than c.u.=16), and "bad" has the same optimal performance as "good". Unfortunately it doesn't work well in the `cargo bench` case, which still suffers with `codegen-units=1`, but it's good to know that there is a workaround here. I should try different linkage types later.
Triage: There was a "maybe we should close this issue?" discussion, and I think we should go ahead and close this now. Especially #68632 (comment) indicates we can close this. Keeping it open is unlikely to lead to anything.