rust-lang/rust

Strange performance drops with const literals in closures

idkravitz opened this issue · 25 comments

I asked about it on users.rust-lang.org and one user suggested that it may indeed be a bug in the optimizer. So I should state that the things below are reproducible with both the latest nightly and stable Rust on the x86_64-pc-windows-msvc triple and on x86_64-unknown-linux-gnu (tested inside WSL).

So our test subject is the following snippet, which solves the maximum subarray problem:

pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32)
{
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > 0 {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    
    (left_idx, right_idx + 1, sum)
}
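
A quick sanity check of what this computes (a hypothetical usage sketch, assuming the function above is in scope; not part of the benchmark):

fn main() {
    // Classic maximum-subarray example: the best slice is [4, -1, 2, 1],
    // i.e. the half-open range 3..7, with sum 6.
    let data = [-2, 1, -3, 4, -1, 2, 1, -5, 4];
    assert_eq!(max_subarray_bad(&data), (3, 7, 6));
}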

If we benchmark it with the criterion crate using this benchmark code:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_linear(c: &mut Criterion) {
    const N: usize = 1000000;
    c.bench_function(&format!("max_subarray([..]) N = {}", N), |b| {
        b.iter(|| max_subarray::max_subarray_bad(black_box(&vec![0; N])))
    });
}

criterion_group!(benches, benchmark_linear);
criterion_main!(benches);
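
For completeness, this benchmark assumes roughly the following Cargo.toml wiring (a sketch; the criterion version is a guess, and the bench name is taken from the output below):

[dev-dependencies]
criterion = "0.3"

[[bench]]
name = "scratch"
harness = false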

Then the output of cargo bench on my machine would be

Benchmarking max_subarray([..]) N = 1000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.6s or reduce sample count to 30.
max_subarray([..]) N = 1000000
                        time:   [3.2324 ms 3.2700 ms 3.3141 ms]
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

     Running target\release\deps\scratch-b68a42551ab01289.exe

But the slight change of moving the 0 from the expression s.1 > 0 into a let binding outside of the closure makes a great difference. The function is now this:

pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32)
{
    let zro = 0;
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > zro {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    
    (left_idx, right_idx + 1, sum)
}

But the cargo bench output indicates an almost 20% performance gain!

Benchmarking max_subarray([..]) N = 1000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.9s or reduce sample count to 40.
max_subarray([..]) N = 1000000
                        time:   [2.5705 ms 2.5806 ms 2.5913 ms]
                        change: [-20.260% -19.668% -19.124%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

     Running target\release\deps\scratch-b68a42551ab01289.exe

You can check that switching the function back and forth between 0 and zro in that expression indeed results in a 20% performance change.

By the way, changing let zro = 0 into const zro: i32 = 0 results in a performance drop too.

It looks like a bug in the optimizer to me. Could someone verify it?

OT hint - if you append triple backticks with a word rust like ```rust, github will highlight syntax 😉

OT hint - if you append triple backticks with a word rust like ```rust, github will highlight syntax

Thanks, I was wondering what's wrong with the syntax highlighting 👍

I also just checked it inside WSL with stable and nightly x86_64-unknown-linux-gnu, and the result was a 33% performance difference on my machine.

It's not the optimizer: https://godbolt.org/z/LcLYaw

You can try disabling incremental compilation and reducing codegen-units to 1 to see if you can still reproduce it.
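
Concretely, that would be something like this in Cargo.toml (a sketch; adjust to whichever profile the benchmark actually builds with):

[profile.bench]
incremental=false
codegen-units=1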

How weird indeed: if I disable incremental and set codegen-units to 1, both versions run at the slowest time, but if I leave the values at their defaults, the one with let runs 20% faster.
In the case of incremental=false, codegen-units=1:

Benchmarking max_subarray([..]) N = 1000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.7s or reduce sample count to 30.
max_subarray([..]) N = 1000000
                        time:   [3.3078 ms 3.3099 ms 3.3130 ms]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

Benchmarking max_subarray([..]) N = 1000000 #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.6s or reduce sample count to 30.
max_subarray([..]) N = 1000000 #2
                        time:   [3.2961 ms 3.2976 ms 3.2992 ms]
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

If I comment out the profile settings in Cargo.toml, the output is:

Benchmarking max_subarray([..]) N = 1000000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 17.1s or reduce sample count to 30.
max_subarray([..]) N = 1000000
                        time:   [3.3745 ms 3.3775 ms 3.3808 ms]
                        change: [+1.8443% +2.0076% +2.1392%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

Benchmarking max_subarray([..]) N = 1000000 #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 13.4s or reduce sample count to 40.
max_subarray([..]) N = 1000000 #2
                        time:   [2.6242 ms 2.6262 ms 2.6286 ms]
                        change: [-20.425% -20.281% -20.087%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

So the weird thing is that I found that codegen-units=16 actually produces more efficient code for the bench than codegen-units=1 (for the version with the let binding).

BTW, rm -rf ./target doesn't change anything.
I made a setup crate for this issue, you can check it here

P.S. Or it may be a bug in criterion; I didn't double-check the runtime with a clock :)

It also might not be a bug at all; benchmarking such things is incredibly tricky, and you might get hit by code alignment issues or just sheer bad luck. When I benchmark the first function that you posted multiple times in a row, I sometimes get around 10% differences (by running the SAME code). And as @mati865 posted, it seems to generate exactly the same assembly, as it should.

Really the only way to check what happens is for you to disassemble the two binaries produced on your system and check that the assembly is indeed the same. If it is, it's not a rustc issue :-)

Try compiling with debug symbols and then disassemble the binaries, for example with objdump -S -l, to see if the assembly instructions match between the two versions.
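
For example (hypothetical binary names; substitute whatever the two builds are actually called on your system):

$ objdump -S -l target/release/with_const > with_const.asm
$ objdump -S -l target/release/with_let > with_let.asm
$ diff with_const.asm with_let.asm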

Btw, the exact same issue happens if I throw away criterion and use the built-in test::Bencher (by "same" I mean it responds to the codegen-units setting):

running 2 tests
test tests::benchmark_bad  ... bench:   3,293,805 ns/iter (+/- 17,375)
test tests::benchmark_good ... bench:   3,309,870 ns/iter (+/- 46,277)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
C:\Users\idkra\dev\rust\issue-68632>cargo bench                                                                         
Compiling issue-68632 v0.1.0 (C:\Users\idkra\dev\rust\issue-68632)
    Finished bench [optimized] target(s) in 0.51s
     Running target\release\deps\issue_68632-80a8ccd9df83eb66.exe

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target\release\deps\bench_builtin-30e90d201571b52f.exe

running 2 tests
test tests::benchmark_bad  ... bench:   3,375,950 ns/iter (+/- 32,356)
test tests::benchmark_good ... bench:   2,596,490 ns/iter (+/- 14,938)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
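
For reference, the built-in harness that produced the numbers above looks roughly like this (a sketch; it needs nightly, and the good/bad function names are assumptions matching the output):

#![feature(test)]
extern crate test;

#[cfg(test)]
mod tests {
    use test::{black_box, Bencher};

    const N: usize = 1000000;

    // "bad" = the version with the literal 0, "good" = the version with the let binding
    #[bench]
    fn benchmark_bad(b: &mut Bencher) {
        b.iter(|| crate::max_subarray_bad(black_box(&vec![0; N])));
    }

    #[bench]
    fn benchmark_good(b: &mut Bencher) {
        b.iter(|| crate::max_subarray_good(black_box(&vec![0; N])));
    }
}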

Try compiling with debug symbols and then disassemble the binaries, for example with objdump -S -l, to see if the assembly instructions match between the two versions.

Well, it's past midnight in my TZ, so I'll dig into it tomorrow.

P.S.
I have tested this issue on two machines running Windows 10 and both reproduced it.

So as a last thing today, I made a simple binary crate with the following main.rs:

pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32)
{
    let zro: i32 = 0; // substitute let with const for slow version and vice versa
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > zro {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();

    (left_idx, right_idx + 1, sum)
}

fn main() {
    const N: usize = 1000000;
    let v = vec![0; N];
    const LOOPS: usize = 1000;
    let mut total_sum = 0;
    for _ in 0..LOOPS {
        let (_, _, sum) = max_subarray_bad(&v);
        total_sum += sum;
    }
    println!("{}", total_sum);
}

Then I compiled the two versions, with and without the let binding, and ran time ./my_binary a few times. Here are my results for the variant without let:

$ time ../target/release/stupid-way
0

real    0m2.198s
user    0m2.188s
sys     0m0.000s
$ time ../target/release/stupid-way
0

real    0m2.195s
user    0m2.203s
sys     0m0.016s
$ time ../target/release/stupid-way
0

real    0m2.192s
user    0m2.188s
sys     0m0.000s
$ time ../target/release/stupid-way
0

real    0m2.188s
user    0m2.172s
sys     0m0.031s
$ time ../target/release/stupid-way
0

real    0m2.191s
user    0m2.172s
sys     0m0.000s

And for the variant with let:

$ time ../target/release/stupid-way
0

real    0m1.409s
user    0m1.406s
sys     0m0.000s
$ time ../target/release/stupid-way
0

real    0m1.406s
user    0m1.391s
sys     0m0.016s
$ time ../target/release/stupid-way
0

real    0m1.404s
user    0m1.391s
sys     0m0.016s
$ time ../target/release/stupid-way
0

real    0m1.404s
user    0m1.391s
sys     0m0.016s
$ time ../target/release/stupid-way
0

real    0m1.405s
user    0m1.391s
sys     0m0.000s

As you can see, the difference is significant and reproducible all the way.

Good news: I found that with debug=true, opt-level=3 and overflow-checks=false the issue is reproducible in my last code. If overflow-checks=true the issue isn't reproducible. So I'm glad I'll be able to disassemble with debug info tomorrow.

Edit: I was shocked that codegen-units affects it.
For codegen-units=16

real    0m2.803s
user    0m2.797s
sys     0m0.000s

For codegen-units=1

real    0m20.675s
user    0m20.656s
sys     0m0.016s

A 10x performance drop!

I tried it too on my Linux notebook. If I use codegen-units=1, both binaries are completely identical (incremental compilation doesn't seem to make a difference). With the default setting there is indeed a difference in the compiled assembly.

With the constant it tests directly against zero; with the variable it compares the value with a register that holds the variable (this alone is unlikely to cause the difference, but it probably triggered different codegen for something else).

However, since it produces equivalent code with codegen-units=1, I'm not sure if this is an actual issue.

@Kobzol try this snippet of main.rs:

pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32)
{
    let zro: i32 = 0;
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > zro {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();

    (left_idx, right_idx + 1, sum)
}



fn main() {
    const N: usize = 1000000;
    let v = vec![0; N];
    const LOOPS: usize = 2000;
    let mut total_sum = 0;
    for _ in 0..LOOPS {
        let (_, _, sum) = max_subarray_bad(&v);
        total_sum += sum;
    }
    println!("{}", total_sum);
}

With:

[profile.dev]
opt-level=3
debug=true
overflow-checks=false
codegen-units=1

and then set

codegen-units=16

Run the binary with the time command; you'll see how drastically the execution time changes without editing the source code. That's clearly a bug.

Well yes, when you change the number of codegen units, you are affecting the optimizer (more units = more parallelization, but possibly worse code). So if you increase the unit count (or use the default), the generated code can be slower; that is to be expected. For maximum performance you should use codegen-units=1 anyway.

@Kobzol the codegen-units=16 build results in code that runs in 2-3 seconds, while the codegen-units=1 build results in code that runs in 20+ seconds. How can this be maximum performance?

On my system I get the following numbers (in release):

Variant  Codegen units  Time
let      1              0.9 s
let      16             0.9 s
const    1              0.9 s
const    16             1.4 s

Ok, so what about

[profile.dev]
opt-level=3
debug=true
overflow-checks=false
codegen-units=16

versus

[profile.dev]
opt-level=3
debug=true
overflow-checks=false
codegen-units=1

Or can no performance guarantees be made with debug=true?

debug=true AFAIK generates line numbers + runtime debug information (frame pointers etc.), which shouldn't have a large effect here. To bring dev to the same parameters as release, you'd also have to set debug-assertions to false and incremental to false (this seems to have the largest effect here for some reason). But then I don't see any point in using the dev profile.
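
That would look roughly like this (a sketch of a dev profile pushed to release-like settings):

[profile.dev]
opt-level=3
debug=true
debug-assertions=false
overflow-checks=false
incremental=false
codegen-units=1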

If you just want to get debug information in release, use

[profile.release]
debug=true
codegen-units=1

Can you fill in the table I posted and send the results from your computer in release?

Can you fill in the table I posted and send the results from your computer in release?

I can, but there's no point in the effort; I just checked, and my results have the same ratios as yours, just a different magnitude.

So you're right, it's not a bug at all, it's just me as a newbie not knowing about codegen-units=1. Well, I'm sorry I took your time investigating things that work as designed. So should I close this issue myself?

I would leave it open for now - I'm not on the rustc team; I just wanted to find out what causes the difference so that we have a better idea of the real problem. Maybe some of the team members see it differently and indeed consider this a bug :)

Ha, in any case thank you, I learned some new things about Rust, which is good, at least for me :) Well, I'll agree with you and let the rustc team judge what to do with this issue further.

Today I realized that I got sidetracked yesterday and missed a few important things. I returned to the setup repo for this issue, and some things still bother me.

cargo bench gives worse results with codegen-units=1 than with codegen-units=16.
Namely, for codegen-units=1 on my machine:

test tests::benchmark_good ... bench:   3,291,420 ns/iter (+/- 19,735)

And for codegen-units=16 on my machine:

test tests::benchmark_good ... bench:   2,616,990 ns/iter (+/- 47,830)

But things start to work as intended if I isolate them in a release binary. So the question for today is: either bench isn't as performance-reliable as intended, or these things can happen in a best-optimized release too.

P.S. @Kobzol what do you think about the inverse performance dependency on codegen-units for bench?

So on my PC codegen units don't really affect anything for the bench here; however, the good case is consistently about 30% faster (I tried different code alignments and changed the order, it still stays the same). And indeed the assembly of the two functions is different.

What I think happens is that when the code was in the binary, with codegen-units=1 the compiler was able to inline the function better and chose a different codegen, which was faster. With codegen-units=16 it didn't do that. For the benchmark, it simply calls the functions (it's a library in this case, so it didn't fully inline them), and since they have different assembly, they have different performance.

What I don't understand though is why the code is the same on godbolt but different on my PC (using the same compiler version and hopefully the same flags). If I put #[inline(never)] before the closure in scan, both versions behave the same; maybe the fact that the closure references the variable somehow affects inlining and further codegen.

This is quite unexpected since more codegen units often lead to worse assembly: #47745

What I don't understand though is why the code is the same on godbolt

Godbolt is using the --emit=asm option, which AFAIK forces 1 codegen unit.

Yes, I think you're right; I also saw a forum post a while ago where someone was surprised that with --emit=asm the code suddenly got faster. Yet even with codegen-units=1 the code between the two versions is different locally when it's compiled as a library (and also different from godbolt).

I created a workspace setup where a binary depends on a library (not sure what kind of linkage actually happens there, but it's the default one for workspaces). The results are similar to cargo bench: codegen-units=16 gives better performance for the "good" function than codegen-units=1. But, thanks to @mati865 and the issue he mentioned, the lto=true setting provides even better performance in the good + codegen-units=1 case (than codegen-units=16), and "bad" has the same optimal performance as "good". Unfortunately it doesn't work well in the cargo bench case, where performance still suffers with codegen-units=1, but it's good to know that there is a workaround here. I should try different linkage types later.
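
For reference, the workaround amounts to something like this (a sketch; put it in whichever profile the workspace actually builds with):

[profile.release]
lto=true
codegen-units=1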

Triage: There was a "maybe we should close this issue?" discussion, and I think we should go ahead and close this now. Especially #68632 (comment) indicates we can close this. Keeping it open is unlikely to lead to anything.