rust-lang/rust

Slower performance caused only by using LTO

Opened this issue · 15 comments

The Computer Language Benchmarks Game came up on the Rust subreddit recently, and while checking the numbers for Rust I noticed that the Rust solution for the fasta benchmark is much slower than the C version, even though the two work fairly similarly (the multithreading in the C version is based on the Rust version). It turned out that the Rust benchmarks are compiled with LTO by default, and when I tested the code on my machine without LTO (both stable and nightly), it was almost as fast as the C version. I tried to find an existing issue, but most of them are about slow compilation, not slow runtime.

The interesting thing is that the CPU monitor graph clearly shows all CPU cores at only 70% usage during the last part of the benchmark (as if a mutex were locked for too long). I also checked the binary size: it went down from 4.4 MB to 3.1 MB with LTO.

EDIT: I also tested it with the Mutex from parking_lot. It's still slow with LTO, but without LTO it's a tiny bit faster than the C version.

Could this be related to thinlto + multiple codegen units?

Please try with -C codegen-units=1.

With -C codegen-units=1 the LTO version is the fastest.

Yep, this is a known issue with multiple codegen units on release builds, which were enabled in the latest release: #47745
At the moment the solution to this, as you saw, is setting the codegen units to 1 (either via rustc args or Cargo.toml).
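
For reference, the Cargo.toml variant of that workaround would look roughly like the fragment below. This is a minimal sketch: the `lto = "fat"` line is an assumption added only to match the setup discussed in this issue, and the rest of the manifest is omitted.

```toml
# Release profile sketch: keep fat LTO but use a single codegen unit.
[profile.release]
lto = "fat"        # assumption: matches the fat-LTO setup from this issue
codegen-units = 1  # the workaround suggested above
```

The rustc-args variant is the `-C codegen-units=1` flag mentioned earlier in the thread.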

Closing as a duplicate.

I saw that issue but I thought that ThinLTO works differently than regular LTO, so I opened this one.

Compiling with cargo +nightly rustc --release -- -Clto=thin and with cargo +nightly build --release produces results similar to the C version.

Compiling with cargo +nightly rustc --release -- -Clto=fat gives the slow numbers (as seen in the actual benchmark). So this is not caused by ThinLTO.

Another example with a pretty significant slowdown:

extern crate rand;

use rand::{Rng,SeedableRng,XorShiftRng};

fn main() {
    let mut rng:XorShiftRng = SeedableRng::from_seed([1,2,3,4]);
    let mut m:f64 = 0.0;
    for _ in 0..1_000_000_000 {
        let x = rng.gen();
        if x > m { m = x; }
    }
    println!("{}", m);
}

Without lto: 3.065 seconds
With lto: 5.116 seconds
With lto and codegen-units=1: 3.106 seconds
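
In case it helps anyone who wants to time the same loop without pulling in an old rand release (as far as I can tell, `rand::XorShiftRng` is no longer part of current `rand` versions), here is a dependency-free sketch. The `XorShift128` type and `max_of_samples` function are hypothetical names made up for this example, and the generator is a hand-written xorshift128, so the numbers will not match rand's output. Since everything lives in one crate here, it may also not show the same slowdown as the original, which involves code inlined from another crate.

```rust
use std::time::Instant;

/// Hand-written xorshift128 generator (hypothetical stand-in for rand's XorShiftRng).
struct XorShift128 {
    x: u32,
    y: u32,
    z: u32,
    w: u32,
}

impl XorShift128 {
    fn from_seed(seed: [u32; 4]) -> Self {
        // An all-zero state would make xorshift emit zeros forever.
        assert!(seed.iter().any(|&s| s != 0), "seed must not be all zeros");
        XorShift128 { x: seed[0], y: seed[1], z: seed[2], w: seed[3] }
    }

    fn next_u32(&mut self) -> u32 {
        let t = self.x ^ (self.x << 11);
        self.x = self.y;
        self.y = self.z;
        self.z = self.w;
        self.w = self.w ^ (self.w >> 19) ^ (t ^ (t >> 8));
        self.w
    }

    /// Uniform f64 in [0, 1), built from the top 53 bits of two u32 draws.
    fn next_f64(&mut self) -> f64 {
        let hi = (self.next_u32() as u64) << 32;
        let lo = self.next_u32() as u64;
        ((hi | lo) >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Same loop as the example above: track the maximum of `iterations` samples.
fn max_of_samples(seed: [u32; 4], iterations: u64) -> f64 {
    let mut rng = XorShift128::from_seed(seed);
    let mut m = 0.0_f64;
    for _ in 0..iterations {
        let x = rng.next_f64();
        if x > m {
            m = x;
        }
    }
    m
}

fn main() {
    // Iteration count comes from the first CLI argument; the default is kept
    // small so an unoptimized build still finishes quickly.
    let iterations: u64 = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(10_000_000);

    let start = Instant::now();
    let m = max_of_samples([1, 2, 3, 4], iterations);
    println!("max = {}, iterations = {}, elapsed = {:?}", m, iterations, start.elapsed());
}
```

Passing 1000000000 as the argument matches the iteration count of the original example. The three configurations can then be compared by building with no LTO flag, with -Clto=fat, and with -Clto=fat plus -C codegen-units=1.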

@nagisa I noticed that you added this issue to the list in #47745, but as I showed in my previous comment, I didn't have a problem with ThinLTO, only with regular/fat LTO. If the underlying issue is really the same, can you note that fat LTO can also cause a slowdown with multiple codegen units? And if it's a different issue, can you please reopen this issue (or find the appropriate one)? Thanks!

Removed it.

I agree that this bug should be reopened. My performance regression is also only with traditional "fat" LTO.

Why are multiple codegen units being used by default anyway? I thought they were only going to be used with thin LTO.

EDIT:
Ah, I didn't realize that thin LTO was also enabled by default. I suppose if thin LTO always results in run-time performance as good as fat LTO, then this is fine. But if fat LTO will still be useful in the future, maybe it should force codegen-units=1.

FWIW #47866 was another issue where multiple codegen units + fat lto produced worse code.

@ollie27 I didn't find that issue since it was already closed when I opened this one, but it is probably the same issue. The conclusion in #47866 was that it's just how fat LTO works. Unfortunately, I'm not the one who compiles the code in my case; the best I can do is convince the maintainers of the Benchmarks Game to compile with codegen-units=1.

I won't close this issue yet since @robsmith11's code is pretty small, so it might be good for further investigation.

As I mentioned in my previous comment, I think the solution to this is that fat LTO should default to codegen-units=1, not 16 or whatever the new default is. Fat LTO isn't designed for good run-time performance with multiple codegen-units, only thin LTO is.

Triage; we've changed the defaults around this a bunch of times, but I'm not sure what they are today.

oech3 commented

> I think the solution to this is that fat LTO should default to codegen-units=1, not 16

There is a case where codegen-units=1 causes a performance regression: #148670

Is there any guidance on the best value for codegen-units?