Evaluate using Profile-Guided Optimization (PGO) for the compiler

Question

Evaluate using Profile-Guided Optimization (PGO) for the compiler

Opened this issue 4 months ago · 2 comments

Is your feature request related to a problem? Please describe.
Not a problem - just an idea of how the compiler's performance can be improved.

Describe alternatives you've considered
Leave the things as is.

Describe the solution you'd like
I checked Profile-Guided Optimization (PGO) on many projects (including compilers) - all the results are available at https://github.com/zamazan4ik/awesome-pgo . Since this, I decided to test PGO with the Bend compiler. Below are the numbers.

Test environment

Fedora 40
Linux kernel 6.8.10
AMD Ryzen 9 5900x
48 Gib RAM
SSD Samsung 980 Pro 2 Tib
Compiler - Rustc 1.77.0-nightly
Bend version: the latest for now from the main branch on commit 0b6c3350a4e556df665ef146ac6d04e6b73bbae9
Disabled Turbo boost

Benchmark

I decided to perform PGO benchmarks on a simple scenario - bend check sorter.bend (sorter.bend took from here). For PGO optimization I use cargo-pgo tool. Release build is done with cargo build --release, PGO instrumented with cargo pgo build, PGO optimized - cargo pgo optimize build, BOLT instrumented (with PGO optimization) - cargo pgo bolt build --with-pgo, BOLT optimized - cargo pgo bolt optimize --with-pgo. The training workload is the same for PGO and PLO - bend check sorter.bend.

taskset -c 0 is used for reducing the OS scheduler's influence on the results during the benchmarks. All measurements are done on the same machine, with the same background "noise" (as much as I can guarantee).

Results

I got the following results:

hyperfine --warmup 200 --min-runs 1000 -i -N 'taskset -c 0 ./bend_release check sorter.bend' 'taskset -c 0 ./bend_optimized check sorter.bend' 'taskset -c 0 ./bend_pgo_and_bolt_optimized check sorter.bend'
Benchmark 1: taskset -c 0 ./bend_release check sorter.bend
  Time (mean ± σ):       4.4 ms ±   0.1 ms    [User: 3.0 ms, System: 1.3 ms]
  Range (min … max):     4.3 ms …   4.7 ms    1000 runs

Benchmark 2: taskset -c 0 ./bend_optimized check sorter.bend
  Time (mean ± σ):       3.8 ms ±   0.1 ms    [User: 2.5 ms, System: 1.3 ms]
  Range (min … max):     3.7 ms …   4.1 ms    1000 runs

Benchmark 3: taskset -c 0 ./bend_pgo_and_bolt_optimized check sorter.bend
  Time (mean ± σ):       3.8 ms ±   0.1 ms    [User: 2.4 ms, System: 1.2 ms]
  Range (min … max):     3.7 ms …   4.1 ms    1000 runs

Summary
  taskset -c 0 ./bend_pgo_and_bolt_optimized check sorter.bend ran
    1.01 ± 0.02 times faster than taskset -c 0 ./bend_optimized check sorter.bend
    1.18 ± 0.03 times faster than taskset -c 0 ./bend_release check sorter.bend

where:

bend_release - Release + LTO build
bend_optimized - Release + LTO + PGO optimized
bend_pgo_and_bolt_optimized - Release + LTO + PGO optimized + BOLT optimized

According to the results, PGO measurably improves the compiler's performance at least in the simple benchmark above. However, PLO (with LLVM BOLT) doesn't bring measurable improvements at least in this scenario.

For reference, I also measured the slowdown from PGO and PLO instrumentation:

hyperfine --warmup 100 --min-runs 500 -i -N 'taskset -c 0 ./bend_release check sorter.bend' 'taskset -c 0 ./bend_instrumented check sorter.bend' 'taskset -c 0 ./bend_optimized_bolt_instrumented check sorter.bend'
Benchmark 1: taskset -c 0 ./bend_release check sorter.bend
  Time (mean ± σ):       4.4 ms ±   0.1 ms    [User: 3.1 ms, System: 1.2 ms]
  Range (min … max):     4.3 ms …   5.3 ms    676 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./bend_instrumented check sorter.bend
  Time (mean ± σ):       6.5 ms ±   0.1 ms    [User: 4.2 ms, System: 2.1 ms]
  Range (min … max):     6.3 ms …   6.8 ms    500 runs

Benchmark 3: taskset -c 0 ./bend_optimized_bolt_instrumented check sorter.bend
  Time (mean ± σ):     157.7 ms ±   5.1 ms    [User: 30.4 ms, System: 108.6 ms]
  Range (min … max):   147.7 ms … 259.6 ms    500 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./bend_release check sorter.bend ran
    1.47 ± 0.03 times faster than taskset -c 0 ./bend_instrumented check sorter.bend
   35.57 ± 1.30 times faster than taskset -c 0 ./bend_optimized_bolt_instrumented check sorter.bend

where:

bend_release - Release + LTO build
bend_instrumented - Release + LTO + PGO instrumented
bend_optimized_bolt_instrumented - Release + LTO + PGO optimized + BOLT instrumented

Also, I measured binary size changes in different compilation modes (without stripping):

Release + LTO: 2.6 Mib
Release + LTO + PGO instrumented: 8.2 Mib
Release + LTO + PGO optimized: 4.6 Mib
Release + LTO + PGO optimized + BOLT instrumented: 20 Mib
Release + LTO + PGO optimized + BOLT optimized: 8.6 Mib

Further steps

I can suggest the following action points:

Perform more PGO benchmarks with other datasets (if you are interested enough in it). If it shows improvements - add a note to the documentation (the README file, I guess) about possible improvements in the compiler's performance with PGO.
Probably, you can try to get some insights about how the code can be optimized further based on the changes that the compiler performed with PGO. It can be done via analyzing flamegraphs before and after applying PGO to understand the difference.
More intensive testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT as an addition to PGO). However, I recommend starting from the usual PGO since it's a much more stable technology with much fewer limitations.

I would be happy to answer your questions about PGO and PLO.

By the way, it can be an interesting idea to evaluate PGO optimization applicability to the Bend program itself, not just for the compiler and similar tools (formatters, etc.).

P.S. Please do not treat the issue like a bug or something like that - it's just an improvement idea. Since the "Discussions" functionality is disabled in this repo, I created the Issue instead.

Answer 1 · 2024-05-27T16:53:07.000Z

I had no idea that this was a builtin option in the Rust compiler, very cool.

Some thoughts about this:

Adding PGO to the Bend build looks very promissing, probably will be done at some point.
Bend goes out of the way to have as few dependencies as possible, especially at its core. So I doubt link-time optimizations will improve too much.
The Bend compiler is NOT optimized, almost at all. Our focus so far has been on correctness and simplicity. Before investigating compile optimizations, we'll have to profile and optimize the logic of Bend itself, which I imagine will bring much greater improvements.
I'm much more interested in the performance this can bring to the runtime. We should try doing this to the C runtime. I imagine that for CUDA, PGO is not really a thing?

Answer 2 · 2024-05-27T17:11:48.000Z

Bend goes out of the way to have as few dependencies as possible, especially at its core. So I doubt link-time optimizations will improve too much.

Probably - at least it should improve the binary size. This optimization is already enabled (so I changed nothing for this in the build scripts).

The Bend compiler is NOT optimized, almost at all. Our focus so far has been on correctness and simplicity. Before investigating compile optimizations, we'll have to profile and optimize the logic of Bend itself, which I imagine will bring much greater improvements.

Yep. However, even after performing manual optimizations with a higher probability PGO still will be useful. All modern "mature" compilers like Clang, Rustc, GCC, and many others still have large performance benefits from enabling PGO for them.

I'm much more interested in the performance this can bring to the runtime. We should try doing this to the C runtime. I imagine that for CUDA, PGO is not really a thing?

It depends. If we talk about GPU code - yes, PGO is not a thing yet for it. If we talk about the CPU part - PGO should work as for regular non-CUDA code.