evilsocket/legba

Evaluate using LTO, Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) like LLVM BOLT

zamazan4ik opened this issue · 1 comments

Hi!

Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are here. According to my tests, PGO helps achieve better performance in many application domains, including network-oriented software (e.g. see the results for Envoy, HAProxy, httpd). Because of this, I decided to test PGO on Legba. Here are my results.

Test environment

  • Fedora 38
  • Linux kernel 6.5.6
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler - Rustc 1.73
  • Legba version: the latest for now from the main branch on commit 5f0739a974f4ad92c254ddfe37aca033b40600e6
  • Disabled Turbo boost

Benchmark

For benchmarking, I use the "HTTP basic auth" scenario from the test_server directory with the following command line: legba http.basic -t 127.0.0.1:8888 --username admin666 --password ./passwords_1m.txt --concurrency 1. The --concurrency 1 option is used only to reduce the influence of multithreading jitter on the results. As the passwords_1m.txt file I use this, with the test12345 password moved to the end of the file.

For the PGO training phase, I use exactly the same command but with a smaller password file (1050 passwords + test12345 at the end), just to speed up the training phase.
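As a rough illustration of how the two password files could be prepared, here is a minimal sketch. The helper name make_wordlist and the file names in the usage note are hypothetical; the issue does not say how the files were actually produced.

```shell
# Hypothetical helper: write a wordlist in which the target password
# appears exactly once, as the last line.
#   $1 = source wordlist
#   $2 = target password (e.g. test12345)
#   $3 = output file
#   $4 = optional: keep only the first N other passwords
make_wordlist() {
  src=$1
  target=$2
  out=$3
  limit=${4:-0}

  if [ "$limit" -gt 0 ]; then
    # Drop exact matches of the target, then keep the first N lines.
    grep -vxF -- "$target" "$src" | head -n "$limit" > "$out"
  else
    # Drop exact matches of the target, keep everything else.
    grep -vxF -- "$target" "$src" > "$out"
  fi

  # Append the target so it is the last password tried.
  printf '%s\n' "$target" >> "$out"
}
```

For example, make_wordlist source.txt test12345 passwords_1m.txt would produce the benchmark list, and make_wordlist source.txt test12345 passwords_train.txt 1050 a small training list, where source.txt and passwords_train.txt are placeholder names.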

I tested the following Legba configurations:

  • Release build: cargo build --release
  • Release + lto = true + codegen-units = 1 (enable LTO): Apply LTO changes to Cargo.toml and then cargo build --release
  • Release + lto = true + codegen-units = 1 + PGO: cargo pgo build + cargo pgo optimize build. It's done with cargo-pgo.
  • Release + lto = true + codegen-units = 1 + PGO + BOLT: Also via cargo-pgo
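The LTO and PGO configurations above can be sketched roughly as follows. This is only an outline under several assumptions: cargo-pgo is installed (cargo install cargo-pgo) along with the llvm-tools-preview rustup component, the test server is already listening on 127.0.0.1:8888, the binary lands under target/<host triple>/release/, and the LTO settings are passed through Cargo's profile environment variables rather than by editing Cargo.toml as was done in the issue. The function names are hypothetical.

```shell
# Sketch only: the builds are wrapped in functions, so nothing runs until
# one of them is called from the root of a Legba checkout.

# Release + LTO. Cargo reads profile overrides from
# CARGO_PROFILE_<NAME>_<KEY> environment variables.
build_lto() {
  CARGO_PROFILE_RELEASE_LTO=true \
  CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
  cargo build --release
}

# Release + LTO + PGO via cargo-pgo.
#   $1 = small training wordlist
# Runs in a subshell so the exported variables do not leak to the caller.
build_pgo() (
  training_list=$1
  export CARGO_PROFILE_RELEASE_LTO=true
  export CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1

  # Host triple, e.g. x86_64-unknown-linux-gnu (assumed output location).
  host=$(rustc -vV | sed -n 's/^host: //p')

  # 1. Build the instrumented binary.
  cargo pgo build || exit 1

  # 2. Run the training workload to collect profiles.
  "./target/$host/release/legba" http.basic -t 127.0.0.1:8888 \
    --username admin666 --password "$training_list" --concurrency 1

  # 3. Rebuild using the collected profiles.
  cargo pgo optimize
)
```

Calling build_pgo ./passwords_train.txt (a placeholder file name) would then run the three PGO steps in order.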

All benchmarks are done multiple times, on the same machine (with the same hardware/software configuration), with the same background noise (as much as I can guarantee ofc).
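Repeating a run and recording wall-clock time could look like the sketch below. The helper time_runs is hypothetical and not the tooling used for the numbers in this issue; it assumes GNU date, which supports the %N (nanoseconds) format.

```shell
# Hypothetical helper: run a command N times and print the elapsed
# wall-clock time of each run in milliseconds.
#   $1 = number of runs, remaining arguments = command to time
time_runs() {
  runs=$1
  shift
  i=1
  while [ "$i" -le "$runs" ]; do
    start=$(date +%s%N)        # nanoseconds since the epoch (GNU date)
    "$@" >/dev/null 2>&1       # discard the command's own output
    end=$(date +%s%N)
    echo "run $i: $(( (end - start) / 1000000 )) ms"
    i=$((i + 1))
  done
}
```

For example, time_runs 5 followed by the legba command line from the benchmark section would print five timings to compare.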

Results

I got the following results:

  • Release: 276s
  • Release + lto = true + codegen-units = 1: 262s
  • Release + lto = true + codegen-units = 1 + PGO optimized: 247s
  • Release + lto = true + codegen-units = 1 + PGO optimized + BOLT optimized: 247s

At least in the benchmark above, LTO and PGO help achieve better performance in Legba. However, it seems that LLVM BOLT has no measurable effect in this benchmark.
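To put the timings above in relative terms, the improvement over the plain Release build can be computed as (baseline - new) / baseline. The helper below is a small illustrative sketch; its name is hypothetical.

```shell
# Hypothetical helper: percentage reduction in run time relative to a
# baseline, rounded to one decimal place.
#   $1 = baseline seconds, $2 = new seconds
speedup_pct() {
  awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

speedup_pct 276 262   # Release vs Release + LTO        -> 5.1
speedup_pct 276 247   # Release vs Release + LTO + PGO  -> 10.5
```

So in this scenario LTO alone saves about 5% of the run time, and LTO plus PGO about 10.5%.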

For reference, here are the results for the smaller file with 1051 passwords, so you can estimate how much slower the PGO-instrumented Legba is compared to other configurations:

  • Release: 273ms
  • Release + LTO: 261ms
  • Release + LTO + PGO instrumented: 311ms
  • Release + LTO + PGO optimized + BOLT instrumented: 300ms

Here are binary sizes after the strip command:

  • Release: 21 MiB
  • Release + LTO: 17 MiB
  • Release + LTO + PGO instrumented: 53 MiB
  • Release + LTO + PGO optimized: 15 MiB
  • Release + LTO + PGO optimized + BOLT instrumented: 68 MiB
  • Release + LTO + PGO optimized + BOLT optimized: 20 MiB

Also, I measured build time changes between configurations:

  • Release: 3m 10s
  • Release + lto = true + codegen-units = 1: 6m 57s
  • Release + lto = true + codegen-units = 1 + PGO instrumented: 11m 14s
  • Release + lto = true + codegen-units = 1 + PGO optimized: 6m 40s

Further steps

I can suggest the following action points:

  • Perform more PGO benchmarks on Legba in various scenarios. If they show improvements, add a note to the documentation about the possible performance gains from building Legba with PGO.
  • Provide an easier way (e.g. a build option) to build Legba with PGO. This can be helpful for end users and maintainers, since they will be able to optimize Legba for their own workloads.

Here are some examples of how PGO optimization is integrated in other projects:

@zamazan4ik thank you for such useful insights! I have to admit I didn't know about PGO and BOLT, so I'll have to study a bit before being able to make any meaningful changes to the build system.