Evaluate using LTO, Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO)
zamazan4ik opened this issue · 4 comments
Hi!
Recently I checked how optimizations like Link-Time Optimization (LTO), Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) improve performance on multiple projects. The results are available here. According to the tests, all these optimizations can help achieve better performance in many cases for many applications. I think trying to enable them for Symbolicator can be a good idea.
I already did some benchmarks and want to share my results here. Hopefully, they will be helpful.
Test environment
- Fedora 39
- Linux kernel 6.5.12
- AMD Ryzen 9 5900x
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Compiler - Rustc 1.74
- Symbolicator version: the latest for now from the `master` branch, on commit ac127975e4649dc2f40b177cb98556e307aca26e
- Disabled Turbo boost (for better results consistency across runs)
Benchmark
For benchmarking purposes, I used this WRK-based scenario for Minidump. As a minidump, I use this Linux dump. The WRK command is the same for all benchmarks and PGO/PLO training phases:

```sh
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
```

Before each WRK benchmark, I run `cargo run -p process-event -- ../tests/fixtures/linux.dmp` once, as recommended.
All PGO and PLO optimizations are done with cargo-pgo (I highly recommend using this tool). For the PLO phase I use the LLVM BOLT tool.
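For reference, the end-to-end cargo-pgo workflow looks roughly like this (command names as documented in the cargo-pgo README; exact flags may differ between versions, and the training runs are just the WRK scenario above):

```sh
# PGO phase: build an instrumented binary, exercise it, then rebuild using the profile.
cargo pgo build                   # instrumented build
# ...start Symbolicator and run the WRK scenario above against it...
cargo pgo optimize                # PGO-optimized build from the gathered profile

# PLO phase on top of PGO: instrument with BOLT, exercise, then optimize.
cargo pgo bolt build --with-pgo   # BOLT-instrumented build of the PGO-optimized binary
# ...run the WRK scenario again...
cargo pgo bolt optimize --with-pgo
```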
LTO is enabled by the following changes to the `release` profile in the root `Cargo.toml` file:

```toml
[profile.release]
# For release builds, we do want line-only debug information to be able to symbolicate panic stack traces.
debug = 1
codegen-units = 1
lto = true
```
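As a side note on the config above: `lto = true` selects full ("fat") LTO, the most expensive variant at build time. If build times become a concern, thin LTO is a commonly used middle ground that typically recovers much of the win at a fraction of the cost:

```toml
[profile.release]
debug = 1
codegen-units = 1
lto = "thin"  # cheaper to build than fat LTO (`lto = true`), often close in performance
```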
For all benchmarks, binaries are stripped with the `strip` tool.
All tests are done on the same machine, multiple times, with the same background "noise" (as much as I can guarantee, of course); the results are reproducible, at least on my machine.
Tricky moment with PGO dumps
For some unknown reason, Symbolicator does not dump the PGO profile to disk on Ctrl+C. I guess it's somehow related to custom signal handling somewhere in the code, so I modified Symbolicator a little bit to manually dump the PGO profile to disk. As a reference implementation, I used this piece of code from YugabyteDB. I made the following changes to `main.rs`:
```rust
use signal_hook::{consts::SIGINT, iterator::Signals};
use std::thread;

extern "C" {
    // Provided by the LLVM profiling runtime in PGO-instrumented builds.
    fn __llvm_profile_write_file() -> i32;
}

fn main() {
    // Dump the PGO profile to disk on Ctrl+C, since the normal exit path
    // never gets a chance to flush it.
    let mut signals = Signals::new(&[SIGINT]).unwrap();
    thread::spawn(move || {
        for sig in signals.forever() {
            println!("Received signal {:?}", sig);
            unsafe {
                __llvm_profile_write_file();
            }
            std::process::exit(0);
        }
    });

    match cli::execute() {
        Ok(()) => std::process::exit(0),
        Err(error) => {
            logging::ensure_log_error(&error);
            std::process::exit(1);
        }
    }
}
```
I use the `signal_hook` dependency. Please note that the `__llvm_profile_write_file` symbol is linked into the program only when you build it with PGO instrumentation (this is done automatically by the Rustc compiler). Because of this, you need to disable/comment out this code during the PGO optimization phase (otherwise you get a link error).
I think there should be a better way to implement this logic, but for testing purposes it's good enough.
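One possible cleanup, as a minimal sketch: gate the hook behind a Cargo feature, so the extern symbol is only referenced in instrumented builds and the optimized build links without any source edits. The `pgo-instrument` feature name and the `install_profile_dump_hook` helper are hypothetical, not existing Symbolicator code:

```rust
// Only compiled when building with `--features pgo-instrument` (a hypothetical
// Cargo feature), i.e. together with PGO instrumentation.
#[cfg(feature = "pgo-instrument")]
pub fn install_profile_dump_hook() {
    use signal_hook::{consts::SIGINT, iterator::Signals};
    use std::thread;

    extern "C" {
        // Provided by the LLVM profiling runtime in instrumented builds.
        fn __llvm_profile_write_file() -> i32;
    }

    let mut signals = Signals::new([SIGINT]).expect("failed to register SIGINT handler");
    thread::spawn(move || {
        if signals.forever().next().is_some() {
            // Flush the PGO profile to disk before exiting.
            unsafe {
                __llvm_profile_write_file();
            }
            std::process::exit(0);
        }
    });
}

// No-op in regular builds: nothing to flush, no extra symbol to link against.
#[cfg(not(feature = "pgo-instrument"))]
pub fn install_profile_dump_hook() {}
```

`main()` would then call `install_profile_dump_hook()` unconditionally, and the instrumented build would enable the feature in its cargo invocation.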
Results
Here I post the benchmark results for the following Symbolicator configurations:
- Release build (the default build)
- Release + LTO
- Release + LTO + PGO optimized
- Release + LTO + PGO optimized + PLO optimized (with LLVM BOLT)
Release:

```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    18.60ms   10.37ms   41.83ms   57.72%
    Req/Sec    269.77    45.57    434.00     67.50%
  80627 requests in 30.02s, 470.12MB read
Requests/sec: 2685.82
Transfer/sec: 15.66MB
```
Release + LTO:

```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    16.09ms   8.95ms    35.94ms   57.91%
    Req/Sec    312.10    43.03    440.00     70.03%
  93266 requests in 30.03s, 543.81MB read
Requests/sec: 3106.16
Transfer/sec: 18.11MB
```
Release + LTO + PGO optimized:

```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.71ms   7.63ms    31.15ms   57.75%
    Req/Sec    366.18    51.08    545.00     67.53%
  109422 requests in 30.02s, 638.01MB read
Requests/sec: 3644.55
Transfer/sec: 21.25MB
```
Release + LTO + PGO optimized + PLO optimized:

```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    12.82ms   7.14ms    30.18ms   57.86%
    Req/Sec    391.64    58.28    555.00     62.60%
  117034 requests in 30.02s, 682.40MB read
Requests/sec: 3897.93
Transfer/sec: 22.73MB
```
According to the tests above, I see measurable improvements from enabling LTO, PGO and PLO with LLVM BOLT.
Additionally, below I post results for the PGO instrumentation and PLO instrumentation phases, so you can estimate the Symbolicator slowdown during instrumentation.
Release + LTO + PGO instrumentation:

```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.10ms   10.63ms   42.48ms   57.75%
    Req/Sec    262.82    44.15    454.00     68.50%
  78545 requests in 30.02s, 457.98MB read
Requests/sec: 2616.14
Transfer/sec: 15.25MB
```
Release + LTO + PGO optimized + PLO instrumentation:

```
WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.72ms   12.66ms   50.54ms   57.68%
    Req/Sec    220.82    40.15    380.00     66.73%
  66012 requests in 30.02s, 384.90MB read
Requests/sec: 2198.81
Transfer/sec: 12.82MB
```
Further steps
I can suggest the following action points:
- Perform more PGO benchmarks on Symbolicator. If it shows improvements, add a note to the documentation about possible improvements in Symbolicator performance with PGO.
- Provide an easier way (e.g. a build option) to build Symbolicator with PGO; this can be helpful for end users and maintainers, since they will be able to optimize Symbolicator according to their workloads (see the sketch after this list).
- Optimize pre-built Symbolicator binaries.
- Evaluate other Sentry products for LTO/PGO/PLO applicability.
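For the "build option" point above, one lightweight shape this could take is an opt-in Cargo profile; a sketch, with a hypothetical profile name:

```toml
# In the root Cargo.toml: an opt-in heavier profile for users who want
# maximum performance, selected with `cargo build --profile release-lto`.
[profile.release-lto]
inherits = "release"
codegen-units = 1
lto = true
```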
Here are some examples of how PGO optimization is integrated in other projects:
- Rustc: a CI script for the multi-stage build
- GCC:
- Clang: Docs
- Python:
- Go: Bash script
- V8: Bazel flag
- ChakraCore: Scripts
- Chromium: Script
- Firefox: Docs
- Thunderbird has PGO support too
- PHP: Makefile command and old Centminmod scripts
- MySQL: CMake script
- YugabyteDB: GitHub commit
- FoundationDB: Script
- Zstd: Makefile
- Foot: Scripts
- Windows Terminal: GitHub PR
- Pydantic-core: GitHub PR
- file.d: GitHub PR
- OceanBase: CMake flag
I have some examples of how PGO information looks in the documentation:
- ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
- Databend: https://databend.rs/doc/contributing/pgo
- Vector: https://vector.dev/docs/administration/tuning/pgo/
- Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/
- GCC: Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- Clang:
- tsv-utils: https://github.com/eBay/tsv-utils/blob/master/docs/BuildingWithLTO.md
Regarding LLVM BOLT integration, I have the following examples:
- Rustc:
- CPython: GitHub PR
- YDB: GitHub comment
- Clang:
- LDC: GitHub comment
- HHVM, Proxygen and others: Facebook paper
- NodeJS: Blog
- Chromium: Blog
- MySQL, MongoDB, memcached, Verilator: Paper
Wow, this is a very thorough analysis, thank you for doing this!
I have previously been reading all the Rust-related posts about enabling LTO/PGO/PLO for the compiler itself with great interest.
It looks like you get roughly 15% for each of the steps, which is very impressive. I would be curious what the impact on build times is. Especially with PGO/PLO, you have to build once with instrumentation and then a second time, so build times are at least 2x.
We have to find a reasonable middle ground here, as we are building Docker images for every commit on `master`, and we would like to have reasonable turnaround times in case we have to deploy some quick fixes, though that is rarely the case.
Then there is the question of getting some halfway-complete and realistic test cases to run on instrumented builds, which should include:
- Minidump processing with a realistic minidump
- Native Symbolication (which is part of minidump processing anyway)
- Debug File processing (PE, PDB, ELF, Mach-O, DWARF, wasm)
- SourceMap processing with realistic SourceMaps and artifact bundles.
Curating a good corpus of tests with edge cases would be challenging, and hooking that up to the build as well.
A side note purely on benchmarking: there is `symbolicator-stress`, which runs just in-process and does not require an external tool like `wrk` to push payloads to it. So if it is just about measuring potential wins, I would recommend that tool. But yes, in the end you would want to build/instrument/optimize the final HTTP server, so that is what you would want to use.
Again, huge thanks for doing all these measurements. At the very least, LTO should provide us with a good 10-15% win with reasonable effort. The wins for PGO/PLO look amazing as well, but the effort to get there is a lot higher, so I'm not sure yet if it would be worth doing.
> Especially with PGO/PLO, you have to build once with instrumentation and then a second time, so build times are at least 2x.
Is it possible to build once with instrumentation to produce a profile, and then build the release with the profile of a previous commit (and just disable PGO if there is none)? It would allow you to parallelize.
> We have to find a reasonable middle ground here, as we are building Docker images for every commit on `master`, and we would like to have reasonable turnaround times in case we have to deploy some quick fixes, though that is rarely the case.
Yep. This could be mitigated by building heavily-optimized Docker images (LTO + PGO + PLO) not for every commit but, say, every night or something like that. Of course, some changes to the current CI pipelines would be needed in this case. That's just an idea of how it could be implemented.
> The wins for PGO/PLO look amazing as well, but the effort to get there is a lot higher, so I'm not sure yet if it would be worth doing.
I agree. If you decide not to invest much in enabling PGO/PLO in your build scripts, I can suggest at least writing a note somewhere in the Symbolicator documentation about the possible opportunities to increase Symbolicator performance with PGO/PLO. That way, users who want to optimize Symbolicator as much as possible for their use cases, and who are not afraid of building their own Symbolicator binaries from source, will be aware of such optimizations.
> Is it possible to build once with instrumentation to produce a profile, and then build the release with the profile of a previous commit (and just disable PGO if there is none)? It would allow you to parallelize.
Yes, it's possible. Just be aware that with PGO profile caching you can run into a "profile skew" problem, where the PGO profile is a bit outdated and does not cover some new code (since the profile was created for older source code). In general, it's not critical if the source code is not changing much (so there is no huge difference between the commits). Another possible trap is changing compiler options: if you change a compiler option or upgrade the compiler version, it's highly recommended to regenerate the PGO profiles, since they will most likely be incompatible.
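A rough sketch of what that cached/parallelized flow could look like (the training script name is hypothetical; `llvm-profdata` and rustc's `-Cprofile-use` flag are real, and cargo-pgo writes raw profiles to `target/pgo-profiles/` by default):

```sh
# Nightly (or occasional) job: produce and cache a merged PGO profile.
cargo pgo build                                    # instrumented build
./run-training-workload.sh                         # hypothetical training run
llvm-profdata merge -o merged.profdata target/pgo-profiles/*.profraw
# ...upload merged.profdata to the CI cache, keyed by compiler version...

# Per-commit job: reuse the cached profile if one exists, else plain release.
if [ -f merged.profdata ]; then
    RUSTFLAGS="-Cprofile-use=$(pwd)/merged.profdata" cargo build --release
else
    cargo build --release
fi
```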
We now build Symbolicator Docker images with 1 CGU (codegen-units = 1) + LTO. We needed to throw a beefier cloudbuild machine at it, unfortunately. Build times suffered a little bit in general as well.
The results on CPU usage in production were underwhelming, though. I expected to see at least a tiny dent in the graphs, but nothing is visible.
So I will call this closed. Thank you very much for doing the initial assessment on this 🎉