Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT
Hi!
Recently I did many Profile-Guided Optimization (PGO) benchmarks on multiple projects - the results are available here. That's why I think it's worth trying to apply PGO to `sd`. I already performed some benchmarks and want to share my results here.
Test environment
- Fedora 38
- Linux kernel 6.5.5
- AMD Ryzen 9 5900x
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Compiler - Rustc 1.73
- `sd` version: the latest for now from the `master` branch, on commit `efb8198a4268cb6c74468e42ec3446cc1cd5b92c`
Benchmark setup
As a test file, I use this large enough JSON file. `sd` is tested with the following command line: `sd -p "(\w+)" "\$1\$1" dump.json > /dev/null`. I took these arguments from issue #52. For the PGO profile collection, the same arguments and test file were used.
PGO optimization is done with cargo-pgo.
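For reference, the cargo-pgo flow behind these numbers looks roughly like the sketch below. The instrumented binary's path depends on your target triple, so treat the paths as illustrative; the workload is the same command as in the benchmark section.

```bash
# Rough outline of the cargo-pgo flow (paths are illustrative).

# 1. Build an instrumented binary that emits PGO profiles when it runs.
cargo pgo build

# 2. Run a representative workload to collect profiles
#    (same arguments and test file as in the benchmarks below).
./target/x86_64-unknown-linux-gnu/release/sd -p "(\w+)" "\$1\$1" dump.json > /dev/null

# 3. Rebuild with the collected profiles applied.
cargo pgo optimize
```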
All benchmarks are done multiple times, on the same hardware/software setup, with the same background "noise" (as much as I can guarantee, of course).
Results
I got the following results:
```
hyperfine --warmup 10 --min-runs 100 'sd_release -p "(\w+)" "\$1\$1" dump.json > /dev/null' 'sd_pgo_optimized -p "(\w+)" "\$1\$1" dump.json > /dev/null'
Benchmark 1: sd_release -p "(\w+)" "\$1\$1" dump.json > /dev/null
  Time (mean ± σ):     916.7 ms ±  21.3 ms    [User: 881.4 ms, System: 33.1 ms]
  Range (min … max):   875.5 ms … 1032.8 ms    100 runs

Benchmark 2: sd_pgo_optimized -p "(\w+)" "\$1\$1" dump.json > /dev/null
  Time (mean ± σ):     745.3 ms ±   9.4 ms    [User: 710.3 ms, System: 33.1 ms]
  Range (min … max):   713.1 ms … 782.3 ms    100 runs

Summary
  sd_pgo_optimized -p "(\w+)" "\$1\$1" dump.json > /dev/null ran
    1.23 ± 0.03 times faster than sd_release -p "(\w+)" "\$1\$1" dump.json > /dev/null
```
Just for reference, `sd` in the instrumentation mode (during the PGO profile collection) has the following results (in `time` format):
```
time sd_pgo_instrumented -p "(\w+)" "\$1\$1" dump.json > /dev/null
sd_pgo_instrumented -p "(\w+)" "\$1\$1" dump.json  1,49s user 0,04s system 99% cpu 1,534 total
```
At least according to the simple benchmark above, PGO has a measurable positive effect on `sd`'s performance.
Further steps
I can suggest the following things to do:
- Evaluate PGO's applicability to `sd` in more scenarios.
- If PGO helps to achieve better performance, add a note about it to `sd`'s documentation (probably somewhere in the README file). In this case, users and maintainers will be aware of another optimization opportunity for `sd`.
- Provide PGO integration in the build scripts. It can help users and maintainers easily apply PGO for their own workloads (see the sketch after this list).
- Optimize prebuilt binaries with PGO.
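To make the build-script idea above more concrete, here is a minimal sketch of a manual PGO build using plain rustc flags, independent of cargo-pgo. The profile directory, the workload, and the script itself are assumptions for illustration only; `llvm-profdata` must match the LLVM version used by rustc (e.g. from the `llvm-tools-preview` rustup component).

```bash
# Hypothetical PGO build script sketch (not part of sd's repo).
set -euo pipefail

PROFDIR=/tmp/sd-pgo-data

# 1. Instrumented build: the binary will write .profraw files when it runs.
RUSTFLAGS="-Cprofile-generate=$PROFDIR" cargo build --release

# 2. Run a representative workload to collect profiles.
./target/release/sd -p "(\w+)" "\$1\$1" dump.json > /dev/null

# 3. Merge the raw profiles into a single .profdata file.
llvm-profdata merge -o "$PROFDIR/merged.profdata" "$PROFDIR"

# 4. Optimized build using the merged profile.
RUSTFLAGS="-Cprofile-use=$PROFDIR/merged.profdata" cargo build --release
```

cargo-pgo automates essentially these same steps, so a build-script integration could simply call it where it is available.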
Here are some examples of how PGO is already integrated into other projects' build scripts:
- Rustc: a CI script for the multi-stage build
- GCC:
- Clang: Docs
- Python:
- Go: Bash script
- V8: Bazel flag
- ChakraCore: Scripts
- Chromium: Script
- Firefox: Docs
- Thunderbird has PGO support too
- PHP - Makefile command and old Centminmod scripts
- MySQL: CMake script
- YugabyteDB: GitHub commit
- FoundationDB: Script
- Zstd: Makefile
- Foot: Scripts
- Windows Terminal: GitHub PR
- Pydantic-core: GitHub PR
After PGO, I can suggest evaluating LLVM BOLT as an additional optimization step.
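For completeness, a BOLT pass could look roughly like the sketch below. This assumes an `llvm-bolt` build is available and that the binary was linked with relocations preserved (e.g. `-Wl,--emit-relocs`); the flags in the last step are one commonly cited set from BOLT's documentation, not a tuned recommendation, and cargo-pgo also wraps this flow with its `bolt` subcommands.

```bash
# Sketch of instrumentation-based BOLT applied on top of the PGO-optimized binary.

# 1. Create an instrumented copy of the binary.
llvm-bolt ./sd -instrument -o ./sd-bolt-instrumented

# 2. Run the same representative workload; by default the instrumented
#    binary writes its profile to /tmp/prof.fdata.
./sd-bolt-instrumented -p "(\w+)" "\$1\$1" dump.json > /dev/null

# 3. Re-optimize the original binary using the collected profile.
llvm-bolt ./sd -o ./sd-bolt-optimized -data=/tmp/prof.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold
```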