ouch-org/ouch

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT

zamazan4ik opened this issue · 0 comments

Hi!

I recently ran Profile-Guided Optimization (PGO) benchmarks on many projects; the results are available here. Based on them, I think it is worth trying PGO on Ouch. I have already run some benchmarks and want to share the results.

Test environment

  • Fedora 38
  • Linux kernel 6.5.5
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler: rustc 1.73
  • Ouch version: latest main branch at the time of writing, commit dc21932102011da61a85a98f43d9d8d9ab6bd917
  • Disabled Turbo boost

Benchmark setup

For benchmarking, I use the project's own benchmark script: https://github.com/ouch-org/ouch/blob/main/benchmarks/run-benchmarks.sh . The Release build is done with cargo build --release; the PGO-optimized build is done with cargo-pgo. PGO profiles are collected from the benchmark workload itself.
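For anyone who wants to reproduce the PGO build, here is a minimal sketch of the cargo-pgo workflow. The function name pgo_build and the workload script in the usage line are hypothetical; the workload command is whatever you pass in, and it has to invoke the instrumented binary that cargo-pgo reports after the first build (normally under target/<target-triple>/release/), otherwise no profiles are collected.

```shell
# Sketch of a PGO build with cargo-pgo. Nothing runs until pgo_build is called.
# Usage: pgo_build <workload command...>
# Assumption: the workload command exercises the *instrumented* binary.
pgo_build() (
  set -e  # stop at the first failing step (the function body is a subshell)

  # One-time setup: cargo-pgo relies on llvm-profdata from this component.
  rustup component add llvm-tools-preview
  cargo install cargo-pgo

  # 1. Build an instrumented binary.
  cargo pgo build

  # 2. Run the workload so the instrumented binary writes its profiles.
  "$@"

  # 3. Rebuild using the collected profiles.
  cargo pgo optimize
)
```

Example call, with a hypothetical wrapper script: pgo_build ./my-pgo-workload.sh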

All benchmarks are run multiple times on the same hardware/software setup, with the same background "noise" (as far as I can guarantee, of course).

Results

ouch_release is the Release build; ouch_optimized is the Release + PGO build.

I got the following results:

./run-benchmarks.sh
Benchmark 1: ./ouch_release compress rust output.tar
  Time (mean ± σ):     781.0 ms ±   3.9 ms    [User: 119.2 ms, System: 649.2 ms]
  Range (min … max):   772.3 ms … 789.9 ms    50 runs

Benchmark 2: ./ouch_optimized compress rust output.tar
  Time (mean ± σ):     759.7 ms ±   7.0 ms    [User: 104.1 ms, System: 643.2 ms]
  Range (min … max):   732.5 ms … 784.5 ms    50 runs

Summary
  ./ouch_optimized compress rust output.tar ran
    1.03 ± 0.01 times faster than ./ouch_release compress rust output.tar
Creating tar archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.tar --dir output
  Time (mean ± σ):      3.138 s ±  0.022 s    [User: 0.339 s, System: 2.725 s]
  Range (min … max):    3.103 s …  3.239 s    50 runs

Benchmark 2: ./ouch_optimized decompress input.tar --dir output
  Time (mean ± σ):      3.091 s ±  0.014 s    [User: 0.312 s, System: 2.704 s]
  Range (min … max):    3.063 s …  3.134 s    50 runs

Summary
  ./ouch_optimized decompress input.tar --dir output ran
    1.02 ± 0.01 times faster than ./ouch_release decompress input.tar --dir output
Benchmark 1: ./ouch_release compress compiler output.tar.gz
  Time (mean ± σ):      70.5 ms ±   2.6 ms    [User: 729.9 ms, System: 62.0 ms]
  Range (min … max):    66.5 ms …  79.9 ms    50 runs

Benchmark 2: ./ouch_optimized compress compiler output.tar.gz
  Time (mean ± σ):      68.8 ms ±   2.3 ms    [User: 727.0 ms, System: 62.3 ms]
  Range (min … max):    64.6 ms …  76.3 ms    50 runs

Summary
  ./ouch_optimized compress compiler output.tar.gz ran
    1.02 ± 0.05 times faster than ./ouch_release compress compiler output.tar.gz
Creating tar.gz archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.tar.gz --dir output
  Time (mean ± σ):     255.9 ms ±   4.0 ms    [User: 82.4 ms, System: 173.9 ms]
  Range (min … max):   251.7 ms … 273.4 ms    50 runs

Benchmark 2: ./ouch_optimized decompress input.tar.gz --dir output
  Time (mean ± σ):     254.8 ms ±   2.9 ms    [User: 79.2 ms, System: 175.4 ms]
  Range (min … max):   250.6 ms … 263.6 ms    50 runs

Summary
  ./ouch_optimized decompress input.tar.gz --dir output ran
    1.00 ± 0.02 times faster than ./ouch_release decompress input.tar.gz --dir output
Benchmark 1: ./ouch_optimized compress compiler output.zip
  Time (mean ± σ):     523.7 ms ±   1.4 ms    [User: 474.3 ms, System: 46.8 ms]
  Range (min … max):   521.4 ms … 530.8 ms    50 runs

Benchmark 2: ./ouch_release compress compiler output.zip
  Time (mean ± σ):     527.0 ms ±   2.5 ms    [User: 479.2 ms, System: 45.1 ms]
  Range (min … max):   524.2 ms … 535.9 ms    50 runs

Summary
  ./ouch_optimized compress compiler output.zip ran
    1.01 ± 0.01 times faster than ./ouch_release compress compiler output.zip
Creating zip archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.zip --dir output
  Time (mean ± σ):     241.0 ms ±   2.0 ms    [User: 84.2 ms, System: 157.6 ms]
  Range (min … max):   238.7 ms … 249.3 ms    50 runs

Benchmark 2: ./ouch_optimized decompress input.zip --dir output
  Time (mean ± σ):     243.5 ms ±   3.1 ms    [User: 84.6 ms, System: 158.6 ms]
  Range (min … max):   236.7 ms … 253.0 ms    50 runs

Summary
  ./ouch_release decompress input.zip --dir output ran
    1.01 ± 0.02 times faster than ./ouch_optimized decompress input.zip --dir output

check results at results.md

According to these tests, PGO gives an improvement of about 1-3% in most of these benchmarks (the largest gain is on tar compression). The tar.gz and zip decompression results are within measurement noise.
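To make the "N times faster" lines easier to check, here is a small sketch that recomputes them from the reported means and standard deviations. It assumes the usual first-order error propagation for a ratio of two independent measurements; the helper name speedup is mine. With the tar numbers above it gives the same figures as the summaries.

```shell
# speedup <ref_mean> <ref_sigma> <new_mean> <new_sigma>
# Prints "<ratio> <uncertainty>", both rounded to two decimals.
speedup() {
  awk -v m_ref="$1" -v s_ref="$2" -v m_new="$3" -v s_new="$4" 'BEGIN {
    ratio = m_ref / m_new              # how many times faster the new build is
    r_ref = s_ref / m_ref              # relative sigma of the reference build
    r_new = s_new / m_new              # relative sigma of the new build
    err = ratio * sqrt(r_ref * r_ref + r_new * r_new)
    printf "%.2f %.2f\n", ratio, err
  }'
}

# tar compression: ouch_release vs ouch_optimized (values in ms)
speedup 781.0 3.9 759.7 7.0
# tar decompression (values in s)
speedup 3.138 0.022 3.091 0.014
```

Units cancel out in the ratio, so it only matters that both means and both sigmas for one benchmark use the same unit.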

Further steps

I can suggest the following things to do:

  • Evaluate PGO's applicability to Ouch in more scenarios.
  • If PGO helps to achieve better performance, add a note about it to Ouch's documentation (probably somewhere in the README file). That way, users and maintainers will be aware of another optimization opportunity for Ouch.
  • Provide PGO integration into the build scripts. It can help users and maintainers easily apply PGO for their own workloads.
  • Optimize prebuilt binaries with PGO.

Here are some examples of how PGO is already integrated into other projects' build scripts:

I can also suggest evaluating LLVM BOLT as an additional optimization step after PGO.
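I have not measured BOLT on Ouch, so the following is only a sketch of how the combined PGO + BOLT flow could look with cargo-pgo. It assumes llvm-bolt and merge-fdata are available on PATH. The function name pgo_bolt_build is hypothetical, and the workload command you pass in has to run the binary produced by the preceding build step each time (cargo-pgo reports where it is).

```shell
# Sketch of PGO followed by BOLT via cargo-pgo. Nothing runs until called.
# Usage: pgo_bolt_build <workload command...>
# Assumption: the workload runs the binary built by the step just before it.
pgo_bolt_build() (
  set -e  # stop at the first failing step (the function body is a subshell)

  # 1. PGO-instrumented build, then collect PGO profiles.
  cargo pgo build
  "$@"

  # 2. BOLT-instrumented build that already uses the PGO profiles,
  #    then collect BOLT profiles with the same workload.
  cargo pgo bolt build --with-pgo
  "$@"

  # 3. Final binary: PGO-optimized, then post-link optimized by BOLT.
  cargo pgo bolt optimize --with-pgo
)
```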