chmln/sd

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT

zamazan4ik opened this issue · 0 comments

Hi!

Recently I ran Profile-Guided Optimization (PGO) benchmarks on multiple projects - the results are available here. That is why I think it is worth trying PGO on sd. I have already run some benchmarks and want to share the results here.

Test environment

  • Fedora 38
  • Linux kernel 6.5.5
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler - Rustc 1.73
  • sd version: the latest master branch at the time of writing, commit efb8198a4268cb6c74468e42ec3446cc1cd5b92c

Benchmark setup

As a test file, I use this large JSON file. sd is tested with the following command line: sd -p "(\w+)" "\$1\$1" dump.json > /dev/null. I took these arguments from issue #52. The same arguments and test file were used for PGO profile collection.

PGO optimization is done with cargo-pgo.
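For readers who want to reproduce this, the cargo-pgo workflow can be sketched roughly as follows. This is a minimal sketch, not the exact script used for the numbers in this issue: the function name pgo_build_sd is hypothetical, the dump.json default is just the file name from the command line above, and the binary location under target/<host-triple>/release is an assumption based on how cargo-pgo normally lays out its output.

```shell
# Sketch of a PGO build of sd with cargo-pgo. Wrapped in a function so that
# sourcing this snippet does not start a build on its own.
# Run it from the root of an sd checkout.
pgo_build_sd() {
    # Test file used as the training workload (assumed name, see above).
    pgo_input="${1:-dump.json}"

    # cargo-pgo needs llvm-profdata, which ships with llvm-tools-preview.
    rustup component add llvm-tools-preview || return 1
    cargo install cargo-pgo || return 1

    # cargo-pgo builds for an explicit target, so the binary is expected
    # under target/<host-triple>/release (assumption; check your target dir).
    pgo_host="$(rustc -vV | sed -n 's/^host: //p')"

    # 1. Build the instrumented binary.
    cargo pgo build || return 1

    # 2. Run the benchmark workload once to collect profiles.
    "./target/$pgo_host/release/sd" -p "(\w+)" "\$1\$1" "$pgo_input" > /dev/null || return 1

    # 3. Rebuild using the collected profiles.
    cargo pgo optimize
}
```

Calling pgo_build_sd /path/to/dump.json should leave the PGO-optimized binary in the same release directory.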

All benchmarks are run multiple times, on the same hardware/software setup, with the same background "noise" (as much as I can guarantee, of course).

Results

I got the following results:

hyperfine --warmup 10 --min-runs 100 'sd_release -p "(\w+)" "\$1\$1" dump.json > /dev/null' 'sd_pgo_optimized -p "(\w+)" "\$1\$1" dump.json > /dev/null'
Benchmark 1: sd_release -p "(\w+)" "\$1\$1" dump.json > /dev/null
  Time (mean ± σ):     916.7 ms ±  21.3 ms    [User: 881.4 ms, System: 33.1 ms]
  Range (min … max):   875.5 ms … 1032.8 ms    100 runs

Benchmark 2: sd_pgo_optimized -p "(\w+)" "\$1\$1" dump.json > /dev/null
  Time (mean ± σ):     745.3 ms ±   9.4 ms    [User: 710.3 ms, System: 33.1 ms]
  Range (min … max):   713.1 ms … 782.3 ms    100 runs

Summary
  sd_pgo_optimized -p "(\w+)" "\$1\$1" dump.json > /dev/null ran
    1.23 ± 0.03 times faster than sd_release -p "(\w+)" "\$1\$1" dump.json > /dev/null

Just for reference, sd in instrumentation mode (during PGO profile collection) shows the following results (measured with time):

time sd_pgo_instrumented -p "(\w+)" "\$1\$1" dump.json > /dev/null
sd_pgo_instrumented -p "(\w+)" "\$1\$1" dump.json  1,49s user 0,04s system 99% cpu 1,534 total

At least according to the simple benchmark above, PGO has a measurable positive effect on sd performance.
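To put the numbers above in one place, here is a small sketch that derives the speedup, the relative time reduction, and the approximate instrumentation overhead from the reported timings. The variable names are mine. The overhead figure compares a single time run against a hyperfine mean, so treat it as a rough estimate only.

```shell
# Mean wall-clock times in milliseconds, copied from the hyperfine output.
release_ms=916.7
pgo_ms=745.3
# Single `time` measurement of the instrumented binary (1,534 s total).
instrumented_ms=1534

# LC_ALL=C keeps "." as the decimal separator regardless of the user locale.
speedup=$(LC_ALL=C awk -v a="$release_ms" -v b="$pgo_ms" \
    'BEGIN { printf "%.2f", a / b }')
reduction=$(LC_ALL=C awk -v a="$release_ms" -v b="$pgo_ms" \
    'BEGIN { printf "%.1f", (a - b) / a * 100 }')
overhead=$(LC_ALL=C awk -v a="$release_ms" -v c="$instrumented_ms" \
    'BEGIN { printf "%.2f", c / a }')

echo "PGO speedup over release:      ${speedup}x"
echo "Run time reduction:            ${reduction}%"
echo "Instrumented vs release (~):   ${overhead}x"
```

The first value matches the 1.23 reported in the hyperfine summary.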

Further steps

I can suggest the following things to do:

  • Evaluate PGO's applicability to sd in more scenarios.
  • If PGO improves performance, add a note about it to sd's documentation (probably somewhere in the README file), so that users and maintainers are aware of another optimization opportunity for sd.
  • Integrate PGO into the build scripts, so that users and maintainers can easily apply PGO to their own workloads.
  • Optimize prebuilt binaries with PGO.

Here are some examples of how PGO is already integrated into other projects' build scripts:

Once PGO is in place, I suggest evaluating LLVM BOLT as an additional optimization step.
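cargo-pgo also has BOLT subcommands, so a combined PGO + BOLT build could look roughly like the sketch below. I have not benchmarked this for sd. The function name pgo_bolt_build_sd is hypothetical, the binary paths (including the -bolt-instrumented suffix) are assumptions based on cargo-pgo's usual output layout, and llvm-bolt and merge-fdata are assumed to be available on PATH.

```shell
# Sketch of a PGO + BOLT build of sd with cargo-pgo. Wrapped in a function
# so that sourcing this snippet does not start a build on its own.
# Run it from the root of an sd checkout.
pgo_bolt_build_sd() {
    # Training workload file (assumed name, same as in the benchmark).
    bolt_input="${1:-dump.json}"

    # Binaries are expected under target/<host-triple>/release (assumption).
    bolt_host="$(rustc -vV | sed -n 's/^host: //p')"
    bolt_dir="./target/$bolt_host/release"

    # 1. Build the PGO-instrumented binary and collect PGO profiles.
    cargo pgo build || return 1
    "$bolt_dir/sd" -p "(\w+)" "\$1\$1" "$bolt_input" > /dev/null || return 1

    # 2. Build a BOLT-instrumented binary on top of the PGO profiles
    #    and collect BOLT profiles with the same workload.
    cargo pgo bolt build --with-pgo || return 1
    "$bolt_dir/sd-bolt-instrumented" -p "(\w+)" "\$1\$1" "$bolt_input" > /dev/null || return 1

    # 3. Produce the final binary, optimized with both PGO and BOLT.
    cargo pgo bolt optimize --with-pgo
}
```

The resulting binary can then be compared against sd_release and sd_pgo_optimized with the same hyperfine command as above.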