fhanau/Efficient-Compression-Tool

Prebuilt binary with PGO here

kkocdko opened this issue · 8 comments

Update 2024-09-02: use this newer version, then run ./zcodecs ect xxx.

Profile-Guided Optimizations enabled.

[kkocdko@klf misc]$ ./hyperfine -w 1 -r 5 './ect_flto -5 1.png 2.png 3.png '
Benchmark 1: ./ect_flto -5 1.png 2.png 3.png 
  Time (mean ± σ):      5.400 s ±  0.011 s    [User: 5.351 s, System: 0.035 s]
  Range (min … max):    5.389 s …  5.411 s    5 runs
 
[kkocdko@klf misc]$ ./hyperfine -w 1 -r 5 './ect_flto_pgo -5 1.png 2.png 3.png '
Benchmark 1: ./ect_flto_pgo -5 1.png 2.png 3.png 
  Time (mean ± σ):      4.481 s ±  0.014 s    [User: 4.428 s, System: 0.042 s]
  Range (min … max):    4.469 s …  4.503 s    5 runs
 
[kkocdko@klf misc]$ 

The real result depends on your workload.
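For reference, the two means above can be turned into a speedup figure with a few lines of shell. This is only a convenience sketch; the helper name `speedup` is made up for illustration and is not part of ECT or hyperfine.

```shell
# Print how much faster the new run is than the baseline.
# $1: baseline mean in seconds, $2: new mean in seconds
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN {
        printf "%.2fx faster, %.1f%% less time\n", base / new, (base - new) / base * 100
    }'
}

# Means reported above for ect_flto and ect_flto_pgo at -5
speedup 5.400 4.481   # prints "1.21x faster, 17.0% less time"
```

Passing both commands to a single hyperfine invocation makes it print a comparable summary on its own.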

#120

For anyone who might stumble upon this:
Building for a specific x86-64 microarchitecture level can squeeze out some more performance, depending on the compression level and on what the hardware supports.

Benchmark 1 = plain build
Benchmark 2 = the binary linked above
Benchmark 3 = ltoed, pgoed and x86-64-v3 leveled build
Benchmark 4 = ltoed, pgoed and x86-64-v4 leveled build

Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      3.139 s ±  0.014 s    [User: 3.102 s, System: 0.031 s]
  Range (min … max):    3.120 s …  3.159 s    5 runs
 
Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      2.911 s ±  0.013 s    [User: 2.878 s, System: 0.026 s]
  Range (min … max):    2.895 s …  2.926 s    5 runs
 
Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      2.865 s ±  0.005 s    [User: 2.828 s, System: 0.030 s]
  Range (min … max):    2.858 s …  2.871 s    5 runs
 
Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      2.880 s ±  0.002 s    [User: 2.843 s, System: 0.030 s]
  Range (min … max):    2.878 s …  2.882 s    5 runs
 
Summary
  mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 /tmp/tst; rm -rf /tmp/tst ran
    1.01 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 /tmp/tst; rm -rf /tmp/tst
    1.02 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert /tmp/tst; rm -rf /tmp/tst
    1.10 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect /tmp/tst; rm -rf /tmp/tst

At default settings the difference is negligible; if that is all you use, don't bother.

Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      6.439 s ±  0.096 s    [User: 6.389 s, System: 0.037 s]
  Range (min … max):    6.334 s …  6.548 s    5 runs
 
Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      5.213 s ±  0.017 s    [User: 5.164 s, System: 0.037 s]
  Range (min … max):    5.193 s …  5.230 s    5 runs
 
Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      4.358 s ±  0.016 s    [User: 4.307 s, System: 0.040 s]
  Range (min … max):    4.340 s …  4.379 s    5 runs
 
Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      4.258 s ±  0.010 s    [User: 4.208 s, System: 0.040 s]
  Range (min … max):    4.251 s …  4.276 s    5 runs
 
Summary
  mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -5 /tmp/tst; rm -rf /tmp/tst ran
    1.02 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -5 /tmp/tst; rm -rf /tmp/tst
    1.22 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -5 /tmp/tst; rm -rf /tmp/tst
    1.51 ± 0.02 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -5 /tmp/tst; rm -rf /tmp/tst

At -5, the microarchitecture level gains almost as much as adding PGO did.

Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     65.767 s ±  0.164 s    [User: 65.578 s, System: 0.052 s]
  Range (min … max):   65.602 s … 66.035 s    5 runs
 
Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     43.676 s ±  0.030 s    [User: 43.521 s, System: 0.052 s]
  Range (min … max):   43.637 s … 43.711 s    5 runs
 
Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     27.658 s ±  0.162 s    [User: 27.531 s, System: 0.056 s]
  Range (min … max):   27.488 s … 27.927 s    5 runs
 
Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     25.154 s ±  0.079 s    [User: 25.034 s, System: 0.054 s]
  Range (min … max):   25.058 s … 25.256 s    5 runs
 
Summary
  mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -9 /tmp/tst; rm -rf /tmp/tst ran
    1.10 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -9 /tmp/tst; rm -rf /tmp/tst
    1.74 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -9 /tmp/tst; rm -rf /tmp/tst
    2.61 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -9 /tmp/tst; rm -rf /tmp/tst

Quite a bump: at -9, the v3 build shaves at least 16 seconds off the binary linked above and takes less than half as long as the plain build.

@ghtm2 Could you provide your binary? Back then I tested AVX2 (256-bit) and AVX-512 builds, but they ran even slower on my machine (AMD R5 5600U, Zen 3). If enabling AVX makes it faster, that's quite a big bump! Also, which CPU did you use for your benchmark?

@ghtm2 Hi, did you have nasm installed while building the binary?

@ghtm2 Could you provide your binary? Back then I tested AVX2 (256-bit) and AVX-512 builds, but they ran even slower on my machine (AMD R5 5600U, Zen 3). If enabling AVX makes it faster, that's quite a big bump! Also, which CPU did you use for your benchmark?

Sure, here are the v3 and v4 binaries: ect.tar.gz
You'll need at least glibc 2.38 installed though.
The CPU used is an AMD Ryzen 7 7840U, so Zen 4.
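Before running the v3 or v4 binaries it is worth checking which level the local CPU reaches, since a binary built for a level the CPU lacks will typically abort with an illegal-instruction error. Below is a minimal sketch, assuming Linux and the flag names that /proc/cpuinfo reports. The function name `x86_64_level` is made up for illustration, and the feature groups are meant to follow the x86-64 psABI level definitions (with `abm` standing in for LZCNT), so treat the result as a hint rather than a guarantee.

```shell
# Print the highest x86-64 microarchitecture level implied by a CPU flag list.
# $1 (optional): space-separated flags; defaults to the first "flags" line
#                of /proc/cpuinfo (Linux only).
x86_64_level() (
    if [ "$#" -gt 0 ]; then
        flags=" $1 "
    else
        flags=" $(awk -F: '/^flags/ { print $2; exit }' /proc/cpuinfo) "
    fi
    level=0
    # One feature group per level, in ascending order (v1, v2, v3, v4).
    for group in \
        "lm cmov cx8 fpu fxsr mmx syscall sse2" \
        "cx16 lahf_lm popcnt sse4_1 sse4_2 ssse3" \
        "avx avx2 bmi1 bmi2 f16c fma abm movbe xsave" \
        "avx512f avx512bw avx512cd avx512dq avx512vl"
    do
        for f in $group; do
            case "$flags" in
                *" $f "*) ;;        # flag present, keep checking this group
                *) break 2 ;;       # flag missing: stay at the previous level
            esac
        done
        level=$((level + 1))
    done
    if [ "$level" -eq 0 ]; then
        echo "not x86-64"
    else
        echo "x86-64-v$level"
    fi
)
```

Calling `x86_64_level` with no arguments inspects the current machine. A Zen 3 part such as the 5600U should come out as x86-64-v3 and a Zen 4 part as x86-64-v4. On systems with glibc 2.33 or newer, running the dynamic loader with `--help` (for example `/lib64/ld-linux-x86-64.so.2 --help`; the path varies by distribution) lists the supported levels as well.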

@ghtm2 Hi, did you have nasm installed while building the binary?

Yes.

@ghtm2 Awesome! Your binary is much faster. How did you do that? I appended -march=x86-64-v3 -mavx2 here, but the result is even slower: my benchmark went from 48s to 1m27s, while your ect_v3 binary takes 26s.

if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU" OR CMAKE_CXX_COMPILER_ID STREQUAL "Clang"
    OR CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang" OR CMAKE_CXX_COMPILER_ID STREQUAL "ARMClang")
  if(CPU_TYPE STREQUAL "x86_64" OR CPU_TYPE STREQUAL "i386")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mpclmul -msse4.2")
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mpclmul -msse4.2")

My whole build script is here. I built with llvm-19; did you use GCC?

https://github.com/clevert-app/clevert/blob/main/.github/workflows/asset_zcodecs.yml#L171

I really, really want to replicate your success.

I ran objdump on your binary. Was it built with GCC 14.2.1?

I reproduced your benchmark. It's faster using GCC instead of Clang. I will try to tweak it more. Thank you!

Sorry for the glacial response times, I'm quite busy at the moment.

Yes, I've built it with GCC 14.2.1, as that is what's currently shipped on Arch.
I can also confirm that Clang produces noticeably slower ect binaries, no matter the flags.

I've made a small howto to reproduce the build on Arch and its derivatives: howto.tar.gz

I'm pretty sure that there is still some performance to be had with the appropriate flags and better input for PGO.
One might also want to try optimizing further with BOLT, but I currently don't have the time to try.
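For readers without the howto at hand, a two-pass PGO build with GCC generally looks like the sketch below. This is a generic outline and not the contents of howto.tar.gz. The helper name `pgo_flags`, the profile directory /tmp/ect-pgo, and the assumption that the CMake project lives in src/ are all illustrative, and `-march=x86-64-v3` needs GCC 11 or newer.

```shell
# Compose GCC flags for one phase of a two-pass PGO build.
# $1: "generate" (instrumented build) or "use" (final, profile-guided build)
# $2: -march value (default: x86-64-v3)
# $3: directory for the profile data (hypothetical default: /tmp/ect-pgo)
pgo_flags() (
    phase=$1
    march=${2:-x86-64-v3}
    dir=${3:-/tmp/ect-pgo}
    common="-flto -march=$march"
    case "$phase" in
        generate) echo "$common -fprofile-generate=$dir" ;;
        use)      echo "$common -fprofile-use=$dir" ;;
        *)        echo "usage: pgo_flags generate|use [march] [profile-dir]" >&2
                  exit 1 ;;
    esac
)

# Pass 1: instrumented build, then run a representative workload on COPIES of
# the sample files (ect rewrites its inputs in place).
#   cmake -S src -B build -DCMAKE_BUILD_TYPE=Release \
#         -DCMAKE_C_FLAGS="$(pgo_flags generate)" \
#         -DCMAKE_CXX_FLAGS="$(pgo_flags generate)"
#   cmake --build build
#   mkdir -p /tmp/tst && cp samples/*.png /tmp/tst/ && ./build/ect -9 /tmp/tst
#
# Pass 2: clean and rebuild in the SAME build directory so GCC can match the
# recorded profiles to the object files.
#   cmake --build build --target clean
#   cmake -S src -B build \
#         -DCMAKE_C_FLAGS="$(pgo_flags use)" \
#         -DCMAKE_CXX_FLAGS="$(pgo_flags use)"
#   cmake --build build
```

The training run in pass 1 is where "better input for PGO" comes in: the closer the sample files and compression level are to the real workload, the more useful the profile should be.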