ggerganov/whisper.cpp

Benchmark results

ggerganov opened this issue · 162 comments

Encoder

Collection of bench results for various platforms and devices.
If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh script and report the results in the comments below.

Suggestions for a better summary of the results are welcome

CPU OS Config Model Th Load Enc. Commit
MacBook M1 Pro MacOS 13.0.1 NEON BLAS tiny 8 71 102 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS base 8 96 220 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 8 233 685 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS medium 8 603 1928 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS large 8 1158 3350 206fc93
---
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 1 251 2605 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 4 255 884 206fc93
---
Mac Mini M1 MacOS NEON BLAS tiny 4 62 194 fcf515d
Mac Mini M1 MacOS NEON BLAS base 4 81 380 fcf515d
Mac Mini M1 MacOS NEON BLAS small 4 204 1249 fcf515d
Mac Mini M1 MacOS NEON BLAS medium 4 876 3980 fcf515d
Mac Mini M1 MacOS NEON BLAS large 4 1876 7979 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 tiny 8 107 422 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 base 8 137 880 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 small 8 280 2874 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 medium 8 692 9610 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 large 8 1317 16917 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS tiny 4 120 780 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS base 4 151 1173 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS small 4 289 3062 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS medium 4 711 9175 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS large 4 1282 16050 fcf515d
---
Ryzen 9 5950X Ubuntu 22.04 AVX2 tiny 8 135 197 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 base 8 176 421 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 small 8 357 1393 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 medium 8 855 4404 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 large 8 1576 8118 fcf515d
---
Raspberry Pi 4 NEON tiny 4 1436 13839 fcf515d
Raspberry Pi 4 NEON base 4 1894 30552 fcf515d
---
iPhone 13 Mini iOS 16.0 NEON BLAS base 4 97 1091 fcf515d
---
MacBook M1 Pro Vivaldi WASM tiny 8 133 3785 fcf515d
MacBook M1 Pro Vivaldi WASM base 8 172 8253 fcf515d
---
MacBook M1 Pro Chrome WASM tiny 8 134 3776 fcf515d
MacBook M1 Pro Chrome WASM base 8 168 8200 fcf515d
---
MacBook M1 Pro Firefox WASM tiny 8 137 2626 fcf515d
MacBook M1 Pro Firefox WASM base 8 183 6226 fcf515d

memcpy

MacBook M1 Pro

./bench -w 1 -t 1
memcpy: 37.59 GB/s

Ryzen 9 5950X

./bench -w 1 -t 1
memcpy: 16.74 GB/s
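
For context, the memcpy figure is a single-threaded memory-bandwidth micro-benchmark. A minimal sketch of the idea (illustrative only, not the actual code behind ./bench -w 1) could look like this:

// memcpy bandwidth sketch (POSIX, compile with: cc -O2 memcpy_bw.c)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    const size_t n = (size_t)256 * 1024 * 1024;   // 256 MiB per copy
    char *src = malloc(n);
    char *dst = malloc(n);
    memset(src, 1, n);                            // pre-fault the pages
    memset(dst, 0, n);

    struct timespec t0, t1;
    const int iters = 8;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        memcpy(dst, src, n);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gbps = (double)iters * n / sec / 1e9;
    // print a byte of dst so the copies are not optimized away
    printf("memcpy: %.2f GB/s (%d)\n", gbps, dst[n - 1]);

    free(src);
    free(dst);
    return 0;
}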

ggml_mul_mat

MacBook M1 Pro

./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16    330.6 GFLOPS (128 runs) / F32    466.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16    737.5 GFLOPS (128 runs) / F32    838.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    938.6 GFLOPS (128 runs) / F32   1062.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1312.5 GFLOPS (128 runs) / F32   1835.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1765.1 GFLOPS (128 runs) / F32   2041.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1784.3 GFLOPS (104 runs) / F32   1859.2 GFLOPS (109 runs)
ggml_mul_mat:  4096 x  4096: F16   1855.1 GFLOPS ( 14 runs) / F32   1873.3 GFLOPS ( 14 runs)

Ryzen 9 5950X

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     56.3 GFLOPS (128 runs) / F32     70.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     47.8 GFLOPS (128 runs) / F32     67.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    185.1 GFLOPS (128 runs) / F32    332.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    386.4 GFLOPS (128 runs) / F32    658.6 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    636.2 GFLOPS (128 runs) / F32   1012.0 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    950.9 GFLOPS ( 56 runs) / F32   1296.8 GFLOPS ( 76 runs)
ggml_mul_mat:  4096 x  4096: F16   1168.6 GFLOPS (  9 runs) / F32   1403.1 GFLOPS ( 11 runs)
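
For reference, the GFLOPS figures are derived from the usual 2*N^3 floating-point operation count for an NxN matrix multiplication divided by the measured time. A naive illustration of how such a number is produced (this only shows the metric, not how ggml achieves its throughput):

// GFLOPS metric sketch: time an NxN matmul and divide 2*N^3 by the elapsed time
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int N = 512;
    float *A = calloc((size_t)N * N, sizeof(float));
    float *B = calloc((size_t)N * N, sizeof(float));
    float *C = calloc((size_t)N * N, sizeof(float));

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec    = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gflops = 2.0 * N * N * N / sec / 1e9;
    printf("%4d x %4d: %6.1f GFLOPS\n", N, N, gflops);

    free(A); free(B); free(C);
    return 0;
}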

Results for Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-4790K Debian   tiny.en 4 165 808
i7-4790K Debian   tiny.en 8 165 783
i7-4790K Debian   base.en 4 212 1813
i7-4790K Debian   base.en 8 214 1746

Results for Ryzen 5 4500U 6C/6T laptop CPU (I've just included one result for 8 threads as Encode time is much higher when threads > CPU cores).

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 4 170.00 829.43
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 6 143.03 671.74
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 4 305.92 2,092.39
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 6 188.05 1,495.61
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 4 408.03 6,919.31
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 6 359.23 6,370.83
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 4 2,238.11 25,863.28
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 6 1,113.04 19,672.63
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 8 973.65 39,619.20
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 tiny 2 164.35 1087.61
i7-11800H WSL2 Ubuntu AVX2 tiny 4 128.94 733.24
i7-11800H WSL2 Ubuntu AVX2 tiny 8 137.57 619.88
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 2 143.02 1087.15
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 4 127.60 730.57
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 8 125.62 616.27
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 2 132.59 1511.38
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 4 132.48 1407.49
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 8 133.82 1458.27
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 base 2 174.34 2533.79
i7-11800H WSL2 Ubuntu AVX2 base 4 166.68 1830.67
i7-11800H WSL2 Ubuntu AVX2 base 8 165.53 1478.73
i7-11800H WSL2 Ubuntu AVX2 small 2 340.12 8714.24
i7-11800H WSL2 Ubuntu AVX2 small 4 394.32 6021.41
i7-11800H WSL2 Ubuntu AVX2 small 8 305.98 4828.84
i7-11800H WSL2 Ubuntu AVX2 large 2 3205.36 57109.10
i7-11800H WSL2 Ubuntu AVX2 large 4 2720.25 38519.89
i7-11800H WSL2 Ubuntu AVX2 large 8 3716.34 27739.99
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 2 1954.21 54966.84
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 4 1455.40 37320.62
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 8 1372.58 27937.64

This performance is impressive!

M1 Pro | MacOS |   | large | 8 | 1973 | 4208

This performance is impressive!

Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares.

By the way, AVX-512 is not supported on master. I have added initial support here, but I am not sure if it works: #95

CPU OS Config Model Threads Load [ms] Encode [ms]
Intel® Core™ i5-8250U Win11 Home AVX2 Large 8 2226.85 61547.61

compiled with MinGW64 gcc 11.3

Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)

CPU OS Config Model Threads Load [ms] Encode [ms]
AMD Custom APU 0405 SteamOS 3.2 AVX2 Base 8 326.32 2592.96

Compiled with cc (GCC) 11.3.0

The performance gains on jfk.wav since last test (two weeks or so ago) are extremely impressive, ~10-20x speedup from 40 to 2-4 seconds.

CPU OS Config Model Threads Load [ms] Encode [ms]
MacBook M1 Max macOS Ventura BLAS small 1 299.09 4166.00
MacBook M1 Max macOS Ventura BLAS small 4 329.45 1304.32
MacBook M1 Max macOS Ventura BLAS base 1 139.10 1302.17
MacBook M1 Max macOS Ventura BLAS base 4 135.96 399.45

On an AMD EPYC 64-core / 240-thread cloud instance, it gets stuck like this with 240 threads. I noticed that above a certain number of threads it's slow, or the cloud provider is CPU-limiting. Can anyone else with real hardware check if this is the case?

time ./main -m models/ggml-base.en.bin -f elon.wav -t 240
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 240 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ..

So I have tried various numbers of threads with the above-mentioned cloud provider.

I found that anything above 64 threads gets slower, and it is usable up to 120 threads. Anything above that hangs. It must be that the cloud provider is throttling the free trial, or too many threads really do slow things down.

...
...
processor       : 239
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7742 64-Core Processor
stepping        : 0
microcode       : 0x830104d
cpu MHz         : 2245.780
cache size      : 512 KB
physical id     : 1
siblings        : 120
core id         : 59
cpu cores       : 60
apicid          : 247
initial apicid  : 247
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4491.56
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:
time ./main -m models/ggml-base.en.bin -f elon.wav -t 64
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 64 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 64 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.960]   [MUSIC PLAYING]
[00:00:03.960 --> 00:00:18.240]   In life, we've seen within this part of the world
...
...
[00:35:40.320 --> 00:35:41.920]   Thank you, and have a great day.
[00:35:41.920 --> 00:35:43.920]   [APPLAUSE]
[00:35:43.920 --> 00:35:45.920]   [MUSIC PLAYING]
[00:35:45.920 --> 00:35:56.240]   [VIDEO PLAYBACK]


whisper_print_timings:     load time =   249.61 ms
whisper_print_timings:      mel time =  1267.11 ms
whisper_print_timings:   sample time =  1718.69 ms
whisper_print_timings:   encode time = 63702.25 ms / 10617.04 ms per layer
whisper_print_timings:   decode time = 381317.66 ms / 63552.94 ms per layer
whisper_print_timings:    total time = 448362.19 ms

real    7m28.411s
user    347m2.230s
sys     22m42.511s

32 threads was faster than 64 threads. I think 32 threads took around 7 minutes or so.

Env: Restricted Cloud / Throttled Maybe

CPU: AMD EPYC 7742 64-Core Processor

OS:

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal
Linux XXXX 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Compiler:

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1) 
$ ./bench -m ./models/ggml-small.en.bin -t 4
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

whisper_print_timings:     load time =   515.02 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6878.32 ms / 573.19 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  7393.42 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler
$ ./bench -m ./models/ggml-small.en.bin -t 240
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

whisper_print_timings:     load time =   528.66 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 12898.34 ms / 1074.86 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 13427.03 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler

I'll remove the above posts if they're too much clutter.

@trholding
Thanks for the results.

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Regarding the threads - yes, it seems that going beyond 8 threads does not help regardless of how many cores you have. My guess is that the computation is memory-bound so that's why using more threads does not improve the performance.
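
As a rough back-of-envelope under that hypothesis (using numbers from this thread, purely illustrative): a single pass over the small model's ~464 MB of weights at the ~16.7 GB/s memcpy bandwidth measured on the Ryzen 5950X above already costs about 0.464 / 16.7 ≈ 0.028 s, and the Encoder streams through the weights and large activation buffers many times, so once a handful of threads saturate the memory bus, additional threads mostly wait on memory rather than doing extra useful work.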

Okay, 8 threads max. So for a large file, is there a possibility of splitting the file into chunks, with silences as terminators, dividing the conversion across ((total threads/cores)/8) workers, while also keeping track of timestamps? This could be awesome for batch conversion.
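
A hypothetical sketch of the silence-splitting half of that idea, purely for illustration (the function name, window size and threshold below are made up; nothing like this is implied to exist in whisper.cpp): scan fixed windows of 16 kHz mono PCM, mark low-energy windows as candidate cut points, then transcribe the chunks independently and add each chunk's sample offset back onto its timestamps.

// Hypothetical helper: find candidate split points in quiet regions of 16 kHz mono PCM.
// Thresholds and window sizes are arbitrary illustrative values.
#include <math.h>
#include <stddef.h>

// Writes up to max_splits sample offsets into `splits`, returns how many were found.
size_t find_silence_splits(const float *pcm, size_t n_samples,
                           size_t *splits, size_t max_splits) {
    const size_t win        = 16000 / 10;    // 100 ms windows at 16 kHz
    const float  rms_thresh = 0.01f;         // "silence" threshold (arbitrary)
    const size_t min_gap    = 16000 * 30;    // keep chunks at least ~30 s long

    size_t n_splits = 0, last_split = 0;
    for (size_t i = 0; i + win <= n_samples && n_splits < max_splits; i += win) {
        double acc = 0.0;
        for (size_t j = 0; j < win; j++) {
            acc += (double)pcm[i + j] * pcm[i + j];
        }
        if (sqrt(acc / win) < rms_thresh && i - last_split >= min_gap) {
            splits[n_splits++] = i;          // cut inside a quiet window
            last_split = i;
        }
    }
    return n_splits;
}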

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Oh, I didn't know, I'll update with tables soon and remove my previous comments in a few hours.

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Hey, sorry. That didn't pan out well. I did the benchmark three times and my account got deleted without notice. I couldn't get the logs as it was a web terminal. On the other hand, I am happy this happened: I was giving serious thought to purchasing a GPU+CPU plan there, so a performance check of the CPU was equally important. Technically it was probably my fault - I probably shouldn't have used a reverse shell and run benchmarks on a free trial, but how else does one know whether a service is really good or all just vapor...

Dell Precision 5560 laptop results:

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11850H Ubuntu AVX2 tiny 4 115.87 538.43
i7-11850H Ubuntu AVX2 base 4 145.14 1241.84
i7-11850H Ubuntu AVX2 small 4 299.30 4343.57
i7-11850H Ubuntu AVX2 medium 4 760.98 15238.31
i7-11850H Ubuntu AVX2 large 4 1404.32 27476.86
i7-11850H Ubuntu AVX2 tiny 8 131.96 358.81
i7-11850H Ubuntu AVX2 base 8 166.61 839.31
i7-11850H Ubuntu AVX2 small 8 320.29 2854.86
i7-11850H Ubuntu AVX2 medium 8 756.20 9829.62
i7-11850H Ubuntu AVX2 large 8 1382.38 19872.81
CPU OS Config Model Threads Load [ms] Encode [ms]
i9-9900K WSL2 Ubuntu (GCC) AVX2  tiny.en 4 85.71 601.56
i9-9900K WSL2 Ubuntu (GCC) AVX2  small.en 4 212.59 5146.23
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2  tiny.en 4 198.17 455.12
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2  base.en 4 272.62 909.71
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2 small.en 4 598.75 2968.75
Xeon(R) Silver 4210R CPU @ 2.40GHz Virtual Machine - Debian Stretch (GCC - master branch) AVX2 avx512f avx512dq avx512cd avx512bw avx512vl small.en 4 776.56 12340.41
Xeon(R) Silver 4210R CPU @ 2.40GHz Virtual Machine - Debian Stretch (GCC - master branch) AVX2 avx512f avx512dq avx512cd avx512bw avx512vl tiny.en 4 295.54 1710.46
CPU OS Config Model Threads Load [ms] Encode [ms]
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 4 124.28 656.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 8 123.70 696.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 4 159.91 1754.44
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 8 164.47 1658.55
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 4 330.91 6161.86
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 8 346.22 5187.85
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 - small.en 4 1,314.25 294,168.09

Compiled with VS 2022

Something is off, right?

Yup - you are missing the AVX2 flag. See if some of the comments in #5 can help you resolve this.

OK, the AVX2 flag seems to help :)

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 AVX2 small.en 4 527.59 9,648.67

Compiled with VS 2022

j1nx commented
CPU OS Config Model Threads Load [ms] Encode [ms] Remarks
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 861.34 29428.21 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 843.80 16145.62 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 835.68 21509.08 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 824.24 13187.96 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 1146.02 87615.00 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 1103.39 52228.30 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 1183.47 55256.20 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 1161.32 29851.40 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 752.64 24018.10 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 751.96 13082.95 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 743.37 10122.80 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 742.90 9564.89 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 974.46 71587.61 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 979.65 43852.07 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 982.24 24814.62 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 982.80 19910.19 Without OVOS services running

From the stream repo


CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny.en 4 243.54 ms 779.49 ms
RK3588 Ubuntu20.04 NEON base.en 4 316.52 ms 1821.06 ms
RK3588 Ubuntu20.04 NEON small.en 4 618.93 ms 7117.69 ms
RK3588 Ubuntu20.04 NEON medium.en 4 1514.88 ms 24139.92 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 4 233.86 ms 791.01 ms
RK3588 Ubuntu20.04 NEON base 4 297.93 ms 1813.69 ms
RK3588 Ubuntu20.04 NEON small 4 592.18 ms 7102.28 ms
RK3588 Ubuntu20.04 NEON medium 4 1587.36 ms 24147.87 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 8 226.48 ms 740.34 ms
RK3588 Ubuntu20.04 NEON base 8 300.48 ms 1723.42 ms
RK3588 Ubuntu20.04 NEON small 8 620.58 ms 6392.47 ms
RK3588 Ubuntu20.04 NEON medium 8 1533.75 ms 21899.08 ms

I still haven't worked out the little (0-3) / big (4-7) core layout on this thing; these are the results if I pin to the big cores with taskset -c 4-7:

CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny.en 4 234.14 ms 681.53 ms
RK3588 Ubuntu20.04 NEON base.en 4 297.08 ms 1679.75 ms
RK3588 Ubuntu20.04 NEON small.en 4 599.98 ms 6867.66 ms
RK3588 Ubuntu20.04 NEON medium.en 4 1492.73 ms 23600.45 ms

I tried to compile with OpenBLAS, but it seemed to kill the make.


From the master repo, as I hadn't thought about which repo I was on after trying the streaming input:

CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 8 226.48 ms 2681.05 ms
RK3588 Ubuntu20.04 NEON base 8 283.56 ms 6132.44 ms
RK3588 Ubuntu20.04 NEON small 8 583.39 ms 24397.78 ms
RK3588 Ubuntu20.04 NEON medium 8 1490.98 ms 85099.45 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny.en 8 136.29 454.52
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 8 134.64 486.01
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 8 180.22 1184.80
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base.en 8 192.86 1197.85
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 8 367.55 4179.00
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small.en 8 378.27 4557.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 8 923.48 15552.61
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium.en 8 952.48 15708.63
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 large 8 1650.28 28357.09

8 threads seemed to be the fastest. However, I managed to squeeze out a bit more performance by pinning CPUs:

$ taskset -c 0-15 ./extra/bench-all.sh 16
CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 16 143.17 437.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 16 184.10 1061.14
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 16 374.41 3645.64
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 16 935.45 13029.54
matth commented

Results for AWS Graviton 3 Processor (c7g.4xlarge instance type).

Compiled with -march=native -ffast-math.

./extra/bench-all.sh 8

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 8 125.92 230.33
Graviton 3 Ubuntu 22.04 NEON base 8 160.17 547.88
Graviton 3 Ubuntu 22.04 NEON small 8 299.59 2138.86
Graviton 3 Ubuntu 22.04 NEON medium 8 741.49 6999.33
Graviton 3 Ubuntu 22.04 NEON large 8 1313.95 14174.00

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 121.92 158.61
Graviton 3 Ubuntu 22.04 NEON base 16 156.01 386.78
Graviton 3 Ubuntu 22.04 NEON small 16 299.85 1596.38
Graviton 3 Ubuntu 22.04 NEON medium 16 750.93 5351.24
Graviton 3 Ubuntu 22.04 NEON large 16 1313.82 11115.69

@matth Do you observe a significant performance difference with / without -march=native -ffast-math?

matth commented

@ggerganov -ffast-math seems to make only a very small difference that could be noise between runs

-march=native does seem to make a big difference; without it, FP16_VA is not reported as being enabled (I can get this with -march=armv8.4-a+bf16+fp16fml) - I think -march=native is enabling more intrinsics than this, though.
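
One way to see which of these features a given -march value actually turns on is to print the ACLE feature-test macros the compiler defines; a tiny illustrative check (the macro names are worth double-checking against the ACLE documentation):

// Compile with different -march= values and compare the output.
#include <stdio.h>

int main(void) {
#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    printf("FP16 vector arithmetic: yes\n");
#else
    printf("FP16 vector arithmetic: no\n");
#endif
#if defined(__ARM_FEATURE_FP16_FML)
    printf("FP16 FML (fp16fml):     yes\n");
#else
    printf("FP16 FML (fp16fml):     no\n");
#endif
#if defined(__ARM_FEATURE_BF16_VECTOR_ARITHMETIC)
    printf("BF16 vector arithmetic: yes\n");
#else
    printf("BF16 vector arithmetic: no\n");
#endif
    return 0;
}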

Results without any -march or -ffast-math flags ...

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 124.25 320.53
Graviton 3 Ubuntu 22.04 NEON base 16 156.91 734.22
Graviton 3 Ubuntu 22.04 NEON small 16 301.78 2812.75
Graviton 3 Ubuntu 22.04 NEON medium 16 714.23 9139.86
Graviton 3 Ubuntu 22.04 NEON large 16 1298.33 18147.47

I have tried to improve things by using OpenBLAS and armpl.h, but they both slow it down considerably - I'll keep trying with the latter.

Are there any possibilities for further optimisations in ggml.c that can take advantage of the situation where you have bf16 functions but not BLAS or Accelerate?

CPU OS Config Model Threads Load [ms] Encode [ms]
E5-2640 Ubuntu 18.04 AVX2 tiny 8 235.10 1094.45
E5-2640 Ubuntu 18.04 AVX2 base 8 326.11 2307.32
E5-2640 Ubuntu 18.04 AVX2 small 8 669.31 7706.24

@matth
My experiments with OpenBLAS on x86 showed that it is not faster compared to hand-written AVX2 + FP16:
fbd513b

It seems this is also the case for Arm based on your experiments. My guess is that we don't see improvement because the computation is memory-bound and OpenBLAS works with FP32.

The reason CBLAS is so fast on Apple Silicon is that it utilizes the matrix co-processor, which is somehow very efficient even for FP32. At least this is how I explain the results that I am seeing.

It would be interesting to see if armpl.h can provide some more insight - I haven't used it.

The heaviest stuff in ggml.c is the mul_mat_f16 and flash_attn_f16 calls. I think the conv_1d_... calls could probably be optimized more, but they are called only once at the start of the Encoder, so the improvement would be marginal.

Also, I am just looking at whisper.cpp and I realize I have forgotten why I use Flash Attention only in the Encoder and not also in the Decoder. Maybe this can help, because Flash Attention reduces the memory transfers and improves cache locality.

Not sure about bf16 compared to fp16. I don't expect it to provide a big improvement, based on a quick search through some articles about the difference between the 2 data types.
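
For readers wondering what the F16 path buys conceptually: the weights stay in half precision in memory (half the bandwidth for a memory-bound workload) while the accumulation is done in FP32. A scalar sketch of that idea (the real ggml kernels use NEON / AVX intrinsics and hardware FP16 where available; this only shows the concept):

// FP16-stored, FP32-accumulated dot product - a scalar sketch, not the ggml kernel.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Convert IEEE half precision to float (subnormals flushed to zero for brevity).
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant =  h & 0x3ff;
    uint32_t bits;
    if (exp == 0) {
        bits = sign;                                   // zero / subnormal -> zero
    } else if (exp == 31) {
        bits = sign | 0x7f800000u | (mant << 13);      // inf / NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

// Half the memory traffic of an FP32 dot product, same accumulation precision.
static float dot_f16(const uint16_t *x, const uint16_t *y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += fp16_to_fp32(x[i]) * fp16_to_fp32(y[i]);
    }
    return sum;
}

int main(void) {
    // 0x3C00 = 1.0, 0x4000 = 2.0, 0x4200 = 3.0 in half precision
    const uint16_t x[3] = { 0x3C00, 0x4000, 0x4200 };
    const uint16_t y[3] = { 0x4000, 0x4000, 0x3C00 };
    printf("dot = %f\n", dot_f16(x, y, 3));  // 1*2 + 2*2 + 3*1 = 9
    return 0;
}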

https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1

Gives a good write-up, if Medium doesn't try to charge you.

https://nod.ai/comparing-apple-m1-with-amx2-m1-with-neon/

Maybe after the M3 comes out I'll be able to pick up a bargain M1 Mini.

I think fp16 is coming though, and it may help a bit:

OpenMathLib/OpenBLAS#3754

PS: for those of us without the secret Apple sauce, would implementing https://github.com/CNugteren/CLBlast be of any use on integrated GPUs?

tamo commented

OpenBLAS helps on Windows AMD64 with MSVC:

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 5 PRO 2400GE Windows 10 AVX2 medium 4 4259.10 116609.75
Ryzen 5 PRO 2400GE Windows 10 AVX2 BLAS medium 4 4259.58 75312.90
CPU OS Config Model Threads Load [ms] Encode [ms]
rk3588 Debian11 NEON tiny 8 232.45 2768.78
rk3588 Debian11 NEON base 8 308.36 6374.82
rk3588 Debian11 NEON small 8 626.23 25784.05
rk3588 Debian11 NEON medium 8 1667.23 86026.82
rk3588 Debian11 NEON large 8 4307.16 161328.59

CFLAGS = -I. -O3 -std=c11 -ffast-math -march=native

CPU OS Config Model Threads Load [ms] Encode [ms]
rk3588 Debian11 NEON tiny 8 230.69 2078.40
rk3588 Debian11 NEON base 8 299.10 4379.62
rk3588 Debian11 NEON small 8 621.43 18565.42
rk3588 Debian11 NEON medium 8 1532.61 65504.91
rk3588 Debian11 NEON large 8 3618.18 121710.31

If I try to compile with OpenBLAS in a separate build, Encode becomes approximately 2x slower, so either I am doing something wrong or with Armv8.2 it's just bad; it's -march=native that seems to make the above difference.

matth commented

Results on AWS mac2.metal instance:

CPU OS Config Model Threads Load [ms] Encode [ms]
mac2.metal OSX Ventura NEON BLAS tiny 4 64.39 184.98
mac2.metal OSX Ventura NEON BLAS base 4 87.93 368.04
mac2.metal OSX Ventura NEON BLAS small 4 198.80 1212.46
mac2.metal OSX Ventura NEON BLAS medium 4 551.49 3552.73
mac2.metal OSX Ventura NEON BLAS large 4 1042.91 6726.99

I tried disabling Accelerate and it makes a significant difference (i.e. very much slower without it!).

I assumed Accelerate was using the Neural Engine, but using both powermetrics and asitop I cannot see any utilization; both report 0 mW power usage. Can anyone confirm on an M1 machine?

EDIT: Possibly I was confused. Apple's Matrix Coprocessor (AMX) and the Neural Engine are different things; from @ggerganov's other issues and commits it appears Accelerate might be using the former.

CPU OS Config Model Threads Load [ms] Encode [ms]
i9-13900k WSL2 Ubuntu AVX2 tiny 4 58.49 360.95
i9-13900k WSL2 Ubuntu AVX2 base 4 72.44 756.48
i9-13900k WSL2 Ubuntu AVX2 small 4 154.37 2676.12
i9-13900k WSL2 Ubuntu AVX2 medium 4 393.76 8924.90
i9-13900k WSL2 Ubuntu AVX2 large 4 698.69 15862.58
i9-13900k WSL2 Ubuntu AVX2 tiny 8 55.13 291.51
i9-13900k WSL2 Ubuntu AVX2 base 8 70.93 603.33
i9-13900k WSL2 Ubuntu AVX2 small 8 141.85 1800.05
i9-13900k WSL2 Ubuntu AVX2 medium 8 356.29 5946.78
i9-13900k WSL2 Ubuntu AVX2 large 8 658.83 10868.89
CPU OS Config Model Threads Load [ms] Encode [ms]
E5-2697 V2 MacOS Monterey 12.6.1 BLAS tiny 4 301.22 872.27
E5-2697 V2 MacOS Monterey 12.6.1 BLAS base 4 405.40 1705.58
E5-2697 V2 MacOS Monterey 12.6.1 BLAS small 4 921.24 5419.73
E5-2697 V2 MacOS Monterey 12.6.1 BLAS medium 4 2356.76 15188.90
E5-2697 V2 MacOS Monterey 12.6.1 BLAS large 4 4457.29 26444.06
E5-2697 V2 MacOS Monterey 12.6.1 BLAS tiny 8 299.89 540.47
E5-2697 V2 MacOS Monterey 12.6.1 BLAS base 8 419.41 1129.01
E5-2697 V2 MacOS Monterey 12.6.1 BLAS small 8 888.64 3632.89
E5-2697 V2 MacOS Monterey 12.6.1 BLAS medium 8 2377.96 10525.92
E5-2697 V2 MacOS Monterey 12.6.1 BLAS large 8 4412.20 18933.41

Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS tiny 4 307.20 570.86
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS base 4 406.45 1183.90
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS small 4 941.96 4156.69
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS medium 4 3124.62 13072.06
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS large 4 10090.85 36383.82
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS tiny 8 299.42 487.26
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS base 8 403.74 1113.54
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS small 8 910.07 3955.48
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS medium 8 2241.90 13076.31
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS large 8 5620.87 25562.17

Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (12)

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-8700 Ubuntu 20.04.4 LTS AVX2 tiny 4 158.49 730.72
i7-8700 Ubuntu 20.04.4 LTS AVX2 base 4 205.93 1603.67
i7-8700 Ubuntu 20.04.4 LTS AVX2 small 4 426.62 5630.58
i7-8700 Ubuntu 20.04.4 LTS AVX2 medium 4 1080.15 18748.66
i7-8700 Ubuntu 20.04.4 LTS AVX2 large 4 1976.77 37188.47
i7-8700 Ubuntu 20.04.4 LTS AVX2 tiny 8 159.00 662.07
i7-8700 Ubuntu 20.04.4 LTS AVX2 base 8 206.62 1436.59
i7-8700 Ubuntu 20.04.4 LTS AVX2 small 8 428.20 5345.27
i7-8700 Ubuntu 20.04.4 LTS AVX2 medium 8 1108.97 16780.53
i7-8700 Ubuntu 20.04.4 LTS AVX2 large 8 1965.67 32019.44
i7-8700 Ubuntu 20.04.4 LTS AVX2 tiny 12 157.60 585.65
i7-8700 Ubuntu 20.04.4 LTS AVX2 base 12 216.74 1696.32
i7-8700 Ubuntu 20.04.4 LTS AVX2 small 12 428.51 4504.18
i7-8700 Ubuntu 20.04.4 LTS AVX2 medium 12 1081.65 15442.25
i7-8700 Ubuntu 20.04.4 LTS AVX2 large 12 1969.63 28108.55

Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz (4)

CPU OS Config Model Threads Load [ms] Encode [ms]
i3-9100F Ubuntu 20.04.4 LTS AVX2 tiny 4 164.71 726.05
i3-9100F Ubuntu 20.04.4 LTS AVX2 base 4 214.56 1806.20
i3-9100F Ubuntu 20.04.4 LTS AVX2 small 4 445.48 6613.19
i3-9100F Ubuntu 20.04.4 LTS AVX2 medium 4 1131.80 22667.64
i3-9100F Ubuntu 20.04.4 LTS AVX2 large 4 7615.74 42137.29

Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz (4)

CPU OS Config Model Threads Load [ms] Encode [ms]
E3-1220 V2 Ubuntu 20.04.3 LTS tiny 4 227.41 1757.56
E3-1220 V2 Ubuntu 20.04.3 LTS base 4 297.67 3801.48
E3-1220 V2 Ubuntu 20.04.3 LTS small 4 625.18 14544.59
E3-1220 V2 Ubuntu 20.04.3 LTS medium 4 9618.55 49937.12
E3-1220 V2 Ubuntu 20.04.3 LTS large 4 40399.48 71661.48

Has anyone tried benchmarking on WASM? It seems like the encoder takes much longer than on other platforms.

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-tiny.en 4 258.59 2934.34
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-tiny 4 255.46 2906.67
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-base.en 4 316.73 6197.29
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-base 4 319.93 5825.65
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-tiny.en 4 217.28 1548.92
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-tiny 4 215.59 1625.69
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-base.en 4 275.62 3823.34
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-base 4 275.72 3740.50
Cortex-A53 Android 10 NEON ggml-tiny.en 8 399.05 5841.70
Cortex-A53 Android 10 NEON ggml-tiny 8 376.25 5548.72
Cortex-A53 Android 10 NEON ggml-base.en 8 492.92 12728.42
Cortex-A53 Android 10 NEON ggml-base 8 1034.48 13365.86

Test-bench properties

  • Benchmarking is done on commit 3996ecc156486fb93ff505c01090d13192e72aa2.
  • Used cmake for building (mkdir build && cd build && cmake .. && make).
  • Compiler for Xubuntu 18.04 is gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
  • Compiler for Android 10 is clang version 15.0.2 (aarch64-unknown-linux-android24)
  • Used the following fish shell snippet to run the benchmarks:
# cwd is whisper.cpp/build
# Adding `-t 8` to `bench` for aarch64
$ for model in "ggml-tiny" "ggml-base"
      for suffix in "en.bin" "bin"
          ./bin/bench -m "../models/$model.$suffix"
      end
  end

Remarks

  • On x86, enabling BLAS (-DWHISPER_SUPPORT_OPENBLAS=ON) degrades the performance!
archi commented

Quite the difference between the 2017 Intel i3 4C/4T and the 2019 Ryzen Zen+ 6C/12T. And not looking good for AVX2 on the old AMD Zen+. I must admit, all in all I really envy the M1 for having that accelerator.

gcc vs clang doesn't seem to make a difference, at least it's not distinguishable from noise.

i3-8100

This is my home server. Tested while it was doing home server things (load 0.7). I can see this machine acting as a "whisper server" in a 2C configuration.

CPU OS Config Model Threads Load [ms] Encode [ms] Commit Compiler
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 1 88.38 2013.67 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 base 1 113.58 4692.04 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 small 1 225.74 18469.62 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 2 89.55 1189.92 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 base 2 119.97 2756.52 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 small 2 238.71 10491.67 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 4 201.37 695.39 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 base 4 262.76 2023.16 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 small 4 526.66 6788.01 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 medium 4 3836.26 21889.30 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 large 4 26819.67 60880.62 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 4 89.05 696.08 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 base 4 114.65 1711.15 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 small 4 309.30 6995.25 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 medium 4 4854.02 23570.42 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 large 4 21415.07 60547.99 832b4f3 clang 14.0.6

Ryzen 1600AF

Just my desktop. The difference compared to the 5950X at 8C is really massive, but luckily it has no impact on daily usage, so I'm glad I can still hold off on upgrading to the last AM4 CPU generation 😂
Looking forward to benching CUDA on this machine (3080Ti).

CPU OS Config Model Threads Load [ms] Encode [ms] Commit Compiler
Ryzen 1600AF Manjaro AVX2 tiny 1 104.04 4691.38 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 base 1 134.54 11092.84 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 small 1 254.71 43923.42 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 tiny 4 107.40 1336.49 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 base 4 132.69 3062.12 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 small 4 262.27 11655.22 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 medium 4 662.81 38829.74 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 large 4 1365.09 77063.30 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 tiny 6 100.82 1007.36 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 base 6 130.20 2472.55 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 small 6 256.83 9311.54 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 medium 6 657.89 28051.40 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 large 6 1190.62 54292.72 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 tiny 6 104.77 1012.70 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 base 6 137.00 2212.20 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 small 6 257.97 9296.33 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 medium 6 624.04 28524.38 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 large 6 1189.10 56445.31 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 tiny 12 101.41 898.96 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 base 12 139.26 2200.78 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 small 12 256.50 8125.48 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 medium 12 623.59 29255.08 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 large 12 1192.90 51902.81 832b4f3 gcc 12.2.0
CPU OS Config Model Threads Load [ms] Encode [ms] Commit Compiler
POWER9v2 Gentoo -Ofast -mcpu=native base.en 4/64 144.84 42708.33 85c9ac1 clang 15.0.3
POWER9v2 Gentoo -Ofast -mcpu=native base.en 16/64 161.95 22302.28 85c9ac1 clang 15.0.3
POWER9v2 Gentoo -Ofast -mcpu=native base.en 32/64 142.06 20263.56 85c9ac1 clang 15.0.3
POWER9v2 Gentoo -Ofast -mcpu=native base.en 64/64 160.51 12645.79 85c9ac1 clang 15.0.3

@Xavier-i
WASM performance is much worse compared to native - this is expected.
Today I added the bench.wasm that can be used to benchmark performance in the browser.

Link: https://whisper.ggerganov.com/bench/

j1nx commented

Redo of my OpenVoiceOS Raspberry Pi 4 benchmark

CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny.en 4 735 9486 aa6adda
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base.en 4 950 25402 aa6adda
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny.en 4 752 9178 aa6adda
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base.en 4 969 19642 aa6adda

And just (and only) because we can, the same on a Raspberry Pi 3B+ running the same codebase / OS

CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON tiny.en 4 1331 22573 aa6adda
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON base.en 4 5886 58733 aa6adda
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON BLAS tiny.en 4 1333 21184 aa6adda
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON BLAS base.en 4 4605 47877 aa6adda
matth commented

I hope this isn't misplaced but I thought it interesting to share ...

I have recently finished some tests comparing whisper.cpp runtime performance against the original PyTorch version on various GPUs and CPUs.

We test against a fixed set of long form audio files (UK TV, each file ~1 hour long, mixed speech and noise) and record the runtime as a factor of real audio time.

Depending on the software and environment, transcription can take anywhere from around 5x real-time down to 0.14x real-time.

The ARM-based whisper.cpp runtime is very impressive; in particular, the Apple M1 performance can match that of the original PyTorch version on NVIDIA V100 and T4 GPUs ...

CPU / GPU OS Config Model Threads xRT Transcribe
Intel Xeon Ubuntu 22.04 whisper original - pytorch cpu medium.en 8 4.78
Intel Xeon Ubuntu 22.04 whisper.cpp - AVX2 medium.en 8 4.44
Graviton 3 Ubuntu 22.04 whisper.cpp - NEON medium.en 8 0.63
mac2.metal OSX Ventura whisper.cpp - NEON BLAS medium.en 4 0.26
NVIDIA V100 Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.25
NVIDIA T4 Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.25
NVIDIA A10G Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.16
NVIDIA A100 Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.14

Additionally I did some very rough power consumption tests, again whisper.cpp on the M1 is really impressive against PyTorch on the GPU.

Platform Whisper Type Model Avg Power Peak Power
Apple M1 whisper.cpp ggml-medium.en 13202 mW 18412 mW
Nvidia T4 pytorch medium.en 69587 mW 85650 mW

Thanks for the fantastic work @ggerganov - this is a really inspiring project and demonstrates the ARM FP16 functionality wonderfully. Off to buy some more Apple Macs now ;)

@matth @rgerganov I've been thinking myself that the perf/watt for ML here is truly outstanding, and just wondered whether the 8GB machine can squeeze in the medium model, as I'm not sure how memory is shared on the M1 - or is it really a case of needing the 16GB?

@matth
Thanks for the data - it's interesting to see.

However, there are some important caveats to consider when benchmarking the 2 implementations that I've been meaning to discuss, so here are my thoughts on this:

At a high-level, the Whisper transcription is a combination of 2 main parts:

  • transformer model evaluation
  • decoding strategy

The first part is branchless and does not depend on the audio input or the parameters that you use. For a given model, evaluating the transformer requires the same amount of operations every time. This is easy to benchmark.

The second part (decoding strategy) is different. The number of operations here depends both on the audio input contents and the decoding parameters / strategy that you use. For example, two different audio recordings with the same time length generally result in different decoded text based on the speech content and hence can take a different amount of processing (even with the same decoding parameters). Also, the decoded timestamp tokens affect how the 30s sliding window of the transcription is updated and therefore can lead to a different number of transformer evaluations in total.

My understanding is that there is no "correct" decoding strategy. The OpenAI implementation generally offers 2 different strategies - Greedy and BeamSearch. Both of them are combinations of various heuristics that aim to improve the text coherency and reduce the number of catastrophic failures.

In whisper.cpp we currently have a Greedy strategy which is similar to the one in the OpenAI repo, but is not exactly the same.

So all of this means that there is no point in comparing the 2 implementations by measuring the total time to transcribe an audio, because the decoding strategy is not the same and therefore the variation will be very large due to the factors outlined above. It only makes sense to benchmark the transformer evaluation in isolation, because it is well-defined.

That is why in the benchmarks in this issue, I chose to run the Encoder on some random input buffer. The Encoder is the heavy part of the transformer and being able to evaluate it efficiently is very important and is the most defining factor for the efficiency of the implementation. It's the "engine" of the transcription. You can then put on top of it any decoding strategy that you like and this will define how accurate your transcription is. But it does not make sense to benchmark the performance of that anymore.

I think if we want to make a fair comparison with PyTorch, we need to have the bench tool implemented in python using PyTorch. Any other comparison will be flawed to some extent.

But in any case, your results are interesting - thanks for sharing them.
What parameters did you use for the PyTorch runs?


Regarding the power consumption - I think there is more we can do in whisper.cpp. Currently, the thread synchronization uses busy loops, which is very power-inefficient because it keeps the CPU at 100%, but it gives a slight performance edge. I am thinking of adding an option that uses condition-variable synchronization, which will likely reduce the power usage at the cost of some performance. For some use cases, it could be beneficial to have lower power consumption.
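
To illustrate the trade-off being described, here is a generic sketch (not ggml's actual thread pool) of the two wait styles: spinning on an atomic flag wakes up fastest but keeps the core at 100%, while blocking on a condition variable lets the core idle at the cost of a slightly slower wake-up.

// Two ways for a worker to wait for new work - illustrative only.
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool     has_work;     // used by the busy-wait variant
    bool            has_work_cv;  // used by the condvar variant (guarded by mutex)
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
} work_signal;

// Busy-wait: minimal wake-up latency, CPU pinned at 100% between tasks.
static void wait_busy(work_signal *s) {
    while (!atomic_load_explicit(&s->has_work, memory_order_acquire)) {
        // spin (optionally with a pause/yield hint)
    }
    atomic_store_explicit(&s->has_work, false, memory_order_relaxed);
}

// Condition variable: the thread sleeps in the kernel until signalled,
// saving power at the cost of a somewhat slower wake-up.
static void wait_condvar(work_signal *s) {
    pthread_mutex_lock(&s->mutex);
    while (!s->has_work_cv) {
        pthread_cond_wait(&s->cond, &s->mutex);
    }
    s->has_work_cv = false;
    pthread_mutex_unlock(&s->mutex);
}

static void signal_condvar(work_signal *s) {
    pthread_mutex_lock(&s->mutex);
    s->has_work_cv = true;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->mutex);
}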

matth commented

Thanks @ggerganov , we are using PyTorch whisper with default settings in that benchmark so I believe that is a beam search decoder. I will see if I can test again with the greedy decoder for a more similar comparison. I think I understand your point though - these are not like for like implementations so at a certain level the comparison is flawed.

I also neglected to measure the PyTorch version on the M1 & Graviton which was a huge oversight!

There's a motivation behind these benchmarks. Looking at various solutions as improvements to existing transcription capabilities - each solution in my mind is a balance of accuracy, completeness, runtime, financial cost and energy efficiency.

On one end you have paying humans to do the transcription, slow and expensive but very accurate and something that is still done at a massive scale in my industry. At the other end there are existing Kaldi models that are less accurate but incredibly fast for inference on the CPU and very cheap to run.

I feel larger transformer models like Whisper sit somewhat in the middle of all this - closer to human accuracy but increased associated costs over existing software.

But whisper.cpp adds to this: if we can get similar, or even just acceptable, accuracy and runtime on commodity hardware, the choice starts to become more about cost, efficiency and functionality. E.g. you could buy 30+ Apple Macs for the price of an NVIDIA A100 server, being able to run Whisper on a laptop enables a different set of use cases, you can cut power consumption by a huge margin, etc.

I think for me this is one of the many exciting outcomes of this project :)

@matth
Yeah - the default in PyTorch when running from the command line is BeamSearch.
I haven't measured it exactly, but it is significantly slower compared to Greedy.

I think regarding the total-time benchmark - it can make sense once whisper.cpp reaches the accuracy of OpenAI. Currently, due to the inferior decoding, whisper.cpp has lower transcription accuracy (based on some results I saw floating around). But when the decoding gets improved and we have comparable accuracy, then we can make a benchmark that says:

"for a given word error rate (WER) the 2 implementation take this amount of processing time on average, over some large set of audio"

And another thing I was thinking is that even if today whisper.cpp is more efficient on Apple Macs - it is not going to be always the case. If I understand correctly, it's just a matter of time for the proper Apple Silicon frameworks (Metal, MPS, etc.) to become supported in PyTorch, Tensorflow, etc and when this happens (probably very soon), the performance of whisper.cpp will be the same or possibly worse.

So yeah - just trying to adjust expectations :) Will probably write some more on this in the F.A.Q. discussion.

CPU OS Config Model Th Load Enc. Commit
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS tiny.en 4 175 360 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS base.en 4 233 736 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS small.en 4 507 2400 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS medium.en 4 1333 6860 7282e21

Using 8 threads is slightly slower to load, faster to encode:

CPU OS Config Model Th Load Enc. Commit
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS tiny.en 8 185 283 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS base.en 8 241 579 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS small.en 8 526 1959 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS medium.en 8 1390 6271 7282e21
mgc8 commented
CPU OS Config Model Th Load Enc. Commit
MacBookPro M1 Max macOS 12.6 NEON BLAS tiny 8 65 108 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS base 8 86 250 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS small 8 185 789 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS medium 8 493 2126 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS large 8 955 3860 a593b93

There are actually 10 threads, but when using -t 10 the performance goes down. Lower numbers (such as -t 4) result in similar load performance, but slower encode (although not linear).

kha84 commented

AMD Ryzen 5 3400G (4 CPU cores, 8 threads) on Ubuntu 22.10 with 5.19.0-26-generic Kernel

4 threads

CPU OS Config Model Th Load Enc. Commit
3400G Ubuntu 22.10 AVX2 tiny 4 163 1415 0be6a1a
3400G Ubuntu 22.10 AVX2 tiny.en 4 175 1351 0be6a1a
3400G Ubuntu 22.10 AVX2 base.en 4 200 3095 0be6a1a
3400G Ubuntu 22.10 AVX2 base 4 205 3241 0be6a1a
3400G Ubuntu 22.10 AVX2 small.en 4 412 12343 0be6a1a
3400G Ubuntu 22.10 AVX2 small 4 421 11983 0be6a1a
3400G Ubuntu 22.10 AVX2 medium.en 4 995 38818 0be6a1a
3400G Ubuntu 22.10 AVX2 medium 4 1006 38573 0be6a1a
3400G Ubuntu 22.10 AVX2 large-v1 4 0be6a1a
3400G Ubuntu 22.10 AVX2 large 4 1870 77302 0be6a1a

8 threads is just marginally better

CPU OS Config Model Th Load Enc. Commit
3400G Ubuntu 22.10 AVX2 tiny.en 8 191 1275 0be6a1a
3400G Ubuntu 22.10 AVX2 tiny 8 183 1258 0be6a1a
3400G Ubuntu 22.10 AVX2 base.en 8 232 2894 0be6a1a
3400G Ubuntu 22.10 AVX2 base 8 231 2927 0be6a1a
3400G Ubuntu 22.10 AVX2 small.en 8 435 11299 0be6a1a
3400G Ubuntu 22.10 AVX2 small 8 414 11511 0be6a1a
3400G Ubuntu 22.10 AVX2 medium.en 8 1011 37557 0be6a1a
3400G Ubuntu 22.10 AVX2 medium 8 1049 37306 0be6a1a
3400G Ubuntu 22.10 AVX2 large-v1 8 0be6a1a
3400G Ubuntu 22.10 AVX2 large 8 3237 77396 0be6a1a

Someone mentioned BLAS?

bmilde commented

What's the performance gain of this against the original implementation, with PyTorch compiled with AVX support or the PyTorch M1 backend?

Does this implementation use beam decoding? (original pytorch impl has n=5 as default and is 100% faster with n=1)

Edit: README already mentions it's greedy decoding:

Very basic greedy sampling scheme - always pick up the token with highest probability. This should be similar to the GreedyDecoder from the original python implementation, so in order to make a fair comparison between the 2 implementations, make sure to run the python code with the following parameters:

whisper --best_of None --beam_size None ...

Greedy decoding is also 2x faster in the original implementation (on a GPU).

Orange Pi 5 4GB, Micro-SD not NVMe

It starts to touch zram swap on medium and then hits file swap pretty hard on large.

CPU OS Config Model Th Load Enc. Commit
rk3588s Bullseye 5.10.110 NEON tiny 8 352 2876 0be6a1a
rk3588s Bullseye 5.10.110 NEON base 8 346 6213 0be6a1a
rk3588s Bullseye 5.10.110 NEON small 8 690 25808 0be6a1a
rk3588s Bullseye 5.10.110 NEON medium 8 23987 93995 0be6a1a
rk3588s Bullseye 5.10.110 NEON large 8 49633 190601 0be6a1a

Even with the 4:4 big:little layout, it's a touch faster with taskset -c 4-7 ./extra/bench-all.sh:

CPU OS Config Model Th Load Enc. Commit
rk3588s Bullseye 5.10.110 NEON tiny 4 356 2716 0be6a1a
rk3588s Bullseye 5.10.110 NEON base 4 417 6661 0be6a1a
rk3588s Bullseye 5.10.110 NEON small 4 943 25357 0be6a1a
rk3588s Bullseye 5.10.110 NEON medium 4 17748 90187 0be6a1a
rk3588s Bullseye 5.10.110 NEON large 4 48793 182800 0be6a1a

Compiling on the rk3588 with -march=native -ffast-math seems to give a big boost (taskset -c 4-7 ./extra/bench-all.sh):

CPU OS Config Model Th Load Enc. Commit
rk3588s Bullseye 5.10.110 NEON tiny 4 280 1074 0be6a1a
rk3588s Bullseye 5.10.110 NEON base 4 466 3491 0be6a1a
rk3588s Bullseye 5.10.110 NEON small 4 780 11052 0be6a1a
rk3588s Bullseye 5.10.110 NEON medium 4 15361 42252 0be6a1a
rk3588s Bullseye 5.10.110 NEON large 4 49331 91892 0be6a1a

Intel Celeron N4120 (4 cores, 4 threads) on Artix Linux 6.0.12-artix1-1.

CPU OS Config Model Th Load Enc. Commit
N4120 Artix 6.0.12-artix1-1 BLAS tiny 4 330 12272 65fdcbb
N4120 Artix 6.0.12-artix1-1 BLAS base 4 65fdcbb
N4120 Artix 6.0.12-artix1-1 BLAS small 4 892 83209 65fdcbb
N4120 Artix 6.0.12-artix1-1 BLAS medium 4 5478 237677 65fdcbb

Base 14-inch M1 MacBook Pro with NEON enabled:

CPU OS Config RAM (GB) Th Model Load (ms) Enc. (ms) Total
M1 Pro OSX 12.5.1 NEON 16 8 Tiny.en 107 269.72 376.91
M1 Pro OSX 12.5.1 NEON 16 8 Base.en 92 321 413.77
M1 Pro OSX 12.5.1 NEON 16 8 Small.en 264 978 1243.24

16-inch base Apple M2 Pro results

CPU OS Config RAM (GB) Th Model Load (ms) Enc. (ms) Total (ms)
M2 Pro OSX 13.2 NEON 16 8 Tiny.en 118 143 261
M2 Pro OSX 13.2 NEON 16 8 Tiny 118 143 261
M2 Pro OSX 13.2 NEON 16 8 Base.en 173 235 408
M2 Pro OSX 13.2 NEON 16 8 Base 148 266 414
M2 Pro OSX 13.2 NEON 16 8 Small.en 304 739 1042
M2 Pro OSX 13.2 NEON 16 8 Small 277(?) 720 997
M2 Pro OSX 13.2 NEON 16 8 Medium.en 747 2057 2804
M2 Pro OSX 13.2 NEON 16 8 Medium 657 2055 2712
M2 Pro OSX 13.2 NEON 16 8 Large 2126 4223 6349

I couldn't get bench to run on my iPhone 12, so I have attached my ad-hoc results below with the input audio "I love transcriber apps":

CPU DGGML_USE_ACCELERATE OS Model Load Mel Sample Enc. Dec. Total (ms)
A14 Release IOS 16.1 Base.en 150 23 2 2447 112 2584

--

This might appear obvious to some, but it wasn't to me, so I'll note it here: I saw much better results using large step lengths and sample sizes with ./stream. I feel like, under the hood, Whisper relies heavily on whole-sentence context to infer individual words.

j1nx commented

With the new beta 1.1.0 release. At first glance, not too much difference. I will not rebuild without OpenBLAS, as it was slightly better with it on the RPi 4.

CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 751 9506 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny.en 4 748 9295 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 971 23512 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base.en 4 958 24263 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS small 4 2238 84720 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS small.en 4 3880 86031 ecda7f786a

Results on 12th Gen Intel(R) Core(TM) i3-12300T:

CPU OS Config Model Th Load Enc. Commit
Core i3-12300T Debian 11 (Docker on Win11) AVX2 tiny.en 4 97 679 49b529b
Core i3-12300T Debian 11 (Docker on Win11) AVX2 tiny 4 90 580 49b529b
Core i3-12300T Debian 11 (Docker on Win11) AVX2 base 4 138 1478 49b529b

With OpenBLAS (considerably worse):

CPU OS Config Model Th Load Enc. Commit
Core i3-12300T Debian 11 (Docker on Win11) AVX2 BLAS tiny 4 117 1644 49b529b
Core i3-12300T Debian 11 (Docker on Win11) AVX2 BLAS base 4 122 2890 49b529b
johtso commented

The benchmarks for the MacBook Pro M1 are using 8 threads, but in my experience it runs nearly twice as fast with 4 threads. Am I missing something?

Edit:
I just ran the benchmark with the large model... and it actually made almost no difference whether 8 or 4 threads were used. But with real-world workloads it makes a huge difference. Interesting.

Running memcpy benchmark with 1 thread
memcpy: 8.66 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat:    64 x    64: F16      4.2 GFLOPS (128 runs) / F32      3.5 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     10.1 GFLOPS (128 runs) / F32      6.3 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     13.0 GFLOPS (128 runs) / F32      7.2 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     14.0 GFLOPS ( 53 runs) / F32      7.1 GFLOPS ( 27 runs)
ggml_mul_mat:  1024 x  1024: F16     29.8 GFLOPS ( 15 runs) / F32     17.8 GFLOPS (  9 runs)
ggml_mul_mat:  2048 x  2048: F16     37.8 GFLOPS (  3 runs) / F32     19.6 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     40.0 GFLOPS (  3 runs) / F32     17.4 GFLOPS (  3 runs)

Running benchmark for all models

CPU OS Config Model Th Load Enc. Commit
rk3588s Ubuntu 22.04 NEON tiny 4 257 1179 21c569b
rk3588s Ubuntu 22.04 NEON base 4 326 2967 21c569b
rk3588s Ubuntu 22.04 NEON small 4 661 10560 21c569b
rk3588s Ubuntu 22.04 NEON medium 4 23188 35867 21c569b
mscdex commented

Compiler: gcc version 12.2.0 (Ubuntu 12.2.0-3ubuntu1)

memcpy: 16.74 GB/s
sum:    error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16     16.2 GFLOPS (128 runs) / F32     16.4 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     70.1 GFLOPS (128 runs) / F32     66.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    133.9 GFLOPS (128 runs) / F32    105.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    161.2 GFLOPS (128 runs) / F32    109.3 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    204.4 GFLOPS ( 96 runs) / F32    121.9 GFLOPS ( 57 runs)
ggml_mul_mat:  2048 x  2048: F16    254.4 GFLOPS ( 15 runs) / F32    149.3 GFLOPS (  9 runs)
ggml_mul_mat:  4096 x  4096: F16    184.2 GFLOPS (  3 runs) / F32     54.1 GFLOPS (  3 runs)

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat:    64 x    64: F16      8.4 GFLOPS (128 runs) / F32      9.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     58.1 GFLOPS (128 runs) / F32     57.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    170.3 GFLOPS (128 runs) / F32    159.9 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    315.7 GFLOPS (128 runs) / F32    230.8 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    356.0 GFLOPS (128 runs) / F32    224.9 GFLOPS (105 runs)
ggml_mul_mat:  2048 x  2048: F16    499.5 GFLOPS ( 30 runs) / F32    292.4 GFLOPS ( 18 runs)
ggml_mul_mat:  4096 x  4096: F16    265.9 GFLOPS (  3 runs) / F32     66.2 GFLOPS (  3 runs)

Running ggml_mul_mat benchmark with 16 threads

ggml_mul_mat:    64 x    64: F16      3.6 GFLOPS (128 runs) / F32      3.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     16.7 GFLOPS (128 runs) / F32     27.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     88.1 GFLOPS (128 runs) / F32    126.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    263.5 GFLOPS (128 runs) / F32    229.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    396.1 GFLOPS (128 runs) / F32    272.8 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    498.6 GFLOPS ( 30 runs) / F32    314.9 GFLOPS ( 19 runs)
ggml_mul_mat:  4096 x  4096: F16    337.7 GFLOPS (  3 runs) / F32    112.0 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 tiny.en 4 104 247 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 base.en 4 130 585 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 small.en 4 264 1940 78f1661
--- -- ------ ----- -- ---- ---- ------
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 tiny.en 8 99 166 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 base.en 8 123 329 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 small.en 8 262 1148 78f1661
--- -- ------ ----- -- ---- ---- ------
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 tiny.en 16 100 160 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 base.en 16 123 338 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 small.en 16 262 1139 78f1661

Tested on my M2 Macbook Air:

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

./extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 31.42 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 11.8 GFLOPS (128 runs) / F32 10.6 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 89.9 GFLOPS (128 runs) / F32 74.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 434.5 GFLOPS (128 runs) / F32 419.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 885.4 GFLOPS (128 runs) / F32 913.2 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1023.4 GFLOPS (128 runs) / F32 1037.7 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 971.6 GFLOPS ( 57 runs) / F32 950.1 GFLOPS ( 56 runs)
ggml_mul_mat: 4096 x 4096: F16 914.9 GFLOPS ( 7 runs) / F32 820.7 GFLOPS ( 6 runs)

CPU OS Config Model Th Load Enc. Commit
M2 OSX 13.0.1 NEON BLAS tiny 4 63 153 1a91c19
M2 OSX 13.0.1 NEON BLAS base 4 92 329 1a91c19
M2 OSX 13.0.1 NEON BLAS small 4 198 1014 1a91c19
M2 OSX 13.0.1 NEON BLAS medium 4 564 3042 1a91c19
M2 OSX 13.0.1 NEON BLAS large 4 1152 5466 1a91c19

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat: 64 x 64: F16 5.7 GFLOPS (128 runs) / F32 3.9 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 45.0 GFLOPS (128 runs) / F32 25.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 272.7 GFLOPS (128 runs) / F32 166.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 747.6 GFLOPS (128 runs) / F32 748.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 998.7 GFLOPS (128 runs) / F32 895.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 716.0 GFLOPS ( 42 runs) / F32 717.2 GFLOPS ( 42 runs)
ggml_mul_mat: 4096 x 4096: F16 790.4 GFLOPS ( 6 runs) / F32 726.3 GFLOPS ( 6 runs)

CPU OS Config Model Th Load Enc. Commit
M2 OSX 13.0.1 NEON BLAS tiny 8 66 154 1a91c19
M2 OSX 13.0.1 NEON BLAS base 8 92 346 1a91c19
M2 OSX 13.0.1 NEON BLAS small 8 211 1171 1a91c19
M2 OSX 13.0.1 NEON BLAS medium 8 562 3848 1a91c19
M2 OSX 13.0.1 NEON BLAS large 8 1079 6230 1a91c19

This is bench result :

whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 500.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 1245.39 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 88596.32 ms / 1 runs (88596.32 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 89841.85 ms

This is cpuinfo :

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
microcode : 0x2f
cpu MHz : 2990.383
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 4983.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
microcode : 0x2f
cpu MHz : 2990.384
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 4983.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
microcode : 0x2f
cpu MHz : 2990.384
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 4983.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
microcode : 0x2f
cpu MHz : 2990.384
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 4983.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

./bench -w 1 -t 1

memcpy: 3.35 GB/s
sum: error -536870997.000000
./bench -w 2 -t 1

ggml_mul_mat: 64 x 64: F16 0.7 GFLOPS (128 runs) / F32 3.3 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 0.7 GFLOPS (128 runs) / F32 3.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 0.6 GFLOPS ( 18 runs) / F32 3.3 GFLOPS ( 99 runs)
ggml_mul_mat: 512 x 512: F16 0.6 GFLOPS ( 3 runs) / F32 3.6 GFLOPS ( 14 runs)
ggml_mul_mat: 1024 x 1024: F16 0.7 GFLOPS ( 3 runs) / F32 2.3 GFLOPS ( 3 runs)
ggml_mul_mat: 2048 x 2048: F16 0.7 GFLOPS ( 3 runs) / F32 2.4 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 1.2 GFLOPS ( 3 runs) / F32 3.0 GFLOPS ( 3 runs)

ThinkPad T520, on Linux Mint Debian Edition, with AVX1 commented out in the Makefile

Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 38.84 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 9.8 GFLOPS (128 runs) / F32 8.4 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 69.4 GFLOPS (128 runs) / F32 62.1 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 455.3 GFLOPS (128 runs) / F32 383.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1141.1 GFLOPS (128 runs) / F32 1550.2 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 2302.0 GFLOPS (128 runs) / F32 2962.9 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 3035.6 GFLOPS (128 runs) / F32 3217.5 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16 3431.7 GFLOPS ( 25 runs) / F32 3510.6 GFLOPS ( 26 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
M1 Ultra 13.2 NEON BLAS tiny 4 71 139 2bee265
M1 Ultra 13.2 NEON BLAS base 4 95 266 2bee265
M1 Ultra 13.2 NEON BLAS small 4 222 806 2bee265
M1 Ultra 13.2 NEON BLAS medium 4 598 2175 2bee265
M1 Ultra 13.2 NEON BLAS large 4 1165 3895 2bee265

Here are new results for POWER9, now that #300 is closed.

Running memcpy benchmark with 1 thread

memcpy: 6.32 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 32 threads

ggml_mul_mat:    64 x    64: F16      0.4 GFLOPS (128 runs) / F32      0.4 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16      2.8 GFLOPS (128 runs) / F32      2.8 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     13.4 GFLOPS (128 runs) / F32     23.0 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     32.9 GFLOPS (123 runs) / F32     87.9 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     47.9 GFLOPS ( 23 runs) / F32    127.4 GFLOPS ( 60 runs)
ggml_mul_mat:  2048 x  2048: F16     58.5 GFLOPS (  4 runs) / F32     67.3 GFLOPS (  4 runs)
ggml_mul_mat:  4096 x  4096: F16     23.8 GFLOPS (  3 runs) / F32     21.2 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
CPU OS Config Model Th Load Enc. Commit Compiler
POWER9 Debian 11 tiny 32 75 1283 3b010f9 GCC 10.2.1
POWER9 Debian 11 base 32 96 2786 3b010f9 GCC 10.2.1
POWER9 Debian 11 small 32 182 8534 3b010f9 GCC 10.2.1
POWER9 Debian 11 medium 32 463 22282 3b010f9 GCC 10.2.1
POWER9 Debian 11 large 32 838 41106 3b010f9 GCC 10.2.1

I got referred here from openai/whisper#978 (comment)
This seems really interesting.

I'm running on Oracle Cloud's free tier, which provides 4x Ampere A1 CPUs and 24 GB RAM.


Compiler:

I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

Default (no changes)

~/whisper.cpp$ extra/bench-all.sh
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 10.92 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      1.0 GFLOPS (128 runs) / F32      0.7 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     16.8 GFLOPS (128 runs) / F32     13.2 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     18.5 GFLOPS (128 runs) / F32     41.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     21.5 GFLOPS ( 81 runs) / F32     35.4 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     23.2 GFLOPS ( 11 runs) / F32     41.4 GFLOPS ( 20 runs)
ggml_mul_mat:  2048 x  2048: F16     23.4 GFLOPS (  3 runs) / F32     32.6 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     22.5 GFLOPS (  3 runs) / F32     21.4 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 83 1832 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 120 4767 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 273 17529 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 739 59794 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1436 115771 ca21f7a

With changes mentioned in openai/whisper#978 (comment)
Thanks again @jan-grzybek-ampere

~/whisper.cpp$ extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 10.88 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      2.0 GFLOPS (128 runs) / F32      1.7 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     14.3 GFLOPS (128 runs) / F32     33.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     40.7 GFLOPS (128 runs) / F32     54.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     97.5 GFLOPS (128 runs) / F32     31.4 GFLOPS (117 runs)
ggml_mul_mat:  1024 x  1024: F16     87.1 GFLOPS ( 41 runs) / F32     41.0 GFLOPS ( 20 runs)
ggml_mul_mat:  2048 x  2048: F16     74.3 GFLOPS (  5 runs) / F32     33.4 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     50.4 GFLOPS (  3 runs) / F32     21.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 84 619 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 124 2036 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 293 5872 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 817 22064 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1446 37996 ca21f7a

I've done a bit of reading and run several more tests.

According to https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu, the recommendation is to use -mcpu=native, and I did indeed get the best performance with it.
I will put in a pull request to use -mcpu=native for aarch64.
No significant difference between GCC 11.3 and GCC 12.1 on Ubuntu 22.04.
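For illustration only, the two flag styles being compared, as plain compiler invocations (a sketch, not the actual Makefile change):

gcc -O3 -march=armv8.2-a+fp16 -c ggml.c -o ggml.o   # explicit architecture + fp16 extension
gcc -O3 -mcpu=native          -c ggml.c -o ggml.o   # let GCC pick arch and tuning from the host CPU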


-march=armv8.2-a+fp16, gcc-11.3

Performance seems slightly worse compared to yesterday's test in #89 (comment)
I re-ran all of the following tests one after another to hopefully obtain comparable figures.
This is a free instance on Oracle Cloud and perhaps others are using the other cores on the CPU.

make clean
make main bench
./extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 10.82 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      1.8 GFLOPS (128 runs) / F32      2.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     40.7 GFLOPS (128 runs) / F32     12.7 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     52.9 GFLOPS (128 runs) / F32     32.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     97.3 GFLOPS (128 runs) / F32     32.1 GFLOPS (120 runs)
ggml_mul_mat:  1024 x  1024: F16     77.0 GFLOPS ( 36 runs) / F32     35.1 GFLOPS ( 17 runs)
ggml_mul_mat:  2048 x  2048: F16     64.0 GFLOPS (  4 runs) / F32     25.9 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     45.8 GFLOPS (  3 runs) / F32     21.0 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 85 662 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 121 2039 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 281 6667 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 760 25355 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1456 45563 ca21f7a

-mcpu=native, gcc-11.3

make clean
make main bench
./extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 10.85 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      7.9 GFLOPS (128 runs) / F32      1.8 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16      7.5 GFLOPS (128 runs) / F32     12.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     51.8 GFLOPS (128 runs) / F32     54.4 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     96.3 GFLOPS (128 runs) / F32     31.2 GFLOPS (117 runs)
ggml_mul_mat:  1024 x  1024: F16     74.1 GFLOPS ( 35 runs) / F32     33.5 GFLOPS ( 16 runs)
ggml_mul_mat:  2048 x  2048: F16     67.1 GFLOPS (  4 runs) / F32     27.0 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     49.3 GFLOPS (  3 runs) / F32     21.7 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 85 655 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 121 2002 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 283 6923 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 762 24085 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1459 43846 ca21f7a

-mcpu=native, gcc-12.1

make clean
make CC=gcc-12 CXX=g++-12 main bench
./extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 11.01 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      8.0 GFLOPS (128 runs) / F32      8.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     12.0 GFLOPS (128 runs) / F32     12.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     55.7 GFLOPS (128 runs) / F32     41.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     95.1 GFLOPS (128 runs) / F32     30.2 GFLOPS (113 runs)
ggml_mul_mat:  1024 x  1024: F16     67.1 GFLOPS ( 32 runs) / F32     33.0 GFLOPS ( 16 runs)
ggml_mul_mat:  2048 x  2048: F16     64.2 GFLOPS (  4 runs) / F32     26.8 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     46.1 GFLOPS (  3 runs) / F32     21.4 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 84 613 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 122 2086 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 286 6375 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 761 24667 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1457 43826 ca21f7a
  • CPU model: AMD Ryzen 9 7950X
  • Operating system: Windows 10 Pro N 22H2
  • Compiler: Windows x64 release v1.2.1

whisper-bin-x64

>bench.exe
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   109.45 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   919.30 ms /     1 runs (  919.30 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1032.75 ms
>bench -w 1 -t 1
memcpy: 24.58 GB/s
sum:    error -536870819.000000
>bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     22.7 GFLOPS (128 runs) / F32     38.7 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     34.6 GFLOPS (128 runs) / F32     45.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     44.2 GFLOPS (128 runs) / F32     54.5 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     50.5 GFLOPS (128 runs) / F32     55.3 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     53.2 GFLOPS ( 25 runs) / F32     65.7 GFLOPS ( 31 runs)
ggml_mul_mat:  2048 x  2048: F16     54.9 GFLOPS (  4 runs) / F32     61.8 GFLOPS (  4 runs)
ggml_mul_mat:  4096 x  4096: F16     50.7 GFLOPS (  3 runs) / F32     19.9 GFLOPS (  3 runs)

That last result is lower than the 5950X above, which is odd. OpenBLAS results below:

whisper-blas-bin-x64

>bench
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   101.76 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   602.63 ms /     1 runs (  602.63 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   705.80 ms
>bench -w 1 -t 1
memcpy: 24.30 GB/s
sum:    error -536870819.000000
>bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     89.4 GFLOPS (128 runs) / F32    119.6 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     27.6 GFLOPS (128 runs) / F32     31.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    172.9 GFLOPS (128 runs) / F32    222.0 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    596.8 GFLOPS (128 runs) / F32    926.4 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1257.0 GFLOPS (128 runs) / F32   1887.7 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1726.5 GFLOPS (101 runs) / F32   2193.9 GFLOPS (128 runs)
ggml_mul_mat:  4096 x  4096: F16   2109.8 GFLOPS ( 16 runs) / F32   2237.5 GFLOPS ( 17 runs)

memcpy: 7.20 GB/s
sum: error -536870997.000000

CPU OS Config Model Th Load Enc. Commit
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 tiny 4 109 3417 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 base 4 180 7907 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 small 4 419 30899 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 medium 4 1851 106542 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 large 4 4715 203455 09e9068

memcpy: 15.57 GB/s

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat: 64 x 64: F16 6.1 GFLOPS (128 runs) / F32 6.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 40.1 GFLOPS (128 runs) / F32 38.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 147.9 GFLOPS (128 runs) / F32 110.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 264.9 GFLOPS (128 runs) / F32 134.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 289.5 GFLOPS (128 runs) / F32 151.9 GFLOPS ( 71 runs)
ggml_mul_mat: 2048 x 2048: F16 290.6 GFLOPS ( 17 runs) / F32 70.7 GFLOPS ( 5 runs)
ggml_mul_mat: 4096 x 4096: F16 114.0 GFLOPS ( 3 runs) / F32 62.7 GFLOPS ( 3 runs)

CPU OS Config Model Th Load Enc. Commit
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 tiny 8 50 361 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 base 8 70 1000 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 small 8 185 2264 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 medium 8 587 8421 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 large 8 2296 15759 09e9068

Running ggml_mul_mat benchmark with 16 threads

ggml_mul_mat: 64 x 64: F16 2.1 GFLOPS (128 runs) / F32 1.9 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 19.6 GFLOPS (128 runs) / F32 14.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 68.1 GFLOPS (128 runs) / F32 84.5 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 200.5 GFLOPS (128 runs) / F32 141.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 271.0 GFLOPS (127 runs) / F32 163.7 GFLOPS ( 77 runs)
ggml_mul_mat: 2048 x 2048: F16 205.5 GFLOPS ( 12 runs) / F32 71.6 GFLOPS ( 5 runs)
ggml_mul_mat: 4096 x 4096: F16 142.3 GFLOPS ( 3 runs) / F32 63.0 GFLOPS ( 3 runs)

CPU OS Config Model Th Load Enc. Commit
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 tiny 16 52 329 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 base 16 72 723 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 small 16 188 2214 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 medium 16 698 10889 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 large 16 1619 16640 09e9068

MacBook Pro 14" with M2 Pro

  • 10 Cores, 16GB RAM
  • macOS Ventura 13.2
  • Benchmarks running at 8 threads
CPU OS Config Model Th Load Enc. Commit
Apple M2 Pro macOS 13.2 NEON BLAS tiny 8 76 161 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS base 8 104 318 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS small 8 221 975 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS medium 8 969 2692 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS large 8 1939 4959 09e9068

NVIDIA Jetson Nano, without GPU optimization:
base-en

 ./bin/main -f samples/jfk.wav 
whisper_init_from_file_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   354.49 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   712.86 ms
whisper_print_timings:   sample time =    79.37 ms /    27 runs (    2.94 ms per run)
whisper_print_timings:   encode time = 24406.28 ms /     1 runs (24406.28 ms per run)
whisper_print_timings:   decode time =  1284.84 ms /    27 runs (   47.59 ms per run)
whisper_print_timings:    total time = 26908.31 ms

tiny-en

./bin/main -m ./models/ggml-tiny.en.bin  -f ./samples/jfk.wav 
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.740]   And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740]   ask what you can do for your country


whisper_print_timings:     load time =   204.60 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   564.90 ms
whisper_print_timings:   sample time =    72.13 ms /    26 runs (    2.77 ms per run)
whisper_print_timings:   encode time =  9232.34 ms /     1 runs ( 9232.34 ms per run)
whisper_print_timings:   decode time =   616.00 ms /    26 runs (   23.69 ms per run)
whisper_print_timings:    total time = 10745.65 ms

MacBook Pro 14" with M2 Pro
10 Cores, 32GB RAM
macOS Ventura 13.2
Benchmarks running at 8 threads
memcpy: 40.68 GB/s

| CPU          | OS     | Config     | Model    | Th | Load | Enc. | Commit  |
| ------------ | ------ | ---------- | -------- | -- | ---- | ---- | ------- |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | tiny     | 8  | 45   | 93   | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | base     | 8  | 68   | 187  | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | small    | 8  | 179  | 702  | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | medium   | 8  | 496  | 2227 | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | large    | 8  | 1037 | 3796 | 09e9068 |

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat:    64 x    64: F16      4.6 GFLOPS (128 runs) / F32      4.1 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     46.6 GFLOPS (128 runs) / F32     36.4 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    294.2 GFLOPS (128 runs) / F32    238.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    611.0 GFLOPS (128 runs) / F32    712.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    770.9 GFLOPS (128 runs) / F32    700.3 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    902.8 GFLOPS ( 53 runs) / F32    906.9 GFLOPS ( 53 runs)
ggml_mul_mat:  4096 x  4096: F16   1521.2 GFLOPS ( 12 runs) / F32   1469.3 GFLOPS ( 11 runs)

MacBook Pro 16" with M2 Max
12 Cores, 96GB RAM
macOS Ventura 13.3
Benchmarks running at 4 threads (4 threads were faster than 8 threads for ggml_mul_mat but about same for model load/encode)
memcpy: 49.94 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16     11.2 GFLOPS (128 runs) / F32      9.3 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     83.0 GFLOPS (128 runs) / F32     73.7 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    505.2 GFLOPS (128 runs) / F32    488.2 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1018.0 GFLOPS (128 runs) / F32   1196.3 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1796.2 GFLOPS (128 runs) / F32   2087.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1638.8 GFLOPS ( 96 runs) / F32   1673.7 GFLOPS ( 98 runs)
ggml_mul_mat:  4096 x  4096: F16   1995.2 GFLOPS ( 15 runs) / F32   2037.8 GFLOPS ( 15 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Apple M2 Max 13.3 NEON BLAS tiny 4 41 118 0a2d121
Apple M2 Max 13.3 NEON BLAS base 4 61 230 0a2d121
Apple M2 Max 13.3 NEON BLAS small 4 153 734 0a2d121
Apple M2 Max 13.3 NEON BLAS medium 4 448 1979 0a2d121
Apple M2 Max 13.3 NEON BLAS large 4 882 3553 0a2d121

Running memcpy benchmark with 1 thread

memcpy: 7.03 GB/s
sum: error -536870997.000000 (how do I fix this?)

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      8.9 GFLOPS (128 runs) / F32     10.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     53.3 GFLOPS (128 runs) / F32     47.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     91.7 GFLOPS (128 runs) / F32     99.4 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    134.2 GFLOPS (128 runs) / F32     94.8 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    182.9 GFLOPS ( 86 runs) / F32    121.2 GFLOPS ( 57 runs)
ggml_mul_mat:  2048 x  2048: F16    180.0 GFLOPS ( 11 runs) / F32     42.4 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     59.1 GFLOPS (  3 runs) / F32     31.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 tiny 4 69 495 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 base 4 111 1128 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 small 4 264 3992 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 medium 4 806 12230 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 large 4 1919 25574 0a2d121

memcpy: 9.49 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 8.8 GFLOPS (128 runs) / F32 10.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 35.4 GFLOPS (128 runs) / F32 49.2 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 61.9 GFLOPS (128 runs) / F32 95.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 64.3 GFLOPS (128 runs) / F32 86.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 74.4 GFLOPS ( 35 runs) / F32 39.9 GFLOPS ( 19 runs)
ggml_mul_mat: 2048 x 2048: F16 56.9 GFLOPS ( 4 runs) / F32 31.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 56.9 GFLOPS ( 3 runs) / F32 30.1 GFLOPS ( 3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 tiny 4 67 761 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 base 4 96 2040 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 small 4 239 7639 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 medium 4 657 23735 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 large 4 1302 45006 0a2d121

HP Z440, Xeon E5-2690v4, 64 GB, Rocky Linux 9.1

memcpy: 10.94 GB/s
sum: error -536870997.000000

./bench -w 2
ggml_mul_mat: 64 x 64: F16 4.8 GFLOPS (128 runs) / F32 4.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 23.1 GFLOPS (128 runs) / F32 18.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 52.5 GFLOPS (128 runs) / F32 35.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 69.6 GFLOPS (128 runs) / F32 44.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 78.8 GFLOPS ( 37 runs) / F32 49.2 GFLOPS ( 23 runs)
ggml_mul_mat: 2048 x 2048: F16 83.6 GFLOPS ( 5 runs) / F32 50.8 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 64.5 GFLOPS ( 3 runs) / F32 21.8 GFLOPS ( 3 runs)

system_info: n_threads = 28 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

whisper_print_timings: load time = 1031.43 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 13121.63 ms / 1 runs (13121.63 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 14219.33 ms

model: large

very impressed

CPU OS Config Model Th Load Enc. Commit
MacBook M1 Max macOS 13.0 beta (22A5321d) NEON BLAS medium 8 488 2344 0a2d121
MacBook M1 Max macOS 13.0 beta (22A5321d) NEON BLAS large 8 1070 3209 0a2d121

What am I doing wrong? 17.6 GFLOPS on a Ryzen 6850H

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:  -lopenblas
I CC:       cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX:      g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0

make: 'bench' is up to date.
ggml_mul_mat:    64 x    64: F16     12.6 GFLOPS (128 runs) / F32      9.8 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     19.4 GFLOPS (128 runs) / F32     12.5 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     27.0 GFLOPS (128 runs) / F32     18.4 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     50.3 GFLOPS (128 runs) / F32     28.1 GFLOPS (105 runs)
ggml_mul_mat:  1024 x  1024: F16     59.0 GFLOPS ( 28 runs) / F32     27.0 GFLOPS ( 13 runs)
ggml_mul_mat:  2048 x  2048: F16     43.0 GFLOPS (  3 runs) / F32     11.4 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     17.6 GFLOPS (  3 runs) / F32      6.6 GFLOPS (  3 runs)

MacBook Pro M2 Max 96 GB 16-inch, 2023 13.3.1 (22E261)

I tried running 8 and 12 threads; they were a few ms slower than 4 threads, so the default of 4 threads seems to be the sweet spot.
I also have not compiled anything Apple-specific, just git clone and make.

> ./extra/bench-all.sh 8
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 50.22 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat: 64 x 64: F16 5.0 GFLOPS (128 runs) / F32 4.7 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 46.1 GFLOPS (128 runs) / F32 38.3 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 294.0 GFLOPS (128 runs) / F32 243.7 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 574.5 GFLOPS (128 runs) / F32 272.9 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 736.6 GFLOPS (128 runs) / F32 750.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 973.7 GFLOPS ( 57 runs) / F32 993.7 GFLOPS ( 58 runs)
ggml_mul_mat: 4096 x 4096: F16 1554.5 GFLOPS ( 12 runs) / F32 1553.6 GFLOPS ( 12 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
NEON BLAS tiny 8 40 101 c23588c
NEON BLAS base 8 61 223 c23588c
NEON BLAS small 8 154 961 c23588c
NEON BLAS medium 8 436 2534 c23588c
NEON BLAS large 8 867 4100 c23588c

Same hardware as in the post above. I've just tried converting the models to Core ML, and here are the results. My personal impression of running STT with them was very good: much faster.
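For context, the Core ML path is roughly the one documented in the repo (a sketch; script name and build flag as I remember them, so treat them as assumptions):

# convert a ggml model to a Core ML encoder (needs the Python coremltools environment)
./models/generate-coreml-model.sh base.en

# rebuild with Core ML support enabled
make clean
WHISPER_COREML=1 make -j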


./extra/bench-all.sh 4
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 49.33 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 9.1 GFLOPS (128 runs) / F32 8.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 70.7 GFLOPS (128 runs) / F32 77.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 350.7 GFLOPS (128 runs) / F32 435.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1060.0 GFLOPS (128 runs) / F32 1254.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1611.0 GFLOPS (128 runs) / F32 1652.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1887.2 GFLOPS (110 runs) / F32 1900.9 GFLOPS (111 runs)
ggml_mul_mat: 4096 x 4096: F16 1806.0 GFLOPS ( 14 runs) / F32 1849.3 GFLOPS ( 14 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
NEON BLAS COREML tiny 4 42 30 c23588c
NEON BLAS COREML base 4 60 49 c23588c
NEON BLAS COREML small 4 151 169 c23588c
NEON BLAS COREML medium 4 430 737 c23588c
NEON BLAS COREML large 4 885 1672 c23588c

Dell 3050 Micro
Running memcpy benchmark with 1 thread
memcpy: 11.49 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 7.7 GFLOPS (128 runs) / F32 3.3 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 27.7 GFLOPS (128 runs) / F32 7.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 50.8 GFLOPS (128 runs) / F32 8.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 59.4 GFLOPS (128 runs) / F32 9.0 GFLOPS ( 34 runs)
ggml_mul_mat: 1024 x 1024: F16 51.5 GFLOPS ( 24 runs) / F32 8.4 GFLOPS ( 4 runs)
ggml_mul_mat: 2048 x 2048: F16 46.3 GFLOPS ( 3 runs) / F32 8.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 47.3 GFLOPS ( 3 runs) / F32 8.1 GFLOPS ( 3 runs)

CPU OS Config Model Th Load Enc. Commit
i3-7100t Ubuntu 22.04 AVX2 tiny 4 84 1125 c23588c
i3-7100t Ubuntu 22.04 AVX2 base 4 128 2616 c23588c
i3-7100t Ubuntu 22.04 AVX2 small 4 339 10127 c23588c
i3-7100t Ubuntu 22.04 AVX2 medium 4 991 39383 c23588c
i3-7100t Ubuntu 22.04 AVX2 large 4 2922 74488 c23588c
j1nx commented

Lenovo ThinkCentre M720q

Running memcpy benchmark with 1 thread

memcpy: 6.54 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 8.6 GFLOPS (128 runs) / F32 4.5 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 38.8 GFLOPS (128 runs) / F32 7.9 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 76.2 GFLOPS (128 runs) / F32 9.6 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 87.4 GFLOPS (128 runs) / F32 10.0 GFLOPS ( 38 runs)
ggml_mul_mat: 1024 x 1024: F16 89.7 GFLOPS ( 42 runs) / F32 10.1 GFLOPS ( 5 runs)
ggml_mul_mat: 2048 x 2048: F16 67.7 GFLOPS ( 4 runs) / F32 9.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 54.7 GFLOPS ( 3 runs) / F32 8.6 GFLOPS ( 3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
i5-8500T OpenVoiceOS AVX2 tiny.en 4 79 686 70567ef
i5-8500T OpenVoiceOS AVX2 base.en 4 121 1600 70567ef
i5-8500T OpenVoiceOS AVX2 small.en 4 320 6197 70567ef
i5-8500T OpenVoiceOS AVX2 medium.en 4 928 20276 70567ef

Running memcpy benchmark with 1 thread

memcpy: 7.16 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 6 threads

ggml_mul_mat: 64 x 64: F16 1.9 GFLOPS (128 runs) / F32 1.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 29.7 GFLOPS (128 runs) / F32 7.3 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 65.5 GFLOPS (128 runs) / F32 14.5 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 123.4 GFLOPS (128 runs) / F32 15.2 GFLOPS ( 57 runs)
ggml_mul_mat: 1024 x 1024: F16 127.5 GFLOPS ( 60 runs) / F32 14.7 GFLOPS ( 7 runs)
ggml_mul_mat: 2048 x 2048: F16 93.3 GFLOPS ( 6 runs) / F32 13.3 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 70.0 GFLOPS ( 3 runs) / F32 12.5 GFLOPS ( 3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
i5-8500T OpenVoiceOS AVX2 tiny.en 6 78 511 70567ef
i5-8500T OpenVoiceOS AVX2 base.en 6 118 1264 70567ef
i5-8500T OpenVoiceOS AVX2 small.en 6 320 4587 70567ef
i5-8500T OpenVoiceOS AVX2 medium.en 6 928 16303 70567ef

Yet another M1 Ultra, but look at the bottom for a comparison to the Const-me GPU version:
memcpy: 42.66 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 9.1 GFLOPS (128 runs) / F32 7.1 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 68.2 GFLOPS (128 runs) / F32 68.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 465.0 GFLOPS (128 runs) / F32 386.2 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1131.9 GFLOPS (128 runs) / F32 1437.0 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 2188.9 GFLOPS (128 runs) / F32 2519.6 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 2938.8 GFLOPS (128 runs) / F32 2996.5 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16 3074.7 GFLOPS ( 23 runs) / F32 3167.2 GFLOPS ( 24 runs)

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------- |
| M1 Ultra | Ventura 13.3.1 | NEON BLAS | large | 4 | 858 | 3649 | 70567ef |

Much more interesting, I find, is the comparison I did against a Win10 Core i9 9900K with an Nvidia A4000 using the Const-me version. I used a 10-minute portion of a "real" TV show (-l de, about 56k tokens known in the model). Note that the power consumption was actually measured, not just estimated.

Const-me Whisper GPU (~450-550 W real power consumption at 100% GPU utilisation; the CPU is mostly idle)
A4000 1x parallel 93s
A4000 2x parallel both finish at 180s
A4000 4x parallel 3 finish after 317s, 1 finishes at 453s

macOS, M1 Ultra (70-90 W real power consumption at 100% "CPU" utilisation)
whisper.cpp: default settings, 1 core, 4 threads
Macos 1x : 155 s
Macos 2x parallel: 196 s - all finish at same time
Macos 4x parallel: 274s - all finish at same time
Macos 6x parallel: 462s - all finish at same time

Also some other tests with different command-line params, on the M1 only, with 1 file (a full example invocation is sketched after the timings below):
-p8 (threads default 4) - system unresponsive while processing
120.3 seconds

-p 4 (default threads 4, ~80% cpu utilisation)
79.37545

-bs 2 -p 4
101.01730

-t 16 threads (processors default 1)
148.713

-p 8 -t 2
98.91152
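Putting those flags together, a hedged example invocation (file name and model choice are placeholders; -p is processors, -t threads, -bs beam size):

./main -m models/ggml-large.bin -f show-10min.wav -l de -p 4 -t 4 -bs 2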

We currently use the Const-me GPU version on an Nvidia A5000 because, on an Intel CPU, it delivers much faster results than this cpp version can. It also looks like the Const-me version is not going anywhere development-wise, while this repository is vibrant.

In conclusion, even though I hate it, we are buying this Mac because it delivers faster results and more throughput while consuming only about 20% of the power. It also distributes processing power much better between multiple parallel processes; I bet I can even use nice to set priorities, whereas on the GPU no priorities are possible at all.

At our usage level, that means the Mac (~4000 euros) pays for itself after 2-3 years of operation (due to lower power and A/C costs) compared to running on the Windows/GPU box, which we bought for about the same initial price. Even if I could now safely say we don't need an A5000, just some gamer card for 600 euros, looking at power costs these days I'd still prefer the Mac. (Thank god I don't need to put it into Active Directory or the like, so I can simply use it as a dedicated processing machine.)

It would be great if idle/peak watts could be posted, as I have been posting benches for RK3588 devices, which probably give the minimum usable results and even then are a tad slow.
In that price range I just posted an i3-7100T that was picked up for £64 off eBay, which is approx 8 watts idle / 30 peak.
I used to be a bit of an Apple hater in terms of bling tech, but bang for buck the M1 Mini is surprisingly good value, and with race-to-idle it could likely process quite a number of zones, especially because of diversification of use.

I am on disability, so even though it's relatively cheap, the £849.00 for the 16 GB version could probably be the basis of the ultimate home assistant, something similar to https://github.com/ggerganov/whisper.cpp/blob/master/examples/talk-llama/talk-llama.cpp
So likely I will continue posting in the £64 range :)

But what Apple/Arm provide per watt currently is pretty special, and for 24/365 operation in an energy-expensive world that is pretty important.
Dunno how many people could post idle & peak wattages too, but it would be really interesting, especially for CPU vs GPU rather than just outright speed.

Rock 5b

memcpy: 8.78 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.2 GFLOPS (128 runs) | Q4_1     7.6 GFLOPS (128 runs) | Q4_2     6.9 GFLOPS (128 runs)
  64 x   64: Q5_0     6.8 GFLOPS (128 runs) | Q5_1     7.0 GFLOPS (128 runs) | Q8_0     7.1 GFLOPS (128 runs)
  64 x   64: F16      8.6 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 128 x  128: Q4_0    22.8 GFLOPS (128 runs) | Q4_1    22.4 GFLOPS (128 runs) | Q4_2    19.6 GFLOPS (128 runs)
 128 x  128: Q5_0    19.5 GFLOPS (128 runs) | Q5_1    20.7 GFLOPS (128 runs) | Q8_0    22.7 GFLOPS (128 runs)
 128 x  128: F16     28.3 GFLOPS (128 runs) | F32     29.4 GFLOPS (128 runs)
 256 x  256: Q4_0    40.6 GFLOPS (128 runs) | Q4_1    37.6 GFLOPS (128 runs) | Q4_2    30.5 GFLOPS (128 runs)
 256 x  256: Q5_0    31.2 GFLOPS (128 runs) | Q5_1    31.9 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     51.8 GFLOPS (128 runs) | F32     36.9 GFLOPS (128 runs)
 512 x  512: Q4_0    52.0 GFLOPS (128 runs) | Q4_1    45.4 GFLOPS (128 runs) | Q4_2    35.7 GFLOPS (128 runs)
 512 x  512: Q5_0    37.4 GFLOPS (128 runs) | Q5_1    36.9 GFLOPS (128 runs) | Q8_0    64.9 GFLOPS (128 runs)
 512 x  512: F16     76.9 GFLOPS (128 runs) | F32     30.7 GFLOPS (115 runs)
1024 x 1024: Q4_0    56.6 GFLOPS ( 27 runs) | Q4_1    47.5 GFLOPS ( 23 runs) | Q4_2    37.5 GFLOPS ( 18 runs)
1024 x 1024: Q5_0    39.5 GFLOPS ( 19 runs) | Q5_1    37.7 GFLOPS ( 18 runs) | Q8_0    71.1 GFLOPS ( 34 runs)
1024 x 1024: F16     49.0 GFLOPS ( 23 runs) | F32     22.4 GFLOPS ( 11 runs)
2048 x 2048: Q4_0    54.2 GFLOPS (  4 runs) | Q4_1    44.6 GFLOPS (  3 runs) | Q4_2    38.5 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.4 GFLOPS (  3 runs) | Q5_1    35.5 GFLOPS (  3 runs) | Q8_0    61.0 GFLOPS (  4 runs)
2048 x 2048: F16     41.3 GFLOPS (  3 runs) | F32     19.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    56.2 GFLOPS (  3 runs) | Q4_1    45.4 GFLOPS (  3 runs) | Q4_2    38.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.7 GFLOPS (  3 runs) | Q5_1    37.3 GFLOPS (  3 runs) | Q8_0    63.2 GFLOPS (  3 runs)
4096 x 4096: F16     40.0 GFLOPS (  3 runs) | F32     17.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | tiny | 4 | 102 | 1191 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | base | 4 | 140 | 2861 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | small | 4 | 393 | 10576 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | medium | 4 | 10289 | 36042 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | large | 4 | 2099 | 70740 | be5911a |

How do you get these numbers, @StuartIanNaylor? 😲
Isn't the Rock 5b basically the same as the Orange Pi 5?

Orange Pi 5 8GB:

Running memcpy benchmark

memcpy: 10.14 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     4.7 GFLOPS (128 runs) | Q4_1     4.8 GFLOPS (128 runs) | Q4_2     4.6 GFLOPS (128 runs)
  64 x   64: Q5_0     4.2 GFLOPS (128 runs) | Q5_1     4.4 GFLOPS (128 runs) | Q8_0     4.4 GFLOPS (128 runs)
  64 x   64: F16      4.8 GFLOPS (128 runs) | F32      4.4 GFLOPS (128 runs)
 128 x  128: Q4_0     4.2 GFLOPS (128 runs) | Q4_1     9.8 GFLOPS (128 runs) | Q4_2    10.0 GFLOPS (128 runs)
 128 x  128: Q5_0     8.4 GFLOPS (128 runs) | Q5_1     8.2 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 128 x  128: F16     10.3 GFLOPS (128 runs) | F32     10.7 GFLOPS (128 runs)
 256 x  256: Q4_0    34.7 GFLOPS (128 runs) | Q4_1    34.9 GFLOPS (128 runs) | Q4_2    33.9 GFLOPS (128 runs)
 256 x  256: Q5_0    26.2 GFLOPS (128 runs) | Q5_1    24.9 GFLOPS (128 runs) | Q8_0    36.1 GFLOPS (128 runs)
 256 x  256: F16     36.4 GFLOPS (128 runs) | F32     38.4 GFLOPS (128 runs)
 512 x  512: Q4_0    22.2 GFLOPS ( 83 runs) | Q4_1    26.1 GFLOPS ( 98 runs) | Q4_2    35.5 GFLOPS (128 runs)
 512 x  512: Q5_0    42.4 GFLOPS (128 runs) | Q5_1    26.8 GFLOPS (100 runs) | Q8_0    35.8 GFLOPS (128 runs)
 512 x  512: F16     21.6 GFLOPS ( 81 runs) | F32     31.5 GFLOPS (118 runs)
1024 x 1024: Q4_0    32.4 GFLOPS ( 16 runs) | Q4_1    44.1 GFLOPS ( 21 runs) | Q4_2    39.7 GFLOPS ( 19 runs)
1024 x 1024: Q5_0    42.3 GFLOPS ( 20 runs) | Q5_1    40.4 GFLOPS ( 20 runs) | Q8_0    41.2 GFLOPS ( 20 runs)
1024 x 1024: F16     46.8 GFLOPS ( 22 runs) | F32     42.1 GFLOPS ( 20 runs)
2048 x 2048: Q4_0    50.9 GFLOPS (  4 runs) | Q4_1    48.6 GFLOPS (  3 runs) | Q4_2    48.0 GFLOPS (  3 runs)
2048 x 2048: Q5_0    46.7 GFLOPS (  3 runs) | Q5_1    47.8 GFLOPS (  3 runs) | Q8_0    46.4 GFLOPS (  3 runs)
2048 x 2048: F16     46.1 GFLOPS (  3 runs) | F32     44.8 GFLOPS (  3 runs)
4096 x 4096: Q4_0    42.2 GFLOPS (  3 runs) | Q4_1    36.7 GFLOPS (  3 runs) | Q4_2    33.0 GFLOPS (  3 runs)
4096 x 4096: Q5_0    38.5 GFLOPS (  3 runs) | Q5_1    44.7 GFLOPS (  3 runs) | Q8_0    44.7 GFLOPS (  3 runs)
4096 x 4096: F16     44.4 GFLOPS (  3 runs) | F32     44.5 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
RK3588S Armbian 11 - 5.10.110 NEON BLAS tiny 4 193 3748 be5911a
RK3588S Armbian 11 - 5.10.110 NEON BLAS tiny-q5_0 4 156 3341 be5911a
RK3588S Armbian 11 - 5.10.110 NEON BLAS base 4 253 7359 be5911a
RK3588S Armbian 11 - 5.10.110 NEON BLAS base-q5_0 4 178 7307 be5911a

[EDIT: a bit better without OpenBLAS although the GFLOPS are considerably lower O_o]

CPU OS Config Model Th Load Enc. Commit
RK3588S Armbian 11 - 5.10.110 NEON tiny 4 111 3170 be5911a
RK3588S Armbian 11 - 5.10.110 NEON tiny-q5_0 4 205 2817 be5911a
RK3588S Armbian 11 - 5.10.110 NEON base 4 248 6385 be5911a
RK3588S Armbian 11 - 5.10.110 NEON base-q5_0 4 140 6198 be5911a

[EDIT2: getting very unstable results right now 🤔 ]

CPU OS Config Model Th Load Enc. Commit
RK3588S Armbian 11 - 5.10.110 NEON tiny 4 269 1722 be5911a
RK3588S Armbian 11 - 5.10.110 NEON tiny-q5_0 4 104 2746 be5911a
RK3588S Armbian 11 - 5.10.110 NEON base 4 243 7063 be5911a
RK3588S Armbian 11 - 5.10.110 NEON base-q5_0 4 135 6516 be5911a

I don't use Armbian, though; I use the server image supplied by Radxa, and likewise the one from Orange Pi.
Generally I stay clear of Armbian due to a pet hate of their sprawling init scripts that replace standard installs and /etc and often blindside me.

I'll add some tips and tricks I gathered when Radxa did the community board bring-up.
I have changed my preference for the CPU governor and set it to performance, and, although I don't know why, using taskset to make sure it only uses the big cores gives a slight perf boost (commands further down).

So running again I get

memcpy: 8.56 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.3 GFLOPS (128 runs) | Q4_1     7.8 GFLOPS (128 runs) | Q4_2     6.9 GFLOPS (128 runs)
  64 x   64: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0     7.0 GFLOPS (128 runs)
  64 x   64: F16      2.4 GFLOPS (128 runs) | F32      8.5 GFLOPS (128 runs)
 128 x  128: Q4_0    23.2 GFLOPS (128 runs) | Q4_1    24.1 GFLOPS (128 runs) | Q4_2    19.9 GFLOPS (128 runs)
 128 x  128: Q5_0    15.4 GFLOPS (128 runs) | Q5_1    21.0 GFLOPS (128 runs) | Q8_0    26.6 GFLOPS (128 runs)
 128 x  128: F16     35.0 GFLOPS (128 runs) | F32     28.6 GFLOPS (128 runs)
 256 x  256: Q4_0    41.2 GFLOPS (128 runs) | Q4_1    38.7 GFLOPS (128 runs) | Q4_2    30.5 GFLOPS (128 runs)
 256 x  256: Q5_0    31.2 GFLOPS (128 runs) | Q5_1    31.9 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     65.0 GFLOPS (128 runs) | F32     43.5 GFLOPS (128 runs)
 512 x  512: Q4_0    52.0 GFLOPS (128 runs) | Q4_1    45.4 GFLOPS (128 runs) | Q4_2    35.3 GFLOPS (128 runs)
 512 x  512: Q5_0    37.4 GFLOPS (128 runs) | Q5_1    36.8 GFLOPS (128 runs) | Q8_0    64.9 GFLOPS (128 runs)
 512 x  512: F16     78.1 GFLOPS (128 runs) | F32     30.6 GFLOPS (114 runs)
1024 x 1024: Q4_0    56.4 GFLOPS ( 27 runs) | Q4_1    47.4 GFLOPS ( 23 runs) | Q4_2    37.5 GFLOPS ( 18 runs)
1024 x 1024: Q5_0    39.5 GFLOPS ( 19 runs) | Q5_1    37.7 GFLOPS ( 18 runs) | Q8_0    70.8 GFLOPS ( 33 runs)
1024 x 1024: F16     47.2 GFLOPS ( 22 runs) | F32     21.8 GFLOPS ( 11 runs)
2048 x 2048: Q4_0    54.4 GFLOPS (  4 runs) | Q4_1    45.3 GFLOPS (  3 runs) | Q4_2    38.6 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.4 GFLOPS (  3 runs) | Q5_1    35.6 GFLOPS (  3 runs) | Q8_0    59.8 GFLOPS (  4 runs)
2048 x 2048: F16     41.2 GFLOPS (  3 runs) | F32     20.6 GFLOPS (  3 runs)
4096 x 4096: Q4_0    56.9 GFLOPS (  3 runs) | Q4_1    46.6 GFLOPS (  3 runs) | Q4_2    38.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    41.1 GFLOPS (  3 runs) | Q5_1    37.4 GFLOPS (  3 runs) | Q8_0    62.9 GFLOPS (  3 runs)
4096 x 4096: F16     39.8 GFLOPS (  3 runs) | F32     17.6 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 96 | 1199 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 137 | 2875 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 343 | 10635 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 1013 | 35174 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 2019 | 71678 | be5911a |

If I run without first setting echo performance | tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor /sys/class/devfreq/dmc/governor (the rk3588[x] is a tri-cluster 4-2-2 part; I'm not sure the dmc governor matters, but it was something we were using at the time), the numbers drop.
Prefix (taskset -c 4-7) to further enforce not using the efficiency cores. Both steps are sketched below.
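For reference, a minimal sketch of the two steps as I run them (paths as on my rk3588 boards; cpu[046] hits the first core of each cluster, the dmc entry may not exist on every kernel, and the tee needs root):

# set the performance governor on the CPU clusters and the memory controller
echo performance | tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor /sys/class/devfreq/dmc/governor
# pin the benchmark to the big cores so the efficiency cores are left idle
taskset -c 4-7 ./extra/bench-all.sh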

The ondemand governor seems to load-balance, whereas for Whisper.cpp at least, a race-till-idle setup more like how Android is configured does seem to give a perf boost with little, if any, loss in efficiency.

Without those settings, bench gives

memcpy: 7.82 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.1 GFLOPS (128 runs) | Q4_1     2.8 GFLOPS (128 runs) | Q4_2     2.4 GFLOPS (128 runs)
  64 x   64: Q5_0     2.3 GFLOPS (128 runs) | Q5_1     2.2 GFLOPS (128 runs) | Q8_0     2.7 GFLOPS (128 runs)
  64 x   64: F16      3.1 GFLOPS (128 runs) | F32      2.6 GFLOPS (128 runs)
 128 x  128: Q4_0     7.1 GFLOPS (128 runs) | Q4_1     7.0 GFLOPS (128 runs) | Q4_2     6.2 GFLOPS (128 runs)
 128 x  128: Q5_0     5.4 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     7.2 GFLOPS (128 runs)
 128 x  128: F16      9.3 GFLOPS (128 runs) | F32      5.9 GFLOPS (128 runs)
 256 x  256: Q4_0    10.1 GFLOPS (128 runs) | Q4_1     9.5 GFLOPS (128 runs) | Q4_2     8.4 GFLOPS (128 runs)
 256 x  256: Q5_0     7.4 GFLOPS (128 runs) | Q5_1     6.9 GFLOPS (128 runs) | Q8_0    10.9 GFLOPS (128 runs)
 256 x  256: F16     13.4 GFLOPS (128 runs) | F32      7.9 GFLOPS (128 runs)
 512 x  512: Q4_0    10.9 GFLOPS ( 41 runs) | Q4_1    10.4 GFLOPS ( 39 runs) | Q4_2     8.5 GFLOPS ( 32 runs)
 512 x  512: Q5_0     8.9 GFLOPS ( 34 runs) | Q5_1     8.2 GFLOPS ( 31 runs) | Q8_0    12.1 GFLOPS ( 46 runs)
 512 x  512: F16     14.5 GFLOPS ( 54 runs) | F32      8.7 GFLOPS ( 33 runs)
1024 x 1024: Q4_0    26.9 GFLOPS ( 13 runs) | Q4_1    24.9 GFLOPS ( 12 runs) | Q4_2    21.7 GFLOPS ( 11 runs)
1024 x 1024: Q5_0    23.0 GFLOPS ( 11 runs) | Q5_1    22.0 GFLOPS ( 11 runs) | Q8_0    29.1 GFLOPS ( 14 runs)
1024 x 1024: F16     28.2 GFLOPS ( 14 runs) | F32     17.9 GFLOPS (  9 runs)
2048 x 2048: Q4_0    50.1 GFLOPS (  3 runs) | Q4_1    41.3 GFLOPS (  3 runs) | Q4_2    36.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    36.0 GFLOPS (  3 runs) | Q5_1    33.2 GFLOPS (  3 runs) | Q8_0    53.7 GFLOPS (  4 runs)
2048 x 2048: F16     37.5 GFLOPS (  3 runs) | F32     19.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    55.7 GFLOPS (  3 runs) | Q4_1    43.7 GFLOPS (  3 runs) | Q4_2    39.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.5 GFLOPS (  3 runs) | Q5_1    36.1 GFLOPS (  3 runs) | Q8_0    65.8 GFLOPS (  3 runs)
4096 x 4096: F16     36.8 GFLOPS (  3 runs) | F32     18.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 171 | 1817 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 255 | 3529 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 433 | 11208 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 1814 | 36829 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 36647 | 71393 | be5911a |

I will tack on the OPi 5 next, as I think it is a smidge faster.
So, without the governor/taskset tweaks again:

memcpy: 8.26 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.1 GFLOPS (128 runs) | Q4_1     3.3 GFLOPS (128 runs) | Q4_2     3.4 GFLOPS (128 runs)
  64 x   64: Q5_0     1.7 GFLOPS (128 runs) | Q5_1     3.1 GFLOPS (128 runs) | Q8_0     2.9 GFLOPS (128 runs)
  64 x   64: F16      4.0 GFLOPS (128 runs) | F32      3.5 GFLOPS (128 runs)
 128 x  128: Q4_0     7.8 GFLOPS (128 runs) | Q4_1     6.6 GFLOPS (128 runs) | Q4_2     6.7 GFLOPS (128 runs)
 128 x  128: Q5_0     5.6 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     8.7 GFLOPS (128 runs)
 128 x  128: F16     10.1 GFLOPS (128 runs) | F32      6.3 GFLOPS (128 runs)
 256 x  256: Q4_0    10.5 GFLOPS (128 runs) | Q4_1     9.1 GFLOPS (128 runs) | Q4_2     7.9 GFLOPS (128 runs)
 256 x  256: Q5_0     7.0 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0    12.6 GFLOPS (128 runs)
 256 x  256: F16     12.6 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 512 x  512: Q4_0    11.9 GFLOPS ( 45 runs) | Q4_1    10.8 GFLOPS ( 41 runs) | Q4_2    10.0 GFLOPS ( 38 runs)
 512 x  512: Q5_0     8.5 GFLOPS ( 32 runs) | Q5_1     7.9 GFLOPS ( 30 runs) | Q8_0    14.5 GFLOPS ( 54 runs)
 512 x  512: F16     14.2 GFLOPS ( 53 runs) | F32      8.3 GFLOPS ( 32 runs)
1024 x 1024: Q4_0    30.4 GFLOPS ( 15 runs) | Q4_1    28.9 GFLOPS ( 14 runs) | Q4_2    23.6 GFLOPS ( 11 runs)
1024 x 1024: Q5_0    23.0 GFLOPS ( 11 runs) | Q5_1    23.5 GFLOPS ( 12 runs) | Q8_0    37.4 GFLOPS ( 18 runs)
1024 x 1024: F16     33.9 GFLOPS ( 16 runs) | F32     18.0 GFLOPS (  9 runs)
2048 x 2048: Q4_0    51.4 GFLOPS (  4 runs) | Q4_1    42.5 GFLOPS (  3 runs) | Q4_2    36.5 GFLOPS (  3 runs)
2048 x 2048: Q5_0    36.0 GFLOPS (  3 runs) | Q5_1    32.7 GFLOPS (  3 runs) | Q8_0    59.0 GFLOPS (  4 runs)
2048 x 2048: F16     39.4 GFLOPS (  3 runs) | F32     17.5 GFLOPS (  3 runs)
4096 x 4096: Q4_0    58.8 GFLOPS (  3 runs) | Q4_1    47.0 GFLOPS (  3 runs) | Q4_2    39.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.8 GFLOPS (  3 runs) | Q5_1    37.3 GFLOPS (  3 runs) | Q8_0    65.1 GFLOPS (  3 runs)
4096 x 4096: F16     40.6 GFLOPS (  3 runs) | F32     18.6 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 133 | 1235 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 232 | 2941 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 470 | 10870 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 23195 | 36162 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 46511 | 90187 | be5911a |

Then, having set the performance governor via sudo orangepi-config (no dmc entry on this image), I ran:
taskset -c 4-7 ./extra/bench-all.sh

memcpy: 8.22 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     0.7 GFLOPS (128 runs) | Q4_1     1.6 GFLOPS (128 runs) | Q4_2     1.0 GFLOPS (128 runs)
  64 x   64: Q5_0     0.6 GFLOPS (128 runs) | Q5_1     0.8 GFLOPS (128 runs) | Q8_0     1.4 GFLOPS (128 runs)
  64 x   64: F16      1.9 GFLOPS (128 runs) | F32      0.8 GFLOPS (128 runs)
 128 x  128: Q4_0     8.9 GFLOPS (128 runs) | Q4_1     3.8 GFLOPS (128 runs) | Q4_2     3.1 GFLOPS (128 runs)
 128 x  128: Q5_0     5.8 GFLOPS (128 runs) | Q5_1     3.8 GFLOPS (128 runs) | Q8_0     7.8 GFLOPS (128 runs)
 128 x  128: F16      5.2 GFLOPS (128 runs) | F32      3.6 GFLOPS (128 runs)
 256 x  256: Q4_0    13.1 GFLOPS (128 runs) | Q4_1    12.1 GFLOPS (128 runs) | Q4_2    12.1 GFLOPS (128 runs)
 256 x  256: Q5_0    12.8 GFLOPS (128 runs) | Q5_1    13.4 GFLOPS (128 runs) | Q8_0    17.9 GFLOPS (128 runs)
 256 x  256: F16     17.6 GFLOPS (128 runs) | F32     11.0 GFLOPS (128 runs)
 512 x  512: Q4_0    33.3 GFLOPS (125 runs) | Q4_1    34.7 GFLOPS (128 runs) | Q4_2    21.9 GFLOPS ( 82 runs)
 512 x  512: Q5_0    21.4 GFLOPS ( 80 runs) | Q5_1    22.4 GFLOPS ( 84 runs) | Q8_0    35.2 GFLOPS (128 runs)
 512 x  512: F16     37.1 GFLOPS (128 runs) | F32     23.2 GFLOPS ( 87 runs)
1024 x 1024: Q4_0    54.9 GFLOPS ( 26 runs) | Q4_1    44.3 GFLOPS ( 21 runs) | Q4_2    31.4 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    35.7 GFLOPS ( 17 runs) | Q5_1    32.1 GFLOPS ( 15 runs) | Q8_0    66.5 GFLOPS ( 31 runs)
1024 x 1024: F16     45.0 GFLOPS ( 21 runs) | F32     19.6 GFLOPS ( 10 runs)
2048 x 2048: Q4_0    54.6 GFLOPS (  4 runs) | Q4_1    45.2 GFLOPS (  3 runs) | Q4_2    38.4 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.9 GFLOPS (  3 runs) | Q5_1    34.7 GFLOPS (  3 runs) | Q8_0    59.9 GFLOPS (  4 runs)
2048 x 2048: F16     40.5 GFLOPS (  3 runs) | F32     20.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    59.5 GFLOPS (  3 runs) | Q4_1    47.7 GFLOPS (  3 runs) | Q4_2    40.1 GFLOPS (  3 runs)
4096 x 4096: Q5_0    42.7 GFLOPS (  3 runs) | Q5_1    39.6 GFLOPS (  3 runs) | Q8_0    60.7 GFLOPS (  3 runs)
4096 x 4096: F16     35.5 GFLOPS (  3 runs) | F32     20.8 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 119 | 1178 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 168 | 2910 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 399 | 10784 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 23469 | 35952 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 47147 | 76405 | be5911a |

I ran that again, as I think transformers do bounce around a bit before ending up with the same tokens.

memcpy: 9.46 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.1 GFLOPS (128 runs) | Q4_1     7.6 GFLOPS (128 runs) | Q4_2     6.6 GFLOPS (128 runs)
  64 x   64: Q5_0     6.3 GFLOPS (128 runs) | Q5_1     6.9 GFLOPS (128 runs) | Q8_0     6.6 GFLOPS (128 runs)
  64 x   64: F16      7.8 GFLOPS (128 runs) | F32      7.3 GFLOPS (128 runs)
 128 x  128: Q4_0    23.8 GFLOPS (128 runs) | Q4_1    25.0 GFLOPS (128 runs) | Q4_2     8.5 GFLOPS (128 runs)
 128 x  128: Q5_0    19.1 GFLOPS (128 runs) | Q5_1    20.8 GFLOPS (128 runs) | Q8_0    26.4 GFLOPS (128 runs)
 128 x  128: F16     34.8 GFLOPS (128 runs) | F32     28.6 GFLOPS (128 runs)
 256 x  256: Q4_0    43.4 GFLOPS (128 runs) | Q4_1    42.0 GFLOPS (128 runs) | Q4_2    31.3 GFLOPS (128 runs)
 256 x  256: Q5_0    30.5 GFLOPS (128 runs) | Q5_1    32.0 GFLOPS (128 runs) | Q8_0    41.7 GFLOPS (128 runs)
 256 x  256: F16     60.0 GFLOPS (128 runs) | F32     42.9 GFLOPS (128 runs)
 512 x  512: Q4_0    56.5 GFLOPS (128 runs) | Q4_1    49.5 GFLOPS (128 runs) | Q4_2    36.6 GFLOPS (128 runs)
 512 x  512: Q5_0    36.7 GFLOPS (128 runs) | Q5_1    36.8 GFLOPS (128 runs) | Q8_0    69.9 GFLOPS (128 runs)
 512 x  512: F16     78.5 GFLOPS (128 runs) | F32     30.1 GFLOPS (113 runs)
1024 x 1024: Q4_0    62.7 GFLOPS ( 30 runs) | Q4_1    52.2 GFLOPS ( 25 runs) | Q4_2    38.9 GFLOPS ( 19 runs)
1024 x 1024: Q5_0    39.2 GFLOPS ( 19 runs) | Q5_1    38.2 GFLOPS ( 18 runs) | Q8_0    76.2 GFLOPS ( 36 runs)
1024 x 1024: F16     46.7 GFLOPS ( 22 runs) | F32     21.6 GFLOPS ( 11 runs)
2048 x 2048: Q4_0    60.4 GFLOPS (  4 runs) | Q4_1    50.3 GFLOPS (  3 runs) | Q4_2    39.6 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.9 GFLOPS (  3 runs) | Q5_1    35.4 GFLOPS (  3 runs) | Q8_0    66.5 GFLOPS (  4 runs)
2048 x 2048: F16     33.8 GFLOPS (  3 runs) | F32     15.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    64.2 GFLOPS (  3 runs) | Q4_1    51.2 GFLOPS (  3 runs) | Q4_2    40.2 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.7 GFLOPS (  3 runs) | Q5_1    37.2 GFLOPS (  3 runs) | Q8_0    71.5 GFLOPS (  3 runs)
4096 x 4096: F16     38.5 GFLOPS (  3 runs) | F32     20.3 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 103 | 1166 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 152 | 2888 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 379 | 10892 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 22649 | 35767 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 45427 | 73967 | be5911a |

But I don't seem to get that much variance; race-till-idle is just a preference.

Prefix (taskset -c 4-7) to further enforce not using the efficiency cores.

Tried that, played with the CPU settings (performance mode etc.), even added some better cooling, but it still keeps jumping all over the place, with the tiny model at ~2s (in the good runs) while 'htop' shows consistent 100% load on the performance cores. Q5 models are sometimes a few ms faster, sometimes slower.
When I do the same tests with the CTranslate2 Whisper version, results are pretty stable and always about twice as fast.

Dunno; just to show it, my next run is very consistent and considerably faster... ?

memcpy: 10.52 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     2.5 GFLOPS (128 runs) | Q4_1     2.5 GFLOPS (128 runs) | Q4_2     1.3 GFLOPS (128 runs)
  64 x   64: Q5_0     1.0 GFLOPS (128 runs) | Q5_1     0.6 GFLOPS (128 runs) | Q8_0     0.8 GFLOPS (128 runs)
  64 x   64: F16      1.0 GFLOPS (128 runs) | F32      1.8 GFLOPS (128 runs)
 128 x  128: Q4_0     2.8 GFLOPS (128 runs) | Q4_1     2.2 GFLOPS (128 runs) | Q4_2     6.7 GFLOPS (128 runs)
 128 x  128: Q5_0     3.2 GFLOPS (128 runs) | Q5_1     5.5 GFLOPS (128 runs) | Q8_0     3.0 GFLOPS (128 runs)
 128 x  128: F16     11.2 GFLOPS (128 runs) | F32      8.5 GFLOPS (128 runs)
 256 x  256: Q4_0    13.5 GFLOPS (128 runs) | Q4_1     8.8 GFLOPS (128 runs) | Q4_2     9.9 GFLOPS (128 runs)
 256 x  256: Q5_0    10.7 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0     7.3 GFLOPS (128 runs)
 256 x  256: F16     18.3 GFLOPS (128 runs) | F32     10.1 GFLOPS (128 runs)
 512 x  512: Q4_0    36.4 GFLOPS (128 runs) | Q4_1    31.2 GFLOPS (117 runs) | Q4_2    19.0 GFLOPS ( 71 runs)
 512 x  512: Q5_0    18.5 GFLOPS ( 69 runs) | Q5_1    20.4 GFLOPS ( 77 runs) | Q8_0    30.7 GFLOPS (115 runs)
 512 x  512: F16     33.8 GFLOPS (126 runs) | F32     20.7 GFLOPS ( 79 runs)
1024 x 1024: Q4_0    40.0 GFLOPS ( 19 runs) | Q4_1    36.4 GFLOPS ( 18 runs) | Q4_2    29.6 GFLOPS ( 14 runs)
1024 x 1024: Q5_0    32.9 GFLOPS ( 16 runs) | Q5_1    30.6 GFLOPS ( 15 runs) | Q8_0    54.2 GFLOPS ( 26 runs)
1024 x 1024: F16     44.1 GFLOPS ( 21 runs) | F32     20.0 GFLOPS ( 10 runs)
2048 x 2048: Q4_0    57.7 GFLOPS (  4 runs) | Q4_1    47.7 GFLOPS (  3 runs) | Q4_2    38.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.8 GFLOPS (  3 runs) | Q5_1    35.1 GFLOPS (  3 runs) | Q8_0    63.6 GFLOPS (  4 runs)
2048 x 2048: F16     33.6 GFLOPS (  3 runs) | F32     14.8 GFLOPS (  3 runs)
4096 x 4096: Q4_0    61.9 GFLOPS (  3 runs) | Q4_1    50.2 GFLOPS (  3 runs) | Q4_2    38.8 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.6 GFLOPS (  3 runs) | Q5_1    37.9 GFLOPS (  3 runs) | Q8_0    70.4 GFLOPS (  3 runs)
4096 x 4096: F16     38.0 GFLOPS (  3 runs) | F32     20.8 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 134 | 1176 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 179 | 2964 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 416 | 11037 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 23462 | 36469 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 47286 | 77494 | be5911a |

System76 Pangolin (pang12) w/ Ryzen 7 6800U (8c16t) @ 2.7GHz + 32GB DDR5 at 6400MT/s
Models stored on a Samsung 970 Evo Plus

Running memcpy benchmark with 1 thread

memcpy: 11.18 GB/s
sum:    error -536870997.000000

Running ggml_mul_mat benchmark with 16 threads

ggml_mul_mat:   64 x   64: Q4_0     0.9 GFLOPS (128 runs) / Q4_1     0.4 GFLOPS (128 runs) / F16     1.2 GFLOPS (128 runs) / F32     1.2 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: Q4_0     6.1 GFLOPS (128 runs) / Q4_1     7.5 GFLOPS (128 runs) / F16     4.6 GFLOPS (128 runs) / F32    10.0 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: Q4_0    26.2 GFLOPS (128 runs) / Q4_1    42.3 GFLOPS (128 runs) / F16    19.9 GFLOPS (128 runs) / F32    47.9 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: Q4_0    66.6 GFLOPS (128 runs) / Q4_1    98.6 GFLOPS (128 runs) / F16    90.1 GFLOPS (128 runs) / F32   110.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: Q4_0    97.8 GFLOPS ( 46 runs) / Q4_1   154.3 GFLOPS ( 72 runs) / F16   158.7 GFLOPS ( 74 runs) / F32   132.2 GFLOPS ( 62 runs)
ggml_mul_mat: 2048 x 2048: Q4_0   126.7 GFLOPS (  8 runs) / Q4_1   164.8 GFLOPS ( 10 runs) / F16   164.1 GFLOPS ( 10 runs) / F32    96.4 GFLOPS (  6 runs)
ggml_mul_mat: 4096 x 4096: Q4_0   138.6 GFLOPS (  3 runs) / Q4_1   166.9 GFLOPS (  3 runs) / F16   136.0 GFLOPS (  3 runs) / F32    62.8 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ryzen 7 6800U Arch Linux AVX2 tiny 16 37 510 9c61f5f
Ryzen 7 6800U Arch Linux AVX2 base 16 51 1222 9c61f5f
Ryzen 7 6800U Arch Linux AVX2 small 16 123 4283 9c61f5f
Ryzen 7 6800U Arch Linux AVX2 medium 16 341 14178 9c61f5f
Ryzen 7 6800U Arch Linux AVX2 large 16 650 25801 9c61f5f

MacBook Air M2 24GB 2022 (CoreML model)

It is interesting to note that when the model is converted to CoreML and executed, even a MacBook Air M2 reaches a processing speed close to that of a high-spec Mac, perhaps because the Neural Engine specifications are the same within a given generation of Apple Silicon.
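For reference, this is roughly how the CoreML encoder was produced and built (a sketch from memory of the repo's CoreML support; check the script name and the WHISPER_COREML flag against the README for your checkout):

# generate the CoreML encoder for a model (example: base)
./models/generate-coreml-model.sh base
# rebuild with CoreML support enabled
make clean
WHISPER_COREML=1 make -j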

./extra/bench-all.sh 4
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 34.33 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 11.4 GFLOPS (128 runs) / F32 10.5 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 89.0 GFLOPS (128 runs) / F32 74.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 422.6 GFLOPS (128 runs) / F32 419.4 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 793.4 GFLOPS (128 runs) / F32 801.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 827.0 GFLOPS (128 runs) / F32 849.3 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 821.8 GFLOPS ( 48 runs) / F32 773.4 GFLOPS ( 46 runs)
ggml_mul_mat: 4096 x 4096: F16 765.2 GFLOPS ( 6 runs) / F32 743.6 GFLOPS ( 6 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
NEON BLAS COREML tiny 4 c23588c
NEON BLAS COREML base 4 c23588c
M2 13.3.1 (a)(22E772610a) NEON BLAS COREML small 4 153 199 c23588c
M2 13.3.1 (a)(22E772610a) NEON BLAS COREML medium 4 450 746 c23588c
M2 13.3.1 (a)(22E772610a) NEON BLAS COREML large 4 1053 1439 c23588c
CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 4 2GB  Bullseye 6.1.21-v8+  OPENBLAS tiny.en 4 393  7882  14bee39
Raspberry Pi 4 2GB  Bullseye 6.1.21-v8+  OPENBLAS tiny.en-q5 4 265  8564  14bee39
Raspberry Pi 4 2GB  Bullseye 6.1.21-v8+  OPENBLAS base.en 4 571  16328  14bee39
Raspberry Pi 4 2GB  Bullseye 6.1.21-v8+  OPENBLAS base.en-q5 4 306  16169  14bee39

Tests performed using Raspberry Pi OS libopenblas-dev package (version 0.3.13+ds-3).
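For anyone reproducing this, the OpenBLAS build amounts to something like the following (a sketch; assumes the stock Makefile flag):

# install the distro OpenBLAS package and build the bench tool against it
sudo apt install libopenblas-dev
WHISPER_OPENBLAS=1 make -j bench
./extra/bench-all.sh 4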

Ryzen 3 2200GE (Lenovo M715q)

Running memcpy benchmark

memcpy: 12.14 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     5.3 GFLOPS (128 runs) | Q4_1     1.6 GFLOPS (128 runs) | Q4_2     5.2 GFLOPS (128 runs)
  64 x   64: Q5_0     5.5 GFLOPS (128 runs) | Q5_1     1.7 GFLOPS (128 runs) | Q8_0     1.7 GFLOPS (128 runs)
  64 x   64: F16      1.1 GFLOPS (128 runs) | F32      2.0 GFLOPS (128 runs)
 128 x  128: Q4_0     9.9 GFLOPS (128 runs) | Q4_1    10.8 GFLOPS (128 runs) | Q4_2     9.8 GFLOPS (128 runs)
 128 x  128: Q5_0    16.7 GFLOPS (128 runs) | Q5_1    19.0 GFLOPS (128 runs) | Q8_0    20.6 GFLOPS (128 runs)
 128 x  128: F16      9.4 GFLOPS (128 runs) | F32     29.8 GFLOPS (128 runs)
 256 x  256: Q4_0    26.1 GFLOPS (128 runs) | Q4_1    29.4 GFLOPS (128 runs) | Q4_2    31.2 GFLOPS (128 runs)
 256 x  256: Q5_0    28.4 GFLOPS (128 runs) | Q5_1    31.0 GFLOPS (128 runs) | Q8_0    32.5 GFLOPS (128 runs)
 256 x  256: F16     21.5 GFLOPS (128 runs) | F32     41.6 GFLOPS (128 runs)
 512 x  512: Q4_0    41.4 GFLOPS (128 runs) | Q4_1    42.7 GFLOPS (128 runs) | Q4_2    43.2 GFLOPS (128 runs)
 512 x  512: Q5_0    39.2 GFLOPS (128 runs) | Q5_1    37.2 GFLOPS (128 runs) | Q8_0    56.7 GFLOPS (128 runs)
 512 x  512: F16     29.3 GFLOPS (110 runs) | F32     56.0 GFLOPS (128 runs)
1024 x 1024: Q4_0    52.5 GFLOPS ( 25 runs) | Q4_1    51.6 GFLOPS ( 25 runs) | Q4_2    48.3 GFLOPS ( 23 runs)
1024 x 1024: Q5_0    44.1 GFLOPS ( 21 runs) | Q5_1    41.9 GFLOPS ( 20 runs) | Q8_0    71.4 GFLOPS ( 34 runs)
1024 x 1024: F16     30.4 GFLOPS ( 15 runs) | F32     35.5 GFLOPS ( 17 runs)
2048 x 2048: Q4_0    54.6 GFLOPS (  4 runs) | Q4_1    50.6 GFLOPS (  3 runs) | Q4_2    49.8 GFLOPS (  3 runs)
2048 x 2048: Q5_0    44.8 GFLOPS (  3 runs) | Q5_1    40.8 GFLOPS (  3 runs) | Q8_0    67.1 GFLOPS (  4 runs)
2048 x 2048: F16     29.1 GFLOPS (  3 runs) | F32     20.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    54.3 GFLOPS (  3 runs) | Q4_1    50.0 GFLOPS (  3 runs) | Q4_2    49.5 GFLOPS (  3 runs)
4096 x 4096: Q5_0    44.7 GFLOPS (  3 runs) | Q5_1    40.2 GFLOPS (  3 runs) | Q8_0    64.0 GFLOPS (  3 runs)
4096 x 4096: F16     28.3 GFLOPS (  3 runs) | F32     19.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | tiny | 4 | 68 | 1676 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | base | 4 | 96 | 3850 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | small | 4 | 235 | 14734 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | medium | 4 | 660 | 49288 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | large | 4 | 1302 | 105757 | 2b6a074 |

This is what I get with clblast on an AMD RX6700XT:

Running memcpy benchmark

memcpy: 11.94 GB/s (1 thread)
sum: -536869898.000000

Running ggml_mul_mat benchmark with 16 threads

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx1031
64 x 64: Q4_0 0.8 GFLOPS (128 runs) | Q4_1 0.8 GFLOPS (128 runs)
64 x 64: Q5_0 0.8 GFLOPS (128 runs) | Q5_1 0.8 GFLOPS (128 runs) | Q8_0 0.8 GFLOPS (128 runs)
64 x 64: F16 0.8 GFLOPS (128 runs) | F32 0.8 GFLOPS (128 runs)
128 x 128: Q4_0 5.6 GFLOPS (128 runs) | Q4_1 5.6 GFLOPS (128 runs)
128 x 128: Q5_0 6.1 GFLOPS (128 runs) | Q5_1 5.7 GFLOPS (128 runs) | Q8_0 6.1 GFLOPS (128 runs)
128 x 128: F16 5.8 GFLOPS (128 runs) | F32 6.0 GFLOPS (128 runs)
256 x 256: Q4_0 43.4 GFLOPS (128 runs) | Q4_1 40.3 GFLOPS (128 runs)
256 x 256: Q5_0 38.2 GFLOPS (128 runs) | Q5_1 39.2 GFLOPS (128 runs) | Q8_0 39.0 GFLOPS (128 runs)
256 x 256: F16 38.3 GFLOPS (128 runs) | F32 38.6 GFLOPS (128 runs)
512 x 512: Q4_0 210.9 GFLOPS (128 runs) | Q4_1 212.8 GFLOPS (128 runs)
512 x 512: Q5_0 212.0 GFLOPS (128 runs) | Q5_1 213.2 GFLOPS (128 runs) | Q8_0 210.2 GFLOPS (128 runs)
512 x 512: F16 195.5 GFLOPS (128 runs) | F32 208.7 GFLOPS (128 runs)
1024 x 1024: Q4_0 1280.6 GFLOPS (128 runs) | Q4_1 1289.0 GFLOPS (128 runs)
1024 x 1024: Q5_0 1292.2 GFLOPS (128 runs) | Q5_1 1287.4 GFLOPS (128 runs) | Q8_0 1271.0 GFLOPS (128 runs)
1024 x 1024: F16 1025.9 GFLOPS (128 runs) | F32 1227.8 GFLOPS (128 runs)
2048 x 2048: Q4_0 3423.2 GFLOPS (128 runs) | Q4_1 3414.1 GFLOPS (128 runs)
2048 x 2048: Q5_0 3393.6 GFLOPS (128 runs) | Q5_1 3385.8 GFLOPS (128 runs) | Q8_0 3385.2 GFLOPS (128 runs)
2048 x 2048: F16 2434.4 GFLOPS (128 runs) | F32 3045.8 GFLOPS (128 runs)
4096 x 4096: Q4_0 4187.6 GFLOPS ( 31 runs) | Q4_1 4193.6 GFLOPS ( 31 runs)
4096 x 4096: Q5_0 4204.3 GFLOPS ( 31 runs) | Q5_1 4187.1 GFLOPS ( 31 runs) | Q8_0 4135.0 GFLOPS ( 31 runs)
4096 x 4096: F16 3491.1 GFLOPS ( 26 runs) | F32 3911.3 GFLOPS ( 29 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Ryzen 5950X / RX6700XT Arch AVX2 BLAS tiny 16 382 603 95b02d7
Ryzen 5950X / RX6700XT Arch AVX2 BLAS base 16 371 717 95b02d7
Ryzen 5950X / RX6700XT Arch AVX2 BLAS small 16 427 1271 95b02d7
Ryzen 5950X / RX6700XT Arch AVX2 BLAS medium 16 636 2784 95b02d7
Ryzen 5950X / RX6700XT Arch AVX2 BLAS large 16 868 4308 95b02d7

Thinkpad T480, Core i7 8550U

Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 12.67 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.1 GFLOPS (128 runs) | Q4_1     6.4 GFLOPS (128 runs)
  64 x   64: Q5_0     6.6 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0     6.3 GFLOPS (128 runs)
  64 x   64: F16      7.8 GFLOPS (128 runs) | F32      5.4 GFLOPS (128 runs)
 128 x  128: Q4_0    25.3 GFLOPS (128 runs) | Q4_1    25.5 GFLOPS (128 runs)
 128 x  128: Q5_0    29.6 GFLOPS (128 runs) | Q5_1    26.9 GFLOPS (128 runs) | Q8_0    31.7 GFLOPS (128 runs)
 128 x  128: F16     34.8 GFLOPS (128 runs) | F32     13.8 GFLOPS (128 runs)
 256 x  256: Q4_0    49.9 GFLOPS (128 runs) | Q4_1    43.3 GFLOPS (128 runs)
 256 x  256: Q5_0    46.6 GFLOPS (128 runs) | Q5_1    45.4 GFLOPS (128 runs) | Q8_0    64.0 GFLOPS (128 runs)
 256 x  256: F16     61.2 GFLOPS (128 runs) | F32     18.7 GFLOPS (128 runs)
 512 x  512: Q4_0    66.7 GFLOPS (128 runs) | Q4_1    54.7 GFLOPS (128 runs)
 512 x  512: Q5_0    53.5 GFLOPS (128 runs) | Q5_1    57.9 GFLOPS (128 runs) | Q8_0    80.6 GFLOPS (128 runs)
 512 x  512: F16     65.5 GFLOPS (128 runs) | F32     22.2 GFLOPS ( 83 runs)
1024 x 1024: Q4_0    77.7 GFLOPS ( 37 runs) | Q4_1    66.9 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    66.3 GFLOPS ( 31 runs) | Q5_1    60.2 GFLOPS ( 29 runs) | Q8_0    91.6 GFLOPS ( 44 runs)
1024 x 1024: F16     63.8 GFLOPS ( 30 runs) | F32     21.2 GFLOPS ( 10 runs)
2048 x 2048: Q4_0    74.3 GFLOPS (  5 runs) | Q4_1    71.1 GFLOPS (  5 runs)
2048 x 2048: Q5_0    59.5 GFLOPS (  4 runs) | Q5_1    56.4 GFLOPS (  4 runs) | Q8_0    90.2 GFLOPS (  6 runs)
2048 x 2048: F16     49.9 GFLOPS (  3 runs) | F32     15.9 GFLOPS (  3 runs)
4096 x 4096: Q4_0    61.1 GFLOPS (  3 runs) | Q4_1    54.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    48.4 GFLOPS (  3 runs) | Q5_1    45.1 GFLOPS (  3 runs) | Q8_0    62.7 GFLOPS (  3 runs)
4096 x 4096: F16     38.4 GFLOPS (  3 runs) | F32     12.9 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |

I don't know why it stopped when it wanted to run the benchmark for all models. I have ggml-base.en.bin, and I have for-tests-ggml*.bin.

@randomshinichi That is what it does when the non-.en models are not available.
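A minimal sketch of fetching the multilingual models so the full benchmark can run (the download script ships in models/; the list of sizes here is just an example):

# fetch the non-.en models used by bench-all.sh, then rerun the benchmark
for m in tiny base small medium large; do ./models/download-ggml-model.sh $m; done
./extra/bench-all.sh 4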

Jetson Orin Nano (Developer Kit) - Unoptimised install (no CLBlast, CUBLAS etc)

Running memcpy benchmark

memcpy: 6.28 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     4.1 GFLOPS (128 runs) | Q4_1     4.2 GFLOPS (128 runs)
  64 x   64: Q5_0     4.2 GFLOPS (128 runs) | Q5_1     4.1 GFLOPS (128 runs) | Q8_0     4.6 GFLOPS (128 runs)
  64 x   64: F16      4.0 GFLOPS (128 runs) | F32      5.2 GFLOPS (128 runs)
 128 x  128: Q4_0    12.9 GFLOPS (128 runs) | Q4_1    13.2 GFLOPS (128 runs)
 128 x  128: Q5_0    12.7 GFLOPS (128 runs) | Q5_1    12.5 GFLOPS (128 runs) | Q8_0    14.1 GFLOPS (128 runs)
 128 x  128: F16      9.3 GFLOPS (128 runs) | F32     20.9 GFLOPS (128 runs)
 256 x  256: Q4_0    17.9 GFLOPS (128 runs) | Q4_1    17.5 GFLOPS (128 runs)
 256 x  256: Q5_0    17.8 GFLOPS (128 runs) | Q5_1    16.2 GFLOPS (128 runs) | Q8_0    20.3 GFLOPS (128 runs)
 256 x  256: F16     10.4 GFLOPS (128 runs) | F32     28.8 GFLOPS (128 runs)
 512 x  512: Q4_0    21.1 GFLOPS ( 79 runs) | Q4_1    20.0 GFLOPS ( 75 runs)
 512 x  512: Q5_0    18.6 GFLOPS ( 70 runs) | Q5_1    19.1 GFLOPS ( 72 runs) | Q8_0    22.0 GFLOPS ( 83 runs)
 512 x  512: F16     10.5 GFLOPS ( 40 runs) | F32     25.7 GFLOPS ( 97 runs)
1024 x 1024: Q4_0    20.6 GFLOPS ( 10 runs) | Q4_1    20.4 GFLOPS ( 10 runs)
1024 x 1024: Q5_0    20.2 GFLOPS ( 10 runs) | Q5_1    18.7 GFLOPS (  9 runs) | Q8_0    23.2 GFLOPS ( 11 runs)
1024 x 1024: F16     11.4 GFLOPS (  6 runs) | F32     16.6 GFLOPS (  8 runs)
2048 x 2048: Q4_0    22.3 GFLOPS (  3 runs) | Q4_1    22.4 GFLOPS (  3 runs)
2048 x 2048: Q5_0    22.0 GFLOPS (  3 runs) | Q5_1    20.9 GFLOPS (  3 runs) | Q8_0    25.8 GFLOPS (  3 runs)
2048 x 2048: F16     11.9 GFLOPS (  3 runs) | F32     11.5 GFLOPS (  3 runs)
4096 x 4096: Q4_0    22.7 GFLOPS (  3 runs) | Q4_1    22.6 GFLOPS (  3 runs)
4096 x 4096: Q5_0    22.2 GFLOPS (  3 runs) | Q5_1    21.0 GFLOPS (  3 runs) | Q8_0    26.2 GFLOPS (  3 runs)
4096 x 4096: F16     12.0 GFLOPS (  3 runs) | F32     13.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON tiny 4 117 3631 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON base 4 153 8603 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON small 4 323 33605 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON medium 4 1059 111404 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON large 4 3187 222130 5e2b340

@mark-beeby Are you sure everything is correct with your distro? Your results are really bad compared to what I was expecting, as I've been looking forward to seeing what an Orin Nano can do.

Check out the rk3588 results in #89 (comment), as that is a 4x A76 with DDR4, not DDR5...

Also interested in what you get with cuBLAS, or with CLBlast: https://github.com/ggerganov/whisper.cpp#opencl-gpu-support-via-clblast

Jetson Orin Nano (Developer Kit) - CUBLAS

Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 6.26 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     1.0 GFLOPS (128 runs) | Q4_1     0.9 GFLOPS (128 runs)
  64 x   64: Q5_0     0.7 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     1.0 GFLOPS (128 runs)
  64 x   64: F16      1.0 GFLOPS (128 runs) | F32      0.9 GFLOPS (128 runs)
 128 x  128: Q4_0     6.8 GFLOPS (128 runs) | Q4_1     7.3 GFLOPS (128 runs)
 128 x  128: Q5_0     7.8 GFLOPS (128 runs) | Q5_1     7.8 GFLOPS (128 runs) | Q8_0     7.8 GFLOPS (128 runs)
 128 x  128: F16      8.0 GFLOPS (128 runs) | F32      7.7 GFLOPS (128 runs)
 256 x  256: Q4_0    57.1 GFLOPS (128 runs) | Q4_1    62.5 GFLOPS (128 runs)
 256 x  256: Q5_0    62.3 GFLOPS (128 runs) | Q5_1    62.8 GFLOPS (128 runs) | Q8_0    64.6 GFLOPS (128 runs)
 256 x  256: F16     38.7 GFLOPS (128 runs) | F32     38.6 GFLOPS (128 runs)
 512 x  512: Q4_0   248.6 GFLOPS (128 runs) | Q4_1   250.9 GFLOPS (128 runs)
 512 x  512: Q5_0   250.2 GFLOPS (128 runs) | Q5_1   248.7 GFLOPS (128 runs) | Q8_0   247.8 GFLOPS (128 runs)
 512 x  512: F16    215.2 GFLOPS (128 runs) | F32    210.5 GFLOPS (128 runs)
1024 x 1024: Q4_0   884.6 GFLOPS (128 runs) | Q4_1   882.7 GFLOPS (128 runs)
1024 x 1024: Q5_0   879.2 GFLOPS (128 runs) | Q5_1   872.7 GFLOPS (128 runs) | Q8_0   632.0 GFLOPS (128 runs)
1024 x 1024: F16    651.2 GFLOPS (128 runs) | F32    627.2 GFLOPS (128 runs)
2048 x 2048: Q4_0  1349.9 GFLOPS ( 79 runs) | Q4_1  1337.1 GFLOPS ( 78 runs)
2048 x 2048: Q5_0  1332.3 GFLOPS ( 78 runs) | Q5_1  1327.7 GFLOPS ( 78 runs) | Q8_0  1304.8 GFLOPS ( 76 runs)
2048 x 2048: F16   1401.6 GFLOPS ( 82 runs) | F32   1140.0 GFLOPS ( 67 runs)
4096 x 4096: Q4_0  1967.6 GFLOPS ( 15 runs) | Q4_1  1962.9 GFLOPS ( 15 runs)
4096 x 4096: Q5_0  1956.3 GFLOPS ( 15 runs) | Q5_1  1952.7 GFLOPS ( 15 runs) | Q8_0  1929.9 GFLOPS ( 15 runs)
4096 x 4096: F16   2603.2 GFLOPS ( 19 runs) | F32   1742.4 GFLOPS ( 13 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS tiny 4 1296 544 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS base 4 1350 1015 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS small 4 1557 2901 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS medium 4 2303 7977 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS large 4 6716 12913 5e2b340

@StuartIanNaylor I struggled to get CLBlast installed and moved back to a CUDA install; after a few hiccups and setting export CUDA_VISIBLE_DEVICES=0 I got the much more favourable results above. Hope that helps!
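For reference, the CUDA build amounts to something like this (a sketch; WHISPER_CUBLAS is the Makefile flag as I remember it, so double-check the README for your checkout):

# rebuild with cuBLAS and point the runtime at the first GPU
make clean
WHISPER_CUBLAS=1 make -j bench
export CUDA_VISIBLE_DEVICES=0
./extra/bench-all.sh 4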

New desktop I built - CPU i7-13700K (turbo overclock +200MHz base), DDR5 @ 5600MT/s, GPU Intel Arc A770 LE

I tried different thread counts before settling on 20; anything past 20 resulted in a drop in performance, which is to be expected.
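The thread count is simply the first argument to the bench script, so the sweep looked roughly like this (the exact counts shown are illustrative):

# try a few thread counts and compare the Enc. times
./extra/bench-all.sh 16
./extra/bench-all.sh 20
./extra/bench-all.sh 24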

Running memcpy benchmark

memcpy: 23.16 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 20 threads


Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: Intel(R) OpenCL HD Graphics Device: Intel(R) Arc(TM) A770 Graphics
  64 x   64: Q4_0     0.9 GFLOPS (128 runs) | Q4_1     1.0 GFLOPS (128 runs)
  64 x   64: Q5_0     1.0 GFLOPS (128 runs) | Q5_1     1.0 GFLOPS (128 runs) | Q8_0     1.0 GFLOPS (128 runs)
  64 x   64: F16      1.0 GFLOPS (128 runs) | F32      1.0 GFLOPS (128 runs)
 128 x  128: Q4_0     5.6 GFLOPS (128 runs) | Q4_1     5.8 GFLOPS (128 runs)
 128 x  128: Q5_0     5.7 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     5.0 GFLOPS (128 runs)
 128 x  128: F16      5.6 GFLOPS (128 runs) | F32      5.5 GFLOPS (128 runs)
 256 x  256: Q4_0    40.4 GFLOPS (128 runs) | Q4_1    38.9 GFLOPS (128 runs)
 256 x  256: Q5_0    40.7 GFLOPS (128 runs) | Q5_1    40.3 GFLOPS (128 runs) | Q8_0    38.5 GFLOPS (128 runs)
 256 x  256: F16     40.8 GFLOPS (128 runs) | F32     40.8 GFLOPS (128 runs)
 512 x  512: Q4_0   260.5 GFLOPS (128 runs) | Q4_1   264.6 GFLOPS (128 runs)
 512 x  512: Q5_0   234.3 GFLOPS (128 runs) | Q5_1   254.8 GFLOPS (128 runs) | Q8_0   260.2 GFLOPS (128 runs)
 512 x  512: F16    223.7 GFLOPS (128 runs) | F32    261.0 GFLOPS (128 runs)
1024 x 1024: Q4_0  1158.0 GFLOPS (128 runs) | Q4_1  1158.2 GFLOPS (128 runs)
1024 x 1024: Q5_0  1119.2 GFLOPS (128 runs) | Q5_1  1157.4 GFLOPS (128 runs) | Q8_0  1125.5 GFLOPS (128 runs)
1024 x 1024: F16    871.3 GFLOPS (128 runs) | F32   1029.7 GFLOPS (128 runs)
2048 x 2048: Q4_0  2847.7 GFLOPS (128 runs) | Q4_1  2749.8 GFLOPS (128 runs)
2048 x 2048: Q5_0  2752.3 GFLOPS (128 runs) | Q5_1  2879.4 GFLOPS (128 runs) | Q8_0  2770.3 GFLOPS (128 runs)
2048 x 2048: F16   2061.0 GFLOPS (120 runs) | F32   2504.5 GFLOPS (128 runs)
4096 x 4096: Q4_0  4681.2 GFLOPS ( 35 runs) | Q4_1  4637.2 GFLOPS ( 34 runs)
4096 x 4096: Q5_0  4646.7 GFLOPS ( 34 runs) | Q5_1  4586.6 GFLOPS ( 34 runs) | Q8_0  4589.7 GFLOPS ( 34 runs)
4096 x 4096: F16   3444.7 GFLOPS ( 26 runs) | F32   4128.2 GFLOPS ( 31 runs)
CPU OS Config Model Th Load Enc. Commit
Intel Core i7-13700K Arch Linux AVX2 BLAS tiny 20 145 417 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS base 20 161 560 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS small 20 281 1072 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS medium 20 606 2771 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS large 20 1116 4105 5e2b340

CPU power draw during these last tests averaged 140 watts, peaking at 141. GPU metrics are currently not exposed in Linux for Arc, so I'm unable to check what that was drawing.