ggerganov/whisper.cpp

Benchmark results

ggerganov opened this issue · 162 comments

Encoder

Collection of bench results for various platforms and devices.
If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh script and report the results in the comments below.

Suggestions for a better summary of the results are welcome

CPU OS Config Model Th Load Enc. Commit
MacBook M1 Pro MacOS 13.0.1 NEON BLAS tiny 8 71 102 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS base 8 96 220 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 8 233 685 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS medium 8 603 1928 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS large 8 1158 3350 206fc93
---
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 1 251 2605 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 4 255 884 206fc93
---
Mac Mini M1 MacOS NEON BLAS tiny 4 62 194 fcf515d
Mac Mini M1 MacOS NEON BLAS base 4 81 380 fcf515d
Mac Mini M1 MacOS NEON BLAS small 4 204 1249 fcf515d
Mac Mini M1 MacOS NEON BLAS medium 4 876 3980 fcf515d
Mac Mini M1 MacOS NEON BLAS large 4 1876 7979 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 tiny 8 107 422 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 base 8 137 880 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 small 8 280 2874 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 medium 8 692 9610 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 large 8 1317 16917 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS tiny 4 120 780 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS base 4 151 1173 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS small 4 289 3062 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS medium 4 711 9175 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS large 4 1282 16050 fcf515d
---
Ryzen 9 5950X Ubuntu 22.04 AVX2 tiny 8 135 197 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 base 8 176 421 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 small 8 357 1393 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 medium 8 855 4404 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 large 8 1576 8118 fcf515d
---
Raspberry Pi 4 NEON tiny 4 1436 13839 fcf515d
Raspberry Pi 4 NEON base 4 1894 30552 fcf515d
---
iPhone 13 Mini iOS 16.0 NEON BLAS base 4 97 1091 fcf515d
---
MacBook M1 Pro Vivaldi WASM tiny 8 133 3785 fcf515d
MacBook M1 Pro Vivaldi WASM base 8 172 8253 fcf515d
---
MacBook M1 Pro Chrome WASM tiny 8 134 3776 fcf515d
MacBook M1 Pro Chrome WASM base 8 168 8200 fcf515d
---
MacBook M1 Pro Firefox WASM tiny 8 137 2626 fcf515d
MacBook M1 Pro Firefox WASM base 8 183 6226 fcf515d

memcpy

MacBook M1 Pro

./bench -w 1 -t 1
memcpy: 37.59 GB/s

Ryzen 9 5950X

./bench -w 1 -t 1
memcpy: 16.74 GB/s
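
For context, the memcpy figure is a single-threaded memory-bandwidth micro-benchmark. A minimal sketch of the idea (illustrative only, not the actual code behind ./bench -w 1) could look like this:

// memcpy bandwidth sketch (POSIX, compile with: cc -O2 memcpy_bw.c)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    const size_t n = (size_t)256 * 1024 * 1024;   // 256 MiB per copy
    char *src = malloc(n);
    char *dst = malloc(n);
    memset(src, 1, n);                            // pre-fault the pages
    memset(dst, 0, n);

    struct timespec t0, t1;
    const int iters = 8;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        memcpy(dst, src, n);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gbps = (double)iters * n / sec / 1e9;
    // print a byte of dst so the copies are not optimized away
    printf("memcpy: %.2f GB/s (%d)\n", gbps, dst[n - 1]);

    free(src);
    free(dst);
    return 0;
}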

ggml_mul_mat

MacBook M1 Pro

./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16    330.6 GFLOPS (128 runs) / F32    466.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16    737.5 GFLOPS (128 runs) / F32    838.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    938.6 GFLOPS (128 runs) / F32   1062.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1312.5 GFLOPS (128 runs) / F32   1835.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1765.1 GFLOPS (128 runs) / F32   2041.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1784.3 GFLOPS (104 runs) / F32   1859.2 GFLOPS (109 runs)
ggml_mul_mat:  4096 x  4096: F16   1855.1 GFLOPS ( 14 runs) / F32   1873.3 GFLOPS ( 14 runs)

Ryzen 9 5950X

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     56.3 GFLOPS (128 runs) / F32     70.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     47.8 GFLOPS (128 runs) / F32     67.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    185.1 GFLOPS (128 runs) / F32    332.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    386.4 GFLOPS (128 runs) / F32    658.6 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    636.2 GFLOPS (128 runs) / F32   1012.0 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    950.9 GFLOPS ( 56 runs) / F32   1296.8 GFLOPS ( 76 runs)
ggml_mul_mat:  4096 x  4096: F16   1168.6 GFLOPS (  9 runs) / F32   1403.1 GFLOPS ( 11 runs)
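
For reference, the GFLOPS figures are derived from the usual 2*N^3 floating-point operation count for an NxN matrix multiplication divided by the measured time. A naive illustration of how such a number is produced (this only shows the metric, not how ggml achieves its throughput):

// GFLOPS metric sketch: time an NxN matmul and divide 2*N^3 by the elapsed time
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int N = 512;
    float *A = calloc((size_t)N * N, sizeof(float));
    float *B = calloc((size_t)N * N, sizeof(float));
    float *C = calloc((size_t)N * N, sizeof(float));

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec    = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gflops = 2.0 * N * N * N / sec / 1e9;
    printf("%4d x %4d: %6.1f GFLOPS\n", N, N, gflops);

    free(A); free(B); free(C);
    return 0;
}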

Results for Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-4790K Debian   tiny.en 4 165 808
i7-4790K Debian   tiny.en 8 165 783
i7-4790K Debian   base.en 4 212 1813
i7-4790K Debian   base.en 8 214 1746

Results for Ryzen 5 4500U 6C/6T laptop CPU (I've just included one result for 8 threads as Encode time is much higher when threads > CPU cores).

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 4 170.00 829.43
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 6 143.03 671.74
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 4 305.92 2,092.39
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 6 188.05 1,495.61
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 4 408.03 6,919.31
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 6 359.23 6,370.83
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 4 2,238.11 25,863.28
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 6 1,113.04 19,672.63
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 8 973.65 39,619.20
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 tiny 2 164.35 1087.61
i7-11800H WSL2 Ubuntu AVX2 tiny 4 128.94 733.24
i7-11800H WSL2 Ubuntu AVX2 tiny 8 137.57 619.88
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 2 143.02 1087.15
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 4 127.60 730.57
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 8 125.62 616.27
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 2 132.59 1511.38
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 4 132.48 1407.49
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 8 133.82 1458.27
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 base 2 174.34 2533.79
i7-11800H WSL2 Ubuntu AVX2 base 4 166.68 1830.67
i7-11800H WSL2 Ubuntu AVX2 base 8 165.53 1478.73
i7-11800H WSL2 Ubuntu AVX2 small 2 340.12 8714.24
i7-11800H WSL2 Ubuntu AVX2 small 4 394.32 6021.41
i7-11800H WSL2 Ubuntu AVX2 small 8 305.98 4828.84
i7-11800H WSL2 Ubuntu AVX2 large 2 3205.36 57109.10
i7-11800H WSL2 Ubuntu AVX2 large 4 2720.25 38519.89
i7-11800H WSL2 Ubuntu AVX2 large 8 3716.34 27739.99
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 2 1954.21 54966.84
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 4 1455.40 37320.62
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 8 1372.58 27937.64

This performance is impressive!

M1 Pro | MacOS |   | large | 8 | 1973 | 4208

This performance is impressive!

Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares.

By the way, AVX-512 is not supported on master. I have added initial support here, but I am not sure if it works: #95

CPU OS Config Model Threads Load [ms] Encode [ms]
Intel® Core™ i5-8250U Win11 Home AVX2 Large 8 2226.85 61547.61

compiled with MinGW64 gcc 11.3

Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)

CPU OS Config Model Threads Load [ms] Encode [ms]
AMD Custom APU 0405 SteamOS 3.2 AVX2 Base 8 326.32 2592.96

Compiled with cc (GCC) 11.3.0

The performance gains on jfk.wav since last test (two weeks or so ago) are extremely impressive, ~10-20x speedup from 40 to 2-4 seconds.

CPU OS Config Model Threads Load [ms] Encode [ms]
MacBook M1 Max macOS Ventura BLAS small 1 299.09 4166.00
MacBook M1 Max macOS Ventura BLAS small 4 329.45 1304.32
MacBook M1 Max macOS Ventura BLAS base 1 139.10 1302.17
MacBook M1 Max macOS Ventura BLAS base 4 135.96 399.45

On an AMD EPYC 64-core / 240-thread cloud instance, it gets stuck like this with 240 threads. I noticed that above a certain number of threads it's slow, or the cloud provider is CPU-limiting. Can anyone else with real hardware check if this is the case?

time ./main -m models/ggml-base.en.bin -f elon.wav -t 240
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 240 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ..

So I have tried various numbers of threads with the above-mentioned cloud provider.

I found that anything above 64 threads gets slower, and it is usable up to 120 threads. Anything above that hangs. It must be that the cloud provider is throttling the free trial, or too many threads really do slow things down.

...
...
processor       : 239
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7742 64-Core Processor
stepping        : 0
microcode       : 0x830104d
cpu MHz         : 2245.780
cache size      : 512 KB
physical id     : 1
siblings        : 120
core id         : 59
cpu cores       : 60
apicid          : 247
initial apicid  : 247
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4491.56
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:
time ./main -m models/ggml-base.en.bin -f elon.wav -t 64
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 64 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 64 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.960]   [MUSIC PLAYING]
[00:00:03.960 --> 00:00:18.240]   In life, we've seen within this part of the world
...
...
[00:35:40.320 --> 00:35:41.920]   Thank you, and have a great day.
[00:35:41.920 --> 00:35:43.920]   [APPLAUSE]
[00:35:43.920 --> 00:35:45.920]   [MUSIC PLAYING]
[00:35:45.920 --> 00:35:56.240]   [VIDEO PLAYBACK]


whisper_print_timings:     load time =   249.61 ms
whisper_print_timings:      mel time =  1267.11 ms
whisper_print_timings:   sample time =  1718.69 ms
whisper_print_timings:   encode time = 63702.25 ms / 10617.04 ms per layer
whisper_print_timings:   decode time = 381317.66 ms / 63552.94 ms per layer
whisper_print_timings:    total time = 448362.19 ms

real    7m28.411s
user    347m2.230s
sys     22m42.511s

32 threads was faster than 64 threads. I think 32 threads took around 7 minutes or so.

Env: Restricted Cloud / Throttled Maybe

CPU: AMD EPYC 7742 64-Core Processor

OS:

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal
Linux XXXX 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Compiler:

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1) 
$ ./bench -m ./models/ggml-small.en.bin -t 4
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

whisper_print_timings:     load time =   515.02 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6878.32 ms / 573.19 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  7393.42 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler
$ ./bench -m ./models/ggml-small.en.bin -t 240
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

whisper_print_timings:     load time =   528.66 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 12898.34 ms / 1074.86 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 13427.03 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler

I'll remove the above posts if they're too much clutter.

@trholding
Thanks for the results.

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Regarding the threads - yes, it seems that going beyond 8 threads does not help regardless of how many cores you have. My guess is that the computation is memory-bound so that's why using more threads does not improve the performance.
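
As a rough back-of-envelope under that hypothesis (using numbers from this thread, purely illustrative): a single pass over the small model's ~464 MB of weights at the ~16.7 GB/s memcpy bandwidth measured on the Ryzen 5950X above already costs about 0.464 / 16.7 ≈ 0.028 s, and the Encoder streams through the weights and large activation buffers many times, so once a handful of threads saturate the memory bus, additional threads mostly wait on memory rather than doing extra useful work.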

Okay, 8 threads max. So for a large file, is there a possibility of splitting the file into chunks, with silences as terminators, dividing the conversion across ((total threads/cores)/8) workers, while also keeping track of timestamps? This could be awesome for batch conversion.
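
A hypothetical sketch of the silence-splitting half of that idea, purely for illustration (the function name, window size and threshold below are made up; nothing like this is implied to exist in whisper.cpp): scan fixed windows of 16 kHz mono PCM, mark low-energy windows as candidate cut points, then transcribe the chunks independently and add each chunk's sample offset back onto its timestamps.

// Hypothetical helper: find candidate split points in quiet regions of 16 kHz mono PCM.
// Thresholds and window sizes are arbitrary illustrative values.
#include <math.h>
#include <stddef.h>

// Writes up to max_splits sample offsets into `splits`, returns how many were found.
size_t find_silence_splits(const float *pcm, size_t n_samples,
                           size_t *splits, size_t max_splits) {
    const size_t win        = 16000 / 10;    // 100 ms windows at 16 kHz
    const float  rms_thresh = 0.01f;         // "silence" threshold (arbitrary)
    const size_t min_gap    = 16000 * 30;    // keep chunks at least ~30 s long

    size_t n_splits = 0, last_split = 0;
    for (size_t i = 0; i + win <= n_samples && n_splits < max_splits; i += win) {
        double acc = 0.0;
        for (size_t j = 0; j < win; j++) {
            acc += (double)pcm[i + j] * pcm[i + j];
        }
        if (sqrt(acc / win) < rms_thresh && i - last_split >= min_gap) {
            splits[n_splits++] = i;          // cut inside a quiet window
            last_split = i;
        }
    }
    return n_splits;
}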

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Oh, I didn't know, I'll update with tables soon and remove my previous comments in a few hours.

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Hey, sorry. That didn't pan out well. I did the benchmark three times and my account got deleted without notice. I couldn't get the logs as it was a web terminal. On the other hand, I am happy this happened: I was giving serious thought to purchasing a GPU+CPU plan there, so a performance check of the CPU was equally important. Technically it was probably my fault - I probably shouldn't have used a reverse shell and run benchmarks on a free trial, but how else does one know whether a service is really good or all just vapor...

Dell Precision 5560 laptop results:

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11850H Ubuntu AVX2 tiny 4 115.87 538.43
i7-11850H Ubuntu AVX2 base 4 145.14 1241.84
i7-11850H Ubuntu AVX2 small 4 299.30 4343.57
i7-11850H Ubuntu AVX2 medium 4 760.98 15238.31
i7-11850H Ubuntu AVX2 large 4 1404.32 27476.86
i7-11850H Ubuntu AVX2 tiny 8 131.96 358.81
i7-11850H Ubuntu AVX2 base 8 166.61 839.31
i7-11850H Ubuntu AVX2 small 8 320.29 2854.86
i7-11850H Ubuntu AVX2 medium 8 756.20 9829.62
i7-11850H Ubuntu AVX2 large 8 1382.38 19872.81
CPU OS Config Model Threads Load [ms] Encode [ms]
i9-9900K WSL2 Ubuntu (GCC) AVX2  tiny.en 4 85.71 601.56
i9-9900K WSL2 Ubuntu (GCC) AVX2  small.en 4 212.59 5146.23
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2  tiny.en 4 198.17 455.12
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2  base.en 4 272.62 909.71
i9-9900K OSX 10.14.1 (hackintosh - GCC) AVX2 small.en 4 598.75 2968.75
Xeon(R) Silver 4210R CPU @ 2.40GHz Virtual Machine - Debian Stretch (GCC - master branch) AVX2 avx512f avx512dq avx512cd avx512bw avx512vl small.en 4 776.56 12340.41
Xeon(R) Silver 4210R CPU @ 2.40GHz Virtual Machine - Debian Stretch (GCC - master branch) AVX2 avx512f avx512dq avx512cd avx512bw avx512vl tiny.en 4 295.54 1710.46
CPU OS Config Model Threads Load [ms] Encode [ms]
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 4 124.28 656.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 8 123.70 696.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 4 159.91 1754.44
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 8 164.47 1658.55
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 4 330.91 6161.86
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 8 346.22 5187.85
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 - small.en 4 1,314.25 294,168.09

Compiled with VS 2022

Something is off, right?

Yup - you are missing the AVX2 flag. See if some of the comments in #5 can help you resolve this.

OK, the AVX2 flag seems to help :)

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 AVX2 small.en 4 527.59 9,648.67

Compiled with VS 2022

j1nx commented
CPU OS Config Model Threads Load [ms] Encode [ms] Remarks
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 861.34 29428.21 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 843.80 16145.62 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 835.68 21509.08 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 824.24 13187.96 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 1146.02 87615.00 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 1103.39 52228.30 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 1183.47 55256.20 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 1161.32 29851.40 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 752.64 24018.10 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 751.96 13082.95 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 743.37 10122.80 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 742.90 9564.89 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 974.46 71587.61 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 979.65 43852.07 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 982.24 24814.62 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 982.80 19910.19 Without OVOS services running

From the stream repo


CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny.en 4 243.54 ms 779.49 ms
RK3588 Ubuntu20.04 NEON base.en 4 316.52 ms 1821.06 ms
RK3588 Ubuntu20.04 NEON small.en 4 618.93 ms 7117.69 ms
RK3588 Ubuntu20.04 NEON medium.en 4 1514.88 ms 24139.92 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 4 233.86 ms 791.01 ms
RK3588 Ubuntu20.04 NEON base 4 297.93 ms 1813.69 ms
RK3588 Ubuntu20.04 NEON small 4 592.18 ms 7102.28 ms
RK3588 Ubuntu20.04 NEON medium 4 1587.36 ms 24147.87 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 8 226.48 ms 740.34 ms
RK3588 Ubuntu20.04 NEON base 8 300.48 ms 1723.42 ms
RK3588 Ubuntu20.04 NEON small 8 620.58 ms 6392.47 ms
RK3588 Ubuntu20.04 NEON medium 8 1533.75 ms 21899.08 ms

I still haven't worked out the little (0-3) / big (4-7) core layout on this thing; these are the results if I pin to the big cores with taskset -c 4-7:

CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny.en 4 234.14 ms 681.53 ms
RK3588 Ubuntu20.04 NEON base.en 4 297.08 ms 1679.75 ms
RK3588 Ubuntu20.04 NEON small.en 4 599.98 ms 6867.66 ms
RK3588 Ubuntu20.04 NEON medium.en 4 1492.73 ms 23600.45 ms

I tried to compile with OpenBLAS, but it seemed to kill the make.


From the master repo, as I hadn't thought about which repo I was on after trying the streaming input:

CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 8 226.48 ms 2681.05 ms
RK3588 Ubuntu20.04 NEON base 8 283.56 ms 6132.44 ms
RK3588 Ubuntu20.04 NEON small 8 583.39 ms 24397.78 ms
RK3588 Ubuntu20.04 NEON medium 8 1490.98 ms 85099.45 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny.en 8 136.29 454.52
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 8 134.64 486.01
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 8 180.22 1184.80
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base.en 8 192.86 1197.85
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 8 367.55 4179.00
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small.en 8 378.27 4557.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 8 923.48 15552.61
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium.en 8 952.48 15708.63
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 large 8 1650.28 28357.09

8 threads seemed to be the fastest. However, I managed to squeeze out a bit more performance by pinning CPUs:

$ taskset -c 0-15 ./extra/bench-all.sh 16
CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 16 143.17 437.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 16 184.10 1061.14
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 16 374.41 3645.64
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 16 935.45 13029.54
matth commented

Results for AWS Graviton 3 Processor (c7g.4xlarge instance type).

Compiled with -march=native -ffast-math.

./extra/bench-all.sh 8

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 8 125.92 230.33
Graviton 3 Ubuntu 22.04 NEON base 8 160.17 547.88
Graviton 3 Ubuntu 22.04 NEON small 8 299.59 2138.86
Graviton 3 Ubuntu 22.04 NEON medium 8 741.49 6999.33
Graviton 3 Ubuntu 22.04 NEON large 8 1313.95 14174.00

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 121.92 158.61
Graviton 3 Ubuntu 22.04 NEON base 16 156.01 386.78
Graviton 3 Ubuntu 22.04 NEON small 16 299.85 1596.38
Graviton 3 Ubuntu 22.04 NEON medium 16 750.93 5351.24
Graviton 3 Ubuntu 22.04 NEON large 16 1313.82 11115.69

@matth Do you observe a significant performance difference with / without -march=native -ffast-math?

matth commented

@ggerganov -ffast-math seems to make only a very small difference that could be noise between runs

-march=native does seem to make a big difference; without it, FP16_VA is not reported as being enabled (I can get this with -march=armv8.4-a+bf16+fp16fml) - I think -march=native is enabling more intrinsics than this, though.
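
One way to see which of these features a given -march value actually turns on is to print the ACLE feature-test macros the compiler defines; a tiny illustrative check (the macro names are worth double-checking against the ACLE documentation):

// Compile with different -march= values and compare the output.
#include <stdio.h>

int main(void) {
#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    printf("FP16 vector arithmetic: yes\n");
#else
    printf("FP16 vector arithmetic: no\n");
#endif
#if defined(__ARM_FEATURE_FP16_FML)
    printf("FP16 FML (fp16fml):     yes\n");
#else
    printf("FP16 FML (fp16fml):     no\n");
#endif
#if defined(__ARM_FEATURE_BF16_VECTOR_ARITHMETIC)
    printf("BF16 vector arithmetic: yes\n");
#else
    printf("BF16 vector arithmetic: no\n");
#endif
    return 0;
}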

Results without any -march or -ffast-math flags ...

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 124.25 320.53
Graviton 3 Ubuntu 22.04 NEON base 16 156.91 734.22
Graviton 3 Ubuntu 22.04 NEON small 16 301.78 2812.75
Graviton 3 Ubuntu 22.04 NEON medium 16 714.23 9139.86
Graviton 3 Ubuntu 22.04 NEON large 16 1298.33 18147.47

I have tried to improve things by using OpenBLAS and armpl.h, but they both slow it down considerably - I'll keep trying with the latter.

Are there any possibilities for further optimisations in ggml.c that can take advantage of the situation where you have bf16 functions but not BLAS or Accelerate?

CPU OS Config Model Threads Load [ms] Encode [ms]
E5-2640 Ubuntu 18.04 AVX2 tiny 8 235.10 1094.45
E5-2640 Ubuntu 18.04 AVX2 base 8 326.11 2307.32
E5-2640 Ubuntu 18.04 AVX2 small 8 669.31 7706.24

@matth
My experiments with OpenBLAS on x86 showed that it is not faster compared to hand-written AVX2 + FP16:
fbd513b

It seems this is also the case for Arm based on your experiments. My guess is that we don't see improvement because the computation is memory-bound and OpenBLAS works with FP32.

The reason CBLAS is so fast on Apple Silicon is that it utilizes the matrix co-processor, which is somehow very efficient even for FP32. At least this is how I explain the results that I am seeing.

It would be interesting to see if armpl.h can provide some more insight - I haven't used it.

The heaviest stuff in ggml.c is the mul_mat_f16 and flash_attn_f16 calls. I think the conv_1d_... calls could probably be optimized more, but they are called only once at the start of the Encoder, so the improvement would be marginal.

Also, I am just looking at whisper.cpp and I realize I have forgotten why I use Flash Attention only in the Encoder and not also in the Decoder. Maybe this can help, because Flash Attention reduces the memory transfers and improves cache locality.

Not sure about bf16 compared to fp16. I don't expect it to provide a big improvement, based on a quick search through some articles about the difference between the 2 data types.
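
For readers wondering what the F16 path buys conceptually: the weights stay in half precision in memory (half the bandwidth for a memory-bound workload) while the accumulation is done in FP32. A scalar sketch of that idea (the real ggml kernels use NEON / AVX intrinsics and hardware FP16 where available; this only shows the concept):

// FP16-stored, FP32-accumulated dot product - a scalar sketch, not the ggml kernel.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Convert IEEE half precision to float (subnormals flushed to zero for brevity).
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant =  h & 0x3ff;
    uint32_t bits;
    if (exp == 0) {
        bits = sign;                                   // zero / subnormal -> zero
    } else if (exp == 31) {
        bits = sign | 0x7f800000u | (mant << 13);      // inf / NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

// Half the memory traffic of an FP32 dot product, same accumulation precision.
static float dot_f16(const uint16_t *x, const uint16_t *y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += fp16_to_fp32(x[i]) * fp16_to_fp32(y[i]);
    }
    return sum;
}

int main(void) {
    // 0x3C00 = 1.0, 0x4000 = 2.0, 0x4200 = 3.0 in half precision
    const uint16_t x[3] = { 0x3C00, 0x4000, 0x4200 };
    const uint16_t y[3] = { 0x4000, 0x4000, 0x3C00 };
    printf("dot = %f\n", dot_f16(x, y, 3));  // 1*2 + 2*2 + 3*1 = 9
    return 0;
}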

https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1

Gives a good write-up, if Medium doesn't try to charge you.

https://nod.ai/comparing-apple-m1-with-amx2-m1-with-neon/

Maybe after the M3 comes out I'll be able to pick up a bargain M1 Mini.

I think fp16 is coming though, and it may help a bit:

OpenMathLib/OpenBLAS#3754

PS: for those of us without the secret Apple sauce, would implementing https://github.com/CNugteren/CLBlast be of any use on integrated GPUs?

tamo commented

OpenBLAS helps on Windows AMD64 with MSVC:

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 5 PRO 2400GE Windows 10 AVX2 medium 4 4259.10 116609.75
Ryzen 5 PRO 2400GE Windows 10 AVX2 BLAS medium 4 4259.58 75312.90
CPU OS Config Model Threads Load [ms] Encode [ms]
rk3588 Debian11 NEON tiny 8 232.45 2768.78
rk3588 Debian11 NEON base 8 308.36 6374.82
rk3588 Debian11 NEON small 8 626.23 25784.05
rk3588 Debian11 NEON medium 8 1667.23 86026.82
rk3588 Debian11 NEON large 8 4307.16 161328.59

CFLAGS = -I. -O3 -std=c11 -ffast-math -march=native

CPU OS Config Model Threads Load [ms] Encode [ms]
rk3588 Debian11 NEON tiny 8 230.69 2078.40
rk3588 Debian11 NEON base 8 299.10 4379.62
rk3588 Debian11 NEON small 8 621.43 18565.42
rk3588 Debian11 NEON medium 8 1532.61 65504.91
rk3588 Debian11 NEON large 8 3618.18 121710.31

If I try to compile with OpenBLAS in a separate build, Encode becomes approximately 2x slower, so either I am doing something wrong or with Armv8.2 it's just bad; it's -march=native that seems to make the above difference.

matth commented

Results on AWS mac2.metal instance:

CPU OS Config Model Threads Load [ms] Encode [ms]
mac2.metal OSX Ventura NEON BLAS tiny 4 64.39 184.98
mac2.metal OSX Ventura NEON BLAS base 4 87.93 368.04
mac2.metal OSX Ventura NEON BLAS small 4 198.80 1212.46
mac2.metal OSX Ventura NEON BLAS medium 4 551.49 3552.73
mac2.metal OSX Ventura NEON BLAS large 4 1042.91 6726.99

I tried disabling Accelerate and it makes a significant difference (i.e. very much slower without it!).

I assumed Accelerate was using the Neural Engine, but using both powermetrics and asitop I cannot see any utilization; both report 0 mW power usage. Can anyone confirm on an M1 machine?

EDIT: Possibly I was confused. Apple's Matrix Coprocessor (AMX) and the Neural Engine are different things; from @ggerganov's other issues and commits it appears Accelerate might be using the former.

CPU OS Config Model Threads Load [ms] Encode [ms]
i9-13900k WSL2 Ubuntu AVX2 tiny 4 58.49 360.95
i9-13900k WSL2 Ubuntu AVX2 base 4 72.44 756.48
i9-13900k WSL2 Ubuntu AVX2 small 4 154.37 2676.12
i9-13900k WSL2 Ubuntu AVX2 medium 4 393.76 8924.90
i9-13900k WSL2 Ubuntu AVX2 large 4 698.69 15862.58
i9-13900k WSL2 Ubuntu AVX2 tiny 8 55.13 291.51
i9-13900k WSL2 Ubuntu AVX2 base 8 70.93 603.33
i9-13900k WSL2 Ubuntu AVX2 small 8 141.85 1800.05
i9-13900k WSL2 Ubuntu AVX2 medium 8 356.29 5946.78
i9-13900k WSL2 Ubuntu AVX2 large 8 658.83 10868.89
CPU OS Config Model Threads Load [ms] Encode [ms]
E5-2697 V2 MacOS Monterey 12.6.1 BLAS tiny 4 301.22 872.27
E5-2697 V2 MacOS Monterey 12.6.1 BLAS base 4 405.40 1705.58
E5-2697 V2 MacOS Monterey 12.6.1 BLAS small 4 921.24 5419.73
E5-2697 V2 MacOS Monterey 12.6.1 BLAS medium 4 2356.76 15188.90
E5-2697 V2 MacOS Monterey 12.6.1 BLAS large 4 4457.29 26444.06
E5-2697 V2 MacOS Monterey 12.6.1 BLAS tiny 8 299.89 540.47
E5-2697 V2 MacOS Monterey 12.6.1 BLAS base 8 419.41 1129.01
E5-2697 V2 MacOS Monterey 12.6.1 BLAS small 8 888.64 3632.89
E5-2697 V2 MacOS Monterey 12.6.1 BLAS medium 8 2377.96 10525.92
E5-2697 V2 MacOS Monterey 12.6.1 BLAS large 8 4412.20 18933.41

Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS tiny 4 307.20 570.86
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS base 4 406.45 1183.90
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS small 4 941.96 4156.69
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS medium 4 3124.62 13072.06
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS large 4 10090.85 36383.82
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS tiny 8 299.42 487.26
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS base 8 403.74 1113.54
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS small 8 910.07 3955.48
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS medium 8 2241.90 13076.31
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS large 8 5620.87 25562.17

Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (12)

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-8700 Ubuntu 20.04.4 LTS AVX2 tiny 4 158.49 730.72
i7-8700 Ubuntu 20.04.4 LTS AVX2 base 4 205.93 1603.67
i7-8700 Ubuntu 20.04.4 LTS AVX2 small 4 426.62 5630.58
i7-8700 Ubuntu 20.04.4 LTS AVX2 medium 4 1080.15 18748.66
i7-8700 Ubuntu 20.04.4 LTS AVX2 large 4 1976.77 37188.47
i7-8700 Ubuntu 20.04.4 LTS AVX2 tiny 8 159.00 662.07
i7-8700 Ubuntu 20.04.4 LTS AVX2 base 8 206.62 1436.59
i7-8700 Ubuntu 20.04.4 LTS AVX2 small 8 428.20 5345.27
i7-8700 Ubuntu 20.04.4 LTS AVX2 medium 8 1108.97 16780.53
i7-8700 Ubuntu 20.04.4 LTS AVX2 large 8 1965.67 32019.44
i7-8700 Ubuntu 20.04.4 LTS AVX2 tiny 12 157.60 585.65
i7-8700 Ubuntu 20.04.4 LTS AVX2 base 12 216.74 1696.32
i7-8700 Ubuntu 20.04.4 LTS AVX2 small 12 428.51 4504.18
i7-8700 Ubuntu 20.04.4 LTS AVX2 medium 12 1081.65 15442.25
i7-8700 Ubuntu 20.04.4 LTS AVX2 large 12 1969.63 28108.55

Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz (4)

CPU OS Config Model Threads Load [ms] Encode [ms]
i3-9100F Ubuntu 20.04.4 LTS AVX2 tiny 4 164.71 726.05
i3-9100F Ubuntu 20.04.4 LTS AVX2 base 4 214.56 1806.20
i3-9100F Ubuntu 20.04.4 LTS AVX2 small 4 445.48 6613.19
i3-9100F Ubuntu 20.04.4 LTS AVX2 medium 4 1131.80 22667.64
i3-9100F Ubuntu 20.04.4 LTS AVX2 large 4 7615.74 42137.29

Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz (4)

CPU OS Config Model Threads Load [ms] Encode [ms]
E3-1220 V2 Ubuntu 20.04.3 LTS tiny 4 227.41 1757.56
E3-1220 V2 Ubuntu 20.04.3 LTS base 4 297.67 3801.48
E3-1220 V2 Ubuntu 20.04.3 LTS small 4 625.18 14544.59
E3-1220 V2 Ubuntu 20.04.3 LTS medium 4 9618.55 49937.12
E3-1220 V2 Ubuntu 20.04.3 LTS large 4 40399.48 71661.48

Has anyone tried benchmarking on WASM? It seems like the encoder takes much longer than on other platforms.

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-tiny.en 4 258.59 2934.34
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-tiny 4 255.46 2906.67
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-base.en 4 316.73 6197.29
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-base 4 319.93 5825.65
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-tiny.en 4 217.28 1548.92
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-tiny 4 215.59 1625.69
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-base.en 4 275.62 3823.34
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-base 4 275.72 3740.50
Cortex-A53 Android 10 NEON ggml-tiny.en 8 399.05 5841.70
Cortex-A53 Android 10 NEON ggml-tiny 8 376.25 5548.72
Cortex-A53 Android 10 NEON ggml-base.en 8 492.92 12728.42
Cortex-A53 Android 10 NEON ggml-base 8 1034.48 13365.86

Test-bench properties

  • Benchmarking is done on commit 3996ecc156486fb93ff505c01090d13192e72aa2.
  • Used cmake for building (mkdir build && cd build && cmake .. && make).
  • Compiler for Xubuntu 18.04 is gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
  • Compiler for Android 10 is clang version 15.0.2 (aarch64-unknown-linux-android24)
  • Used the following fish shell snippet to run the benchmarks:
# cwd is whisper.cpp/build
# Adding `-t 8` to `bench` for aarch64
$ for model in "ggml-tiny" "ggml-base"
      for suffix in "en.bin" "bin"
          ./bin/bench -m "../models/$model.$suffix"
      end
  end

Remarks

  • On x86, enabling BLAS (-DWHISPER_SUPPORT_OPENBLAS=ON) degrades the performance!
archi commented

Quite the difference between the 2017 Intel i3 4C/4T and the 2019 Ryzen Zen+ 6C/12T. And not looking good for AVX2 on the old AMD Zen+. I must admit, all in all I really envy the M1 for having that accelerator.

gcc vs clang doesn't seem to make a difference, at least it's not distinguishable from noise.

i3-8100

This is my home server. Tested while it was doing home server things (load 0.7). I can see this machine acting as a "whisper server" in a 2C configuration.

CPU OS Config Model Threads Load [ms] Encode [ms] Commit Compiler
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 1 88.38 2013.67 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 base 1 113.58 4692.04 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 small 1 225.74 18469.62 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 2 89.55 1189.92 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 base 2 119.97 2756.52 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 small 2 238.71 10491.67 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 4 201.37 695.39 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 base 4 262.76 2023.16 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 small 4 526.66 6788.01 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 medium 4 3836.26 21889.30 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 large 4 26819.67 60880.62 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 4 89.05 696.08 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 base 4 114.65 1711.15 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 small 4 309.30 6995.25 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 medium 4 4854.02 23570.42 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 large 4 21415.07 60547.99 832b4f3 clang 14.0.6

Ryzen 1600AF

Just my desktop. The difference compared to the 5950X at 8C is really massive, but luckily it has no impact on daily usage, so I'm glad I can still hold off on upgrading to the last AM4 CPU generation 😂
Looking forward to benching CUDA on this machine (3080Ti).

CPU OS Config Model Threads Load [ms] Encode [ms] Commit Compiler
Ryzen 1600AF Manjaro AVX2 tiny 1 104.04 4691.38 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 base 1 134.54 11092.84 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 small 1 254.71 43923.42 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 tiny 4 107.40 1336.49 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 base 4 132.69 3062.12 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 small 4 262.27 11655.22 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 medium 4 662.81 38829.74 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 large 4 1365.09 77063.30 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 tiny 6 100.82 1007.36 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 base 6 130.20 2472.55 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 small 6 256.83 9311.54 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 medium 6 657.89 28051.40 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 large 6 1190.62 54292.72 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 tiny 6 104.77 1012.70 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 base 6 137.00 2212.20 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 small 6 257.97 9296.33 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 medium 6 624.04 28524.38 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 large 6 1189.10 56445.31 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 tiny 12 101.41 898.96 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 base 12 139.26 2200.78 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 small 12 256.50 8125.48 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 medium 12 623.59 29255.08 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 large 12 1192.90 51902.81 832b4f3 gcc 12.2.0
CPU OS Config Model Threads Load [ms] Encode [ms] Commit Compiler
POWER9v2 Gentoo -Ofast -mcpu=native base.en 4/64 144.84 42708.33 85c9ac1 clang 15.0.3
POWER9v2 Gentoo -Ofast -mcpu=native base.en 16/64 161.95 22302.28 85c9ac1 clang 15.0.3
POWER9v2 Gentoo -Ofast -mcpu=native base.en 32/64 142.06 20263.56 85c9ac1 clang 15.0.3
POWER9v2 Gentoo -Ofast -mcpu=native base.en 64/64 160.51 12645.79 85c9ac1 clang 15.0.3

@Xavier-i
WASM performance is much worse compared to native - this is expected.
Today I added the bench.wasm that can be used to benchmark performance in the browser.

Link: https://whisper.ggerganov.com/bench/

j1nx commented

Redo of my OpenVoiceOS Raspberry Pi 4 benchmark

CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny.en 4 735 9486 aa6adda
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base.en 4 950 25402 aa6adda
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny.en 4 752 9178 aa6adda
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base.en 4 969 19642 aa6adda

And just (and only) because we can, the same on a Raspberry Pi 3B+ running the same codebase / OS

CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON tiny.en 4 1331 22573 aa6adda
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON base.en 4 5886 58733 aa6adda
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON BLAS tiny.en 4 1333 21184 aa6adda
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON BLAS base.en 4 4605 47877 aa6adda
matth commented

I hope this isn't misplaced but I thought it interesting to share ...

I have recently finished some tests comparing whisper.cpp runtime performance against the original PyTorch version on various GPUs and CPUs.

We test against a fixed set of long form audio files (UK TV, each file ~1 hour long, mixed speech and noise) and record the runtime as a factor of real audio time.

Depending on the software and environment, transcription can take anywhere from around 5x real-time down to 0.14x real-time.

The ARM-based whisper.cpp runtime is very impressive; in particular, the Apple M1 performance can match that of the original PyTorch version on NVIDIA V100 and T4 GPUs ...

CPU / GPU OS Config Model Threads xRT Transcribe
Intel Xeon Ubuntu 22.04 whisper original - pytorch cpu medium.en 8 4.78
Intel Xeon Ubuntu 22.04 whisper.cpp - AVX2 medium.en 8 4.44
Graviton 3 Ubuntu 22.04 whisper.cpp - NEON medium.en 8 0.63
mac2.metal OSX Ventura whisper.cpp - NEON BLAS medium.en 4 0.26
NVIDIA V100 Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.25
NVIDIA T4 Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.25
NVIDIA A10G Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.16
NVIDIA A100 Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.14

Additionally I did some very rough power consumption tests, again whisper.cpp on the M1 is really impressive against PyTorch on the GPU.

Platform Whisper Type Model Avg Power Peak Power
Apple M1 whisper.cpp ggml-medium.en 13202 mW 18412 mW
Nvidia T4 pytorch medium.en 69587 mW 85650 mW

Thanks for the fantastic work @ggerganov - this is a really inspiring project and demonstrates the ARM FP16 functionality wonderfully. Off to buy some more Apple Macs now ;)

@matth @rgerganov I've been thinking myself that the perf/watt for ML here is truly outstanding, and just wondered whether the 8GB machine can squeeze in the medium model, as I'm not sure how memory is shared on the M1 - or is it really a case of needing the 16GB?

@matth
Thanks for the data - it's interesting to see.

However, there are some important caveats to consider when benchmarking the 2 implementations that I've been meaning to discuss, so here are my thoughts on this:

At a high-level, the Whisper transcription is a combination of 2 main parts:

  • transformer model evaluation
  • decoding strategy

The first part is branchless and does not depend on the audio input or the parameters that you use. For a given model, evaluating the transformer requires the same amount of operations every time. This is easy to benchmark.

The second part (decoding strategy) is different. The number of operations here depends both on the audio input contents and the decoding parameters / strategy that you use. For example, two different audio recordings with the same time length generally result in different decoded text based on the speech content and hence can take a different amount of processing (even with the same decoding parameters). Also, the decoded timestamp tokens affect how the 30s sliding window of the transcription is updated and therefore can lead to a different number of transformer evaluations in total.

My understanding is that there is no "correct" decoding strategy. The OpenAI implementation generally offers 2 different strategies - Greedy and BeamSearch. Both of them are combinations of various heuristics that aim to improve the text coherency and reduce the number of catastrophic failures.

In whisper.cpp we currently have a Greedy strategy which is similar to the one in the OpenAI repo, but is not exactly the same.

So all of this means that there is no point in comparing the 2 implementations by measuring the total time to transcribe an audio, because the decoding strategy is not the same and therefore the variation will be very large due to the factors outlined above. It only makes sense to benchmark the transformer evaluation in isolation, because it is well-defined.

That is why in the benchmarks in this issue, I chose to run the Encoder on some random input buffer. The Encoder is the heavy part of the transformer and being able to evaluate it efficiently is very important and is the most defining factor for the efficiency of the implementation. It's the "engine" of the transcription. You can then put on top of it any decoding strategy that you like and this will define how accurate your transcription is. But it does not make sense to benchmark the performance of that anymore.

I think if we want to make a fair comparison with PyTorch, we need to have the bench tool implemented in python using PyTorch. Any other comparison will be flawed to some extent.

But in any case, your results are interesting - thanks for sharing them.
What parameters did you use for the PyTorch runs?


Regarding the power consumption - I think there is more we can do in whisper.cpp. Currently, the thread synchronization uses busy loops, which is very power-inefficient because it keeps the CPU at 100%, but it gives a slight performance edge. I am thinking of adding an option that uses condition-variable synchronization, which will likely reduce the power usage at the cost of some performance. For some use cases, it could be beneficial to have lower power consumption.
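
To illustrate the trade-off being described, here is a generic sketch (not ggml's actual thread pool) of the two wait styles: spinning on an atomic flag wakes up fastest but keeps the core at 100%, while blocking on a condition variable lets the core idle at the cost of a slightly slower wake-up.

// Two ways for a worker to wait for new work - illustrative only.
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool     has_work;     // used by the busy-wait variant
    bool            has_work_cv;  // used by the condvar variant (guarded by mutex)
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
} work_signal;

// Busy-wait: minimal wake-up latency, CPU pinned at 100% between tasks.
static void wait_busy(work_signal *s) {
    while (!atomic_load_explicit(&s->has_work, memory_order_acquire)) {
        // spin (optionally with a pause/yield hint)
    }
    atomic_store_explicit(&s->has_work, false, memory_order_relaxed);
}

// Condition variable: the thread sleeps in the kernel until signalled,
// saving power at the cost of a somewhat slower wake-up.
static void wait_condvar(work_signal *s) {
    pthread_mutex_lock(&s->mutex);
    while (!s->has_work_cv) {
        pthread_cond_wait(&s->cond, &s->mutex);
    }
    s->has_work_cv = false;
    pthread_mutex_unlock(&s->mutex);
}

static void signal_condvar(work_signal *s) {
    pthread_mutex_lock(&s->mutex);
    s->has_work_cv = true;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->mutex);
}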

matth commented

Thanks @ggerganov , we are using PyTorch whisper with default settings in that benchmark so I believe that is a beam search decoder. I will see if I can test again with the greedy decoder for a more similar comparison. I think I understand your point though - these are not like for like implementations so at a certain level the comparison is flawed.

I also neglected to measure the PyTorch version on the M1 & Graviton which was a huge oversight!

There's a motivation behind these benchmarks. Looking at various solutions as improvements to existing transcription capabilities - each solution in my mind is a balance of accuracy, completeness, runtime, financial cost and energy efficiency.

On one end you have paying humans to do the transcription, slow and expensive but very accurate and something that is still done at a massive scale in my industry. At the other end there are existing Kaldi models that are less accurate but incredibly fast for inference on the CPU and very cheap to run.

I feel larger transformer models like Whisper sit somewhat in the middle of all this - closer to human accuracy but increased associated costs over existing software.

But whisper.cpp adds to this: if we can get similar, or even just acceptable, accuracy and runtime on commodity hardware, the choice starts to become more about cost, efficiency and functionality. E.g. you could buy 30+ Apple Macs for the price of an NVIDIA A100 server, being able to run Whisper on a laptop enables a different set of use cases, you can cut power consumption by a huge margin, etc.

I think for me this is one of the many exciting outcomes of this project :)

@matth
Yeah - the default in PyTorch when running from the command line is BeamSearch.
I haven't measured it exactly, but it is significantly slower compared to Greedy.

I think regarding the total-time benchmark - it can make sense once whisper.cpp reaches the accuracy of OpenAI. Currently, due to the inferior decoding, whisper.cpp has lower transcription accuracy (based on some results I saw floating around). But when the decoding gets improved and we have comparable accuracy, then we can make a benchmark that says:

"for a given word error rate (WER) the 2 implementation take this amount of processing time on average, over some large set of audio"

And another thing I was thinking is that even if today whisper.cpp is more efficient on Apple Macs - it is not going to be always the case. If I understand correctly, it's just a matter of time for the proper Apple Silicon frameworks (Metal, MPS, etc.) to become supported in PyTorch, Tensorflow, etc and when this happens (probably very soon), the performance of whisper.cpp will be the same or possibly worse.

So yeah - just trying to adjust expectations :) Will probably write some more on this in the F.A.Q. discussion.

CPU OS Config Model Th Load Enc. Commit
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS tiny.en 4 175 360 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS base.en 4 233 736 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS small.en 4 507 2400 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS medium.en 4 1333 6860 7282e21

Using 8 threads is slightly slower to load, faster to encode:

CPU OS Config Model Th Load Enc. Commit
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS tiny.en 8 185 283 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS base.en 8 241 579 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS small.en 8 526 1959 7282e21
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS medium.en 8 1390 6271 7282e21
mgc8 commented
CPU OS Config Model Th Load Enc. Commit
MacBookPro M1 Max macOS 12.6 NEON BLAS tiny 8 65 108 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS base 8 86 250 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS small 8 185 789 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS medium 8 493 2126 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS large 8 955 3860 a593b93

There are actually 10 threads, but when using -t 10 the performance goes down. Lower numbers (such as -t 4) result in similar load performance, but slower encode (although not linear).

kha84 commented

AMD Ryzen 5 3400G (4 CPU cores, 8 threads) on Ubuntu 22.10 with 5.19.0-26-generic Kernel

4 threads

CPU OS Config Model Th Load Enc. Commit
3400G Ubuntu 22.10 AVX2 tiny 4 163 1415 0be6a1a
3400G Ubuntu 22.10 AVX2 tiny.en 4 175 1351 0be6a1a
3400G Ubuntu 22.10 AVX2 base.en 4 200 3095 0be6a1a
3400G Ubuntu 22.10 AVX2 base 4 205 3241 0be6a1a
3400G Ubuntu 22.10 AVX2 small.en 4 412 12343 0be6a1a
3400G Ubuntu 22.10 AVX2 small 4 421 11983 0be6a1a
3400G Ubuntu 22.10 AVX2 medium.en 4 995 38818 0be6a1a
3400G Ubuntu 22.10 AVX2 medium 4 1006 38573 0be6a1a
3400G Ubuntu 22.10 AVX2 large-v1 4 0be6a1a
3400G Ubuntu 22.10 AVX2 large 4 1870 77302 0be6a1a

8 threads is just marginally better

CPU OS Config Model Th Load Enc. Commit
3400G Ubuntu 22.10 AVX2 tiny.en 8 191 1275 0be6a1a
3400G Ubuntu 22.10 AVX2 tiny 8 183 1258 0be6a1a
3400G Ubuntu 22.10 AVX2 base.en 8 232 2894 0be6a1a
3400G Ubuntu 22.10 AVX2 base 8 231 2927 0be6a1a
3400G Ubuntu 22.10 AVX2 small.en 8 435 11299 0be6a1a
3400G Ubuntu 22.10 AVX2 small 8 414 11511 0be6a1a
3400G Ubuntu 22.10 AVX2 medium.en 8 1011 37557 0be6a1a
3400G Ubuntu 22.10 AVX2 medium 8 1049 37306 0be6a1a
3400G Ubuntu 22.10 AVX2 large-v1 8 0be6a1a
3400G Ubuntu 22.10 AVX2 large 8 3237 77396 0be6a1a

Someone mentioned BLAS?

bmilde commented

What's the performance gain of this against the original implementation, with PyTorch compiled with AVX support or the PyTorch M1 backend?

Does this implementation use beam decoding? (original pytorch impl has n=5 as default and is 100% faster with n=1)

Edit: README already mentions it's greedy decoding:

Very basic greedy sampling scheme - always pick up the token with highest probability. This should be similar to the GreedyDecoder from the original python implementation, so in order to make a fair comparison between the 2 implementations, make sure to run the python code with the following parameters:

whisper --best_of None --beam_size None ...

Greedy decoding is also 2x faster in the original implementation (on a GPU).

Orange Pi 5 4GB, Micro-SD not NVMe

It starts to touch zram swap on medium and then hits file swap pretty hard on large.

CPU OS Config Model Th Load Enc. Commit
rk3588s Bullseye 5.10.110 NEON tiny 8 352 2876 0be6a1a
rk3588s Bullseye 5.10.110 NEON base 8 346 6213 0be6a1a
rk3588s Bullseye 5.10.110 NEON small 8 690 25808 0be6a1a
rk3588s Bullseye 5.10.110 NEON medium 8 23987 93995 0be6a1a
rk3588s Bullseye 5.10.110 NEON large 8 49633 190601 0be6a1a

Even with the 4:4 big:little layout, it's a touch faster with taskset -c 4-7 ./extra/bench-all.sh:

CPU OS Config Model Th Load Enc. Commit
rk3588s Bullseye 5.10.110 NEON tiny 4 356 2716 0be6a1a
rk3588s Bullseye 5.10.110 NEON base 4 417 6661 0be6a1a
rk3588s Bullseye 5.10.110 NEON small 4 943 25357 0be6a1a
rk3588s Bullseye 5.10.110 NEON medium 4 17748 90187 0be6a1a
rk3588s Bullseye 5.10.110 NEON large 4 48793 182800 0be6a1a

Compiling on the rk3588 with -march=native -ffast-math seems to give a big boost (taskset -c 4-7 ./extra/bench-all.sh):

CPU OS Config Model Th Load Enc. Commit
rk3588s Bullseye 5.10.110 NEON tiny 4 280 1074 0be6a1a
rk3588s Bullseye 5.10.110 NEON base 4 466 3491 0be6a1a
rk3588s Bullseye 5.10.110 NEON small 4 780 11052 0be6a1a
rk3588s Bullseye 5.10.110 NEON medium 4 15361 42252 0be6a1a
rk3588s Bullseye 5.10.110 NEON large 4 49331 91892 0be6a1a

Intel Celeron N4120 (4 cores, 4 threads) on Artix Linux 6.0.12-artix1-1.

CPU OS Config Model Th Load Enc. Commit
N4120 Artix 6.0.12-artix1-1 BLAS tiny 4 330 12272 65fdcbb
N4120 Artix 6.0.12-artix1-1 BLAS base 4 65fdcbb
N4120 Artix 6.0.12-artix1-1 BLAS small 4 892 83209 65fdcbb
N4120 Artix 6.0.12-artix1-1 BLAS medium 4 5478 237677 65fdcbb

Base 14-inch M1 MacBook Pro with NEON enabled:

CPU OS Config RAM (GB) Th Model Load (ms) Enc. (ms) Total
M1 Pro OSX 12.5.1 NEON 16 8 Tiny.en 107 269.72 376.91
M1 Pro OSX 12.5.1 NEON 16 8 Base.en 92 321 413.77
M1 Pro OSX 12.5.1 NEON 16 8 Small.en 264 978 1243.24

16-inch base Apple M2 Pro results

CPU OS Config RAM (GB) Th Model Load (ms) Enc. (ms) Total (ms)
M2 Pro OSX 13.2 NEON 16 8 Tiny.en 118 143 261
M2 Pro OSX 13.2 NEON 16 8 Tiny 118 143 261
M2 Pro OSX 13.2 NEON 16 8 Base.en 173 235 408
M2 Pro OSX 13.2 NEON 16 8 Base 148 266 414
M2 Pro OSX 13.2 NEON 16 8 Small.en 304 739 1042
M2 Pro OSX 13.2 NEON 16 8 Small 277(?) 720 997
M2 Pro OSX 13.2 NEON 16 8 Medium.en 747 2057 2804
M2 Pro OSX 13.2 NEON 16 8 Medium 657 2055 2712
M2 Pro OSX 13.2 NEON 16 8 Large 2126 4223 6349

I couldn't get bench to run on my iPhone 12, so I have attached my ad-hoc results below with the input audio "I love transcriber apps":

CPU DGGML_USE_ACCELERATE OS Model Load Mel Sample Enc. Dec. Total (ms)
A14 Release IOS 16.1 Base.en 150 23 2 2447 112 2584

--

This might appear obvious to some, but it wasn't to me, so I'll note it here: I saw much better results using large step lengths and sample sizes with ./stream. I feel like, under the hood, Whisper relies heavily on whole-sentence context to infer individual words.

j1nx commented

With the new beta 1.1.0 release. At first glance, not too much difference. I will not rebuild without OpenBLAS, as it was slightly better with it on the RPi 4.

CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 751 9506 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny.en 4 748 9295 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 971 23512 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base.en 4 958 24263 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS small 4 2238 84720 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS small.en 4 3880 86031 ecda7f786a

Results on 12th Gen Intel(R) Core(TM) i3-12300T:

CPU OS Config Model Th Load Enc. Commit
Core i3-12300T Debian 11 (Docker on Win11) AVX2 tiny.en 4 97 679 49b529b
Core i3-12300T Debian 11 (Docker on Win11) AVX2 tiny 4 90 580 49b529b
Core i3-12300T Debian 11 (Docker on Win11) AVX2 base 4 138 1478 49b529b

With OpenBLAS (considerably worse):

CPU OS Config Model Th Load Enc. Commit
Core i3-12300T Debian 11 (Docker on Win11) AVX2 BLAS tiny 4 117 1644 49b529b
Core i3-12300T Debian 11 (Docker on Win11) AVX2 BLAS base 4 122 2890 49b529b
johtso commented

The benchmarks for the MacBook Pro M1 are using 8 threads, but in my experience it runs nearly twice as fast with 4 threads. Am I missing something?

Edit:
I just ran the benchmark with the large model... and it actually made almost no difference whether 8 or 4 threads were used. But with real-world workloads it makes a huge difference. Interesting.

Running memcpy benchmark with 1 thread
memcpy: 8.66 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat:    64 x    64: F16      4.2 GFLOPS (128 runs) / F32      3.5 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     10.1 GFLOPS (128 runs) / F32      6.3 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     13.0 GFLOPS (128 runs) / F32      7.2 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     14.0 GFLOPS ( 53 runs) / F32      7.1 GFLOPS ( 27 runs)
ggml_mul_mat:  1024 x  1024: F16     29.8 GFLOPS ( 15 runs) / F32     17.8 GFLOPS (  9 runs)
ggml_mul_mat:  2048 x  2048: F16     37.8 GFLOPS (  3 runs) / F32     19.6 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     40.0 GFLOPS (  3 runs) / F32     17.4 GFLOPS (  3 runs)

Running benchmark for all models

CPU OS Config Model Th Load Enc. Commit
rk3588s Ubuntu 22.04 NEON tiny 4 257 1179 21c569b
rk3588s Ubuntu 22.04 NEON base 4 326 2967 21c569b
rk3588s Ubuntu 22.04 NEON small 4 661 10560 21c569b
rk3588s Ubuntu 22.04 NEON medium 4 23188 35867 21c569b
mscdex commented

Compiler: gcc version 12.2.0 (Ubuntu 12.2.0-3ubuntu1)

memcpy: 16.74 GB/s
sum:    error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16     16.2 GFLOPS (128 runs) / F32     16.4 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     70.1 GFLOPS (128 runs) / F32     66.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    133.9 GFLOPS (128 runs) / F32    105.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    161.2 GFLOPS (128 runs) / F32    109.3 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    204.4 GFLOPS ( 96 runs) / F32    121.9 GFLOPS ( 57 runs)
ggml_mul_mat:  2048 x  2048: F16    254.4 GFLOPS ( 15 runs) / F32    149.3 GFLOPS (  9 runs)
ggml_mul_mat:  4096 x  4096: F16    184.2 GFLOPS (  3 runs) / F32     54.1 GFLOPS (  3 runs)

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat:    64 x    64: F16      8.4 GFLOPS (128 runs) / F32      9.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     58.1 GFLOPS (128 runs) / F32     57.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    170.3 GFLOPS (128 runs) / F32    159.9 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    315.7 GFLOPS (128 runs) / F32    230.8 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    356.0 GFLOPS (128 runs) / F32    224.9 GFLOPS (105 runs)
ggml_mul_mat:  2048 x  2048: F16    499.5 GFLOPS ( 30 runs) / F32    292.4 GFLOPS ( 18 runs)
ggml_mul_mat:  4096 x  4096: F16    265.9 GFLOPS (  3 runs) / F32     66.2 GFLOPS (  3 runs)

Running ggml_mul_mat benchmark with 16 threads

ggml_mul_mat:    64 x    64: F16      3.6 GFLOPS (128 runs) / F32      3.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     16.7 GFLOPS (128 runs) / F32     27.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     88.1 GFLOPS (128 runs) / F32    126.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    263.5 GFLOPS (128 runs) / F32    229.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    396.1 GFLOPS (128 runs) / F32    272.8 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    498.6 GFLOPS ( 30 runs) / F32    314.9 GFLOPS ( 19 runs)
ggml_mul_mat:  4096 x  4096: F16    337.7 GFLOPS (  3 runs) / F32    112.0 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 tiny.en 4 104 247 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 base.en 4 130 585 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 small.en 4 264 1940 78f1661
--- -- ------ ----- -- ---- ---- ------
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 tiny.en 8 99 166 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 base.en 8 123 329 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 small.en 8 262 1148 78f1661
--- -- ------ ----- -- ---- ---- ------
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 tiny.en 16 100 160 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 base.en 16 123 338 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 small.en 16 262 1139 78f1661

Tested on my M2 Macbook Air:

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

./extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 31.42 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 11.8 GFLOPS (128 runs) / F32 10.6 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 89.9 GFLOPS (128 runs) / F32 74.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 434.5 GFLOPS (128 runs) / F32 419.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 885.4 GFLOPS (128 runs) / F32 913.2 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1023.4 GFLOPS (128 runs) / F32 1037.7 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 971.6 GFLOPS ( 57 runs) / F32 950.1 GFLOPS ( 56 runs)
ggml_mul_mat: 4096 x 4096: F16 914.9 GFLOPS ( 7 runs) / F32 820.7 GFLOPS ( 6 runs)

CPU OS Config Model Th Load Enc. Commit
M2 OSX 13.0.1 NEON BLAS tiny 4 63 153 1a91c19
M2 OSX 13.0.1 NEON BLAS base 4 92 329 1a91c19
M2 OSX 13.0.1 NEON BLAS small 4 198 1014 1a91c19
M2 OSX 13.0.1 NEON BLAS medium 4 564 3042 1a91c19
M2 OSX 13.0.1 NEON BLAS large 4 1152 5466 1a91c19

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat: 64 x 64: F16 5.7 GFLOPS (128 runs) / F32 3.9 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 45.0 GFLOPS (128 runs) / F32 25.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 272.7 GFLOPS (128 runs) / F32 166.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 747.6 GFLOPS (128 runs) / F32 748.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 998.7 GFLOPS (128 runs) / F32 895.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 716.0 GFLOPS ( 42 runs) / F32 717.2 GFLOPS ( 42 runs)
ggml_mul_mat: 4096 x 4096: F16 790.4 GFLOPS ( 6 runs) / F32 726.3 GFLOPS ( 6 runs)

CPU OS Config Model Th Load Enc. Commit
M2 OSX 13.0.1 NEON BLAS tiny 8 66 154 1a91c19
M2 OSX 13.0.1 NEON BLAS base 8 92 346 1a91c19
M2 OSX 13.0.1 NEON BLAS small 8 211 1171 1a91c19
M2 OSX 13.0.1 NEON BLAS medium 8 562 3848 1a91c19
M2 OSX 13.0.1 NEON BLAS large 8 1079 6230 1a91c19

This is bench result :

whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 500.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 1245.39 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 88596.32 ms / 1 runs (88596.32 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 89841.85 ms

This is cpuinfo :

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
microcode : 0x2f
cpu MHz : 2990.383
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 4983.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
microcode : 0x2f
cpu MHz : 2990.384
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 4983.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
microcode : 0x2f
cpu MHz : 2990.384
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 4983.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
microcode : 0x2f
cpu MHz : 2990.384
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 4983.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

./bench -w 1 -t 1

memcpy: 3.35 GB/s
sum: error -536870997.000000
./bench -w 2 -t 1

ggml_mul_mat: 64 x 64: F16 0.7 GFLOPS (128 runs) / F32 3.3 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 0.7 GFLOPS (128 runs) / F32 3.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 0.6 GFLOPS ( 18 runs) / F32 3.3 GFLOPS ( 99 runs)
ggml_mul_mat: 512 x 512: F16 0.6 GFLOPS ( 3 runs) / F32 3.6 GFLOPS ( 14 runs)
ggml_mul_mat: 1024 x 1024: F16 0.7 GFLOPS ( 3 runs) / F32 2.3 GFLOPS ( 3 runs)
ggml_mul_mat: 2048 x 2048: F16 0.7 GFLOPS ( 3 runs) / F32 2.4 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 1.2 GFLOPS ( 3 runs) / F32 3.0 GFLOPS ( 3 runs)

ThinkPad T520, on Linux Mint Debian Edition, with AVX1 commented out in the Makefile

Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 38.84 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 9.8 GFLOPS (128 runs) / F32 8.4 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 69.4 GFLOPS (128 runs) / F32 62.1 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 455.3 GFLOPS (128 runs) / F32 383.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1141.1 GFLOPS (128 runs) / F32 1550.2 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 2302.0 GFLOPS (128 runs) / F32 2962.9 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 3035.6 GFLOPS (128 runs) / F32 3217.5 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16 3431.7 GFLOPS ( 25 runs) / F32 3510.6 GFLOPS ( 26 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
M1 Ultra 13.2 NEON BLAS tiny 4 71 139 2bee265
M1 Ultra 13.2 NEON BLAS base 4 95 266 2bee265
M1 Ultra 13.2 NEON BLAS small 4 222 806 2bee265
M1 Ultra 13.2 NEON BLAS medium 4 598 2175 2bee265
M1 Ultra 13.2 NEON BLAS large 4 1165 3895 2bee265

Here are new results for POWER9, now that #300 is closed.

Running memcpy benchmark with 1 thread

memcpy: 6.32 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 32 threads

ggml_mul_mat:    64 x    64: F16      0.4 GFLOPS (128 runs) / F32      0.4 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16      2.8 GFLOPS (128 runs) / F32      2.8 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     13.4 GFLOPS (128 runs) / F32     23.0 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     32.9 GFLOPS (123 runs) / F32     87.9 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     47.9 GFLOPS ( 23 runs) / F32    127.4 GFLOPS ( 60 runs)
ggml_mul_mat:  2048 x  2048: F16     58.5 GFLOPS (  4 runs) / F32     67.3 GFLOPS (  4 runs)
ggml_mul_mat:  4096 x  4096: F16     23.8 GFLOPS (  3 runs) / F32     21.2 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
CPU OS Config Model Th Load Enc. Commit Compiler
POWER9 Debian 11 tiny 32 75 1283 3b010f9 GCC 10.2.1
POWER9 Debian 11 base 32 96 2786 3b010f9 GCC 10.2.1
POWER9 Debian 11 small 32 182 8534 3b010f9 GCC 10.2.1
POWER9 Debian 11 medium 32 463 22282 3b010f9 GCC 10.2.1
POWER9 Debian 11 large 32 838 41106 3b010f9 GCC 10.2.1

I got referred here from openai/whisper#978 (comment)
This seems really interesting.

I'm running on Oracle Cloud's free tier, which provides 4x Ampere A1 CPUs and 24 GB RAM.


Compiler:

I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

Default (no changes)

~/whisper.cpp$ extra/bench-all.sh
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 10.92 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      1.0 GFLOPS (128 runs) / F32      0.7 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     16.8 GFLOPS (128 runs) / F32     13.2 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     18.5 GFLOPS (128 runs) / F32     41.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     21.5 GFLOPS ( 81 runs) / F32     35.4 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     23.2 GFLOPS ( 11 runs) / F32     41.4 GFLOPS ( 20 runs)
ggml_mul_mat:  2048 x  2048: F16     23.4 GFLOPS (  3 runs) / F32     32.6 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     22.5 GFLOPS (  3 runs) / F32     21.4 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 83 1832 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 120 4767 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 273 17529 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 739 59794 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1436 115771 ca21f7a

With changes mentioned in openai/whisper#978 (comment)
Thanks again @jan-grzybek-ampere

~/whisper.cpp$ extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 10.88 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      2.0 GFLOPS (128 runs) / F32      1.7 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     14.3 GFLOPS (128 runs) / F32     33.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     40.7 GFLOPS (128 runs) / F32     54.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     97.5 GFLOPS (128 runs) / F32     31.4 GFLOPS (117 runs)
ggml_mul_mat:  1024 x  1024: F16     87.1 GFLOPS ( 41 runs) / F32     41.0 GFLOPS ( 20 runs)
ggml_mul_mat:  2048 x  2048: F16     74.3 GFLOPS (  5 runs) / F32     33.4 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     50.4 GFLOPS (  3 runs) / F32     21.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 84 619 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 124 2036 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 293 5872 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 817 22064 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1446 37996 ca21f7a

I've done a bit of reading and run several more tests.

According to https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu, the recommendation is to use -mcpu=native, and I did indeed get the best performance with it.
I will put in a pull request to use -mcpu=native for aarch64.
No significant difference between GCC 11.3 and GCC 12.1 on Ubuntu 22.04.
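For illustration only, the two flag styles being compared, as plain compiler invocations (a sketch, not the actual Makefile change):

gcc -O3 -march=armv8.2-a+fp16 -c ggml.c -o ggml.o   # explicit architecture + fp16 extension
gcc -O3 -mcpu=native          -c ggml.c -o ggml.o   # let GCC pick arch and tuning from the host CPU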


-march=armv8.2-a+fp16, gcc-11.3

Performance seems slightly worse compared to yesterday's test in #89 (comment)
I re-ran all of the following tests one after another to hopefully obtain comparable figures.
This is a free instance on Oracle Cloud and perhaps others are using the other cores on the CPU.

make clean
make main bench
./extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 10.82 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      1.8 GFLOPS (128 runs) / F32      2.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     40.7 GFLOPS (128 runs) / F32     12.7 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     52.9 GFLOPS (128 runs) / F32     32.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     97.3 GFLOPS (128 runs) / F32     32.1 GFLOPS (120 runs)
ggml_mul_mat:  1024 x  1024: F16     77.0 GFLOPS ( 36 runs) / F32     35.1 GFLOPS ( 17 runs)
ggml_mul_mat:  2048 x  2048: F16     64.0 GFLOPS (  4 runs) / F32     25.9 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     45.8 GFLOPS (  3 runs) / F32     21.0 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 85 662 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 121 2039 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 281 6667 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 760 25355 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1456 45563 ca21f7a

-mcpu=native, gcc-11.3

make clean
make main bench
./extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 10.85 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      7.9 GFLOPS (128 runs) / F32      1.8 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16      7.5 GFLOPS (128 runs) / F32     12.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     51.8 GFLOPS (128 runs) / F32     54.4 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     96.3 GFLOPS (128 runs) / F32     31.2 GFLOPS (117 runs)
ggml_mul_mat:  1024 x  1024: F16     74.1 GFLOPS ( 35 runs) / F32     33.5 GFLOPS ( 16 runs)
ggml_mul_mat:  2048 x  2048: F16     67.1 GFLOPS (  4 runs) / F32     27.0 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     49.3 GFLOPS (  3 runs) / F32     21.7 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 85 655 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 121 2002 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 283 6923 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 762 24085 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1459 43846 ca21f7a

-mcpu=native, gcc-12.1

make clean
make CC=gcc-12 CXX=g++-12 main bench
./extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 11.01 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      8.0 GFLOPS (128 runs) / F32      8.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     12.0 GFLOPS (128 runs) / F32     12.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     55.7 GFLOPS (128 runs) / F32     41.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     95.1 GFLOPS (128 runs) / F32     30.2 GFLOPS (113 runs)
ggml_mul_mat:  1024 x  1024: F16     67.1 GFLOPS ( 32 runs) / F32     33.0 GFLOPS ( 16 runs)
ggml_mul_mat:  2048 x  2048: F16     64.2 GFLOPS (  4 runs) / F32     26.8 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     46.1 GFLOPS (  3 runs) / F32     21.4 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 84 613 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 122 2086 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 286 6375 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 761 24667 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1457 43826 ca21f7a
  • CPU model: AMD Ryzen 9 7950X
  • Operating system: Windows 10 Pro N 22H2
  • Compiler: Windows x64 release v1.2.1

whisper-bin-x64

>bench.exe
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   109.45 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   919.30 ms /     1 runs (  919.30 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1032.75 ms
>bench -w 1 -t 1
memcpy: 24.58 GB/s
sum:    error -536870819.000000
>bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     22.7 GFLOPS (128 runs) / F32     38.7 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     34.6 GFLOPS (128 runs) / F32     45.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     44.2 GFLOPS (128 runs) / F32     54.5 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     50.5 GFLOPS (128 runs) / F32     55.3 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     53.2 GFLOPS ( 25 runs) / F32     65.7 GFLOPS ( 31 runs)
ggml_mul_mat:  2048 x  2048: F16     54.9 GFLOPS (  4 runs) / F32     61.8 GFLOPS (  4 runs)
ggml_mul_mat:  4096 x  4096: F16     50.7 GFLOPS (  3 runs) / F32     19.9 GFLOPS (  3 runs)

That last result is lower than the 5950X above, which is odd. OpenBLAS results below:

whisper-blas-bin-x64

>bench
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   101.76 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   602.63 ms /     1 runs (  602.63 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   705.80 ms
>bench -w 1 -t 1
memcpy: 24.30 GB/s
sum:    error -536870819.000000
>bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     89.4 GFLOPS (128 runs) / F32    119.6 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     27.6 GFLOPS (128 runs) / F32     31.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    172.9 GFLOPS (128 runs) / F32    222.0 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    596.8 GFLOPS (128 runs) / F32    926.4 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1257.0 GFLOPS (128 runs) / F32   1887.7 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1726.5 GFLOPS (101 runs) / F32   2193.9 GFLOPS (128 runs)
ggml_mul_mat:  4096 x  4096: F16   2109.8 GFLOPS ( 16 runs) / F32   2237.5 GFLOPS ( 17 runs)

memcpy: 7.20 GB/s
sum: error -536870997.000000

CPU OS Config Model Th Load Enc. Commit
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 tiny 4 109 3417 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 base 4 180 7907 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 small 4 419 30899 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 medium 4 1851 106542 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 large 4 4715 203455 09e9068

memcpy: 15.57 GB/s

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat: 64 x 64: F16 6.1 GFLOPS (128 runs) / F32 6.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 40.1 GFLOPS (128 runs) / F32 38.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 147.9 GFLOPS (128 runs) / F32 110.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 264.9 GFLOPS (128 runs) / F32 134.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 289.5 GFLOPS (128 runs) / F32 151.9 GFLOPS ( 71 runs)
ggml_mul_mat: 2048 x 2048: F16 290.6 GFLOPS ( 17 runs) / F32 70.7 GFLOPS ( 5 runs)
ggml_mul_mat: 4096 x 4096: F16 114.0 GFLOPS ( 3 runs) / F32 62.7 GFLOPS ( 3 runs)

CPU OS Config Model Th Load Enc. Commit
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 tiny 8 50 361 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 base 8 70 1000 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 small 8 185 2264 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 medium 8 587 8421 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 large 8 2296 15759 09e9068

Running ggml_mul_mat benchmark with 16 threads

ggml_mul_mat: 64 x 64: F16 2.1 GFLOPS (128 runs) / F32 1.9 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 19.6 GFLOPS (128 runs) / F32 14.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 68.1 GFLOPS (128 runs) / F32 84.5 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 200.5 GFLOPS (128 runs) / F32 141.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 271.0 GFLOPS (127 runs) / F32 163.7 GFLOPS ( 77 runs)
ggml_mul_mat: 2048 x 2048: F16 205.5 GFLOPS ( 12 runs) / F32 71.6 GFLOPS ( 5 runs)
ggml_mul_mat: 4096 x 4096: F16 142.3 GFLOPS ( 3 runs) / F32 63.0 GFLOPS ( 3 runs)

CPU OS Config Model Th Load Enc. Commit
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 tiny 16 52 329 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 base 16 72 723 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 small 16 188 2214 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 medium 16 698 10889 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 large 16 1619 16640 09e9068

MacBook Pro 14" with M2 Pro

  • 10 Cores, 16GB RAM
  • macOS Ventura 13.2
  • Benchmarks running at 8 threads
CPU OS Config Model Th Load Enc. Commit
Apple M2 Pro macOS 13.2 NEON BLAS tiny 8 76 161 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS base 8 104 318 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS small 8 221 975 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS medium 8 969 2692 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS large 8 1939 4959 09e9068

NVIDIA Jetson Nano, without GPU optimization:
base-en

 ./bin/main -f samples/jfk.wav 
whisper_init_from_file_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   354.49 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   712.86 ms
whisper_print_timings:   sample time =    79.37 ms /    27 runs (    2.94 ms per run)
whisper_print_timings:   encode time = 24406.28 ms /     1 runs (24406.28 ms per run)
whisper_print_timings:   decode time =  1284.84 ms /    27 runs (   47.59 ms per run)
whisper_print_timings:    total time = 26908.31 ms

tiny-en

./bin/main -m ./models/ggml-tiny.en.bin  -f ./samples/jfk.wav 
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.740]   And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740]   ask what you can do for your country


whisper_print_timings:     load time =   204.60 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   564.90 ms
whisper_print_timings:   sample time =    72.13 ms /    26 runs (    2.77 ms per run)
whisper_print_timings:   encode time =  9232.34 ms /     1 runs ( 9232.34 ms per run)
whisper_print_timings:   decode time =   616.00 ms /    26 runs (   23.69 ms per run)
whisper_print_timings:    total time = 10745.65 ms

MacBook Pro 14" with M2 Pro
10 Cores, 32GB RAM
macOS Ventura 13.2
Benchmarks running at 8 threads
memcpy: 40.68 GB/s

| CPU          | OS     | Config     | Model    | Th | Load | Enc. | Commit  |
| ------------ | ------ | ---------- | -------- | -- | ---- | ---- | ------- |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | tiny     | 8  | 45   | 93   | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | base     | 8  | 68   | 187  | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | small    | 8  | 179  | 702  | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | medium   | 8  | 496  | 2227 | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | large    | 8  | 1037 | 3796 | 09e9068 |

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat:    64 x    64: F16      4.6 GFLOPS (128 runs) / F32      4.1 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     46.6 GFLOPS (128 runs) / F32     36.4 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    294.2 GFLOPS (128 runs) / F32    238.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    611.0 GFLOPS (128 runs) / F32    712.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    770.9 GFLOPS (128 runs) / F32    700.3 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    902.8 GFLOPS ( 53 runs) / F32    906.9 GFLOPS ( 53 runs)
ggml_mul_mat:  4096 x  4096: F16   1521.2 GFLOPS ( 12 runs) / F32   1469.3 GFLOPS ( 11 runs)

MacBook Pro 16" with M2 Max
12 Cores, 96GB RAM
macOS Ventura 13.3
Benchmarks running at 4 threads (4 threads were faster than 8 threads for ggml_mul_mat but about same for model load/encode)
memcpy: 49.94 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16     11.2 GFLOPS (128 runs) / F32      9.3 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     83.0 GFLOPS (128 runs) / F32     73.7 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    505.2 GFLOPS (128 runs) / F32    488.2 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1018.0 GFLOPS (128 runs) / F32   1196.3 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1796.2 GFLOPS (128 runs) / F32   2087.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1638.8 GFLOPS ( 96 runs) / F32   1673.7 GFLOPS ( 98 runs)
ggml_mul_mat:  4096 x  4096: F16   1995.2 GFLOPS ( 15 runs) / F32   2037.8 GFLOPS ( 15 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Apple M2 Max 13.3 NEON BLAS tiny 4 41 118 0a2d121
Apple M2 Max 13.3 NEON BLAS base 4 61 230 0a2d121
Apple M2 Max 13.3 NEON BLAS small 4 153 734 0a2d121
Apple M2 Max 13.3 NEON BLAS medium 4 448 1979 0a2d121
Apple M2 Max 13.3 NEON BLAS large 4 882 3553 0a2d121

Running memcpy benchmark with 1 thread

memcpy: 7.03 GB/s
sum: error -536870997.000000 (how do I fix this?)

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      8.9 GFLOPS (128 runs) / F32     10.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     53.3 GFLOPS (128 runs) / F32     47.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     91.7 GFLOPS (128 runs) / F32     99.4 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    134.2 GFLOPS (128 runs) / F32     94.8 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    182.9 GFLOPS ( 86 runs) / F32    121.2 GFLOPS ( 57 runs)
ggml_mul_mat:  2048 x  2048: F16    180.0 GFLOPS ( 11 runs) / F32     42.4 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     59.1 GFLOPS (  3 runs) / F32     31.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 tiny 4 69 495 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 base 4 111 1128 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 small 4 264 3992 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 medium 4 806 12230 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 large 4 1919 25574 0a2d121

memcpy: 9.49 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 8.8 GFLOPS (128 runs) / F32 10.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 35.4 GFLOPS (128 runs) / F32 49.2 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 61.9 GFLOPS (128 runs) / F32 95.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 64.3 GFLOPS (128 runs) / F32 86.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 74.4 GFLOPS ( 35 runs) / F32 39.9 GFLOPS ( 19 runs)
ggml_mul_mat: 2048 x 2048: F16 56.9 GFLOPS ( 4 runs) / F32 31.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 56.9 GFLOPS ( 3 runs) / F32 30.1 GFLOPS ( 3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 tiny 4 67 761 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 base 4 96 2040 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 small 4 239 7639 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 medium 4 657 23735 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 large 4 1302 45006 0a2d121

HP Z440, Xeon E5-2690v4, 64 GB, Rocky Linux 9.1

memcpy: 10.94 GB/s
sum: error -536870997.000000

./bench -w 2
ggml_mul_mat: 64 x 64: F16 4.8 GFLOPS (128 runs) / F32 4.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 23.1 GFLOPS (128 runs) / F32 18.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 52.5 GFLOPS (128 runs) / F32 35.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 69.6 GFLOPS (128 runs) / F32 44.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 78.8 GFLOPS ( 37 runs) / F32 49.2 GFLOPS ( 23 runs)
ggml_mul_mat: 2048 x 2048: F16 83.6 GFLOPS ( 5 runs) / F32 50.8 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 64.5 GFLOPS ( 3 runs) / F32 21.8 GFLOPS ( 3 runs)

system_info: n_threads = 28 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

whisper_print_timings: load time = 1031.43 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 13121.63 ms / 1 runs (13121.63 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 14219.33 ms

model: large

very impressed

CPU OS Config Model Th Load Enc. Commit
MacBook M1 Max macOS 13.0 beta (22A5321d) NEON BLAS medium 8 488 2344 0a2d121
MacBook M1 Max macOS 13.0 beta (22A5321d) NEON BLAS large 8 1070 3209 0a2d121

What am I doing wrong? 17.6 GFLOPS on a Ryzen 6850H

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:  -lopenblas
I CC:       cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX:      g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0

make: 'bench' is up to date.
ggml_mul_mat:    64 x    64: F16     12.6 GFLOPS (128 runs) / F32      9.8 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     19.4 GFLOPS (128 runs) / F32     12.5 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     27.0 GFLOPS (128 runs) / F32     18.4 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     50.3 GFLOPS (128 runs) / F32     28.1 GFLOPS (105 runs)
ggml_mul_mat:  1024 x  1024: F16     59.0 GFLOPS ( 28 runs) / F32     27.0 GFLOPS ( 13 runs)
ggml_mul_mat:  2048 x  2048: F16     43.0 GFLOPS (  3 runs) / F32     11.4 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     17.6 GFLOPS (  3 runs) / F32      6.6 GFLOPS (  3 runs)

MacBook Pro M2 Max 96 GB 16-inch, 2023 13.3.1 (22E261)

I tried running 8 and 12 threads; they were a few ms slower than 4 threads, so the default of 4 threads seems to be the sweet spot.
I also have not compiled anything Apple-specific, just git clone and make.

> ./extra/bench-all.sh 8
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 50.22 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat: 64 x 64: F16 5.0 GFLOPS (128 runs) / F32 4.7 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 46.1 GFLOPS (128 runs) / F32 38.3 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 294.0 GFLOPS (128 runs) / F32 243.7 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 574.5 GFLOPS (128 runs) / F32 272.9 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 736.6 GFLOPS (128 runs) / F32 750.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 973.7 GFLOPS ( 57 runs) / F32 993.7 GFLOPS ( 58 runs)
ggml_mul_mat: 4096 x 4096: F16 1554.5 GFLOPS ( 12 runs) / F32 1553.6 GFLOPS ( 12 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
NEON BLAS tiny 8 40 101 c23588c
NEON BLAS base 8 61 223 c23588c
NEON BLAS small 8 154 961 c23588c
NEON BLAS medium 8 436 2534 c23588c
NEON BLAS large 8 867 4100 c23588c

Same hardware as in the post above. I've just tried converting the models to Core ML, and here are the results. My personal impression of running STT with them was very good: much faster.
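For context, the Core ML path is roughly the one documented in the repo (a sketch; script name and build flag as I remember them, so treat them as assumptions):

# convert a ggml model to a Core ML encoder (needs the Python coremltools environment)
./models/generate-coreml-model.sh base.en

# rebuild with Core ML support enabled
make clean
WHISPER_COREML=1 make -j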


./extra/bench-all.sh 4
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 49.33 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 9.1 GFLOPS (128 runs) / F32 8.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 70.7 GFLOPS (128 runs) / F32 77.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 350.7 GFLOPS (128 runs) / F32 435.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1060.0 GFLOPS (128 runs) / F32 1254.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1611.0 GFLOPS (128 runs) / F32 1652.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1887.2 GFLOPS (110 runs) / F32 1900.9 GFLOPS (111 runs)
ggml_mul_mat: 4096 x 4096: F16 1806.0 GFLOPS ( 14 runs) / F32 1849.3 GFLOPS ( 14 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
NEON BLAS COREML tiny 4 42 30 c23588c
NEON BLAS COREML base 4 60 49 c23588c
NEON BLAS COREML small 4 151 169 c23588c
NEON BLAS COREML medium 4 430 737 c23588c
NEON BLAS COREML large 4 885 1672 c23588c

Dell 3050 Micro
Running memcpy benchmark with 1 thread
memcpy: 11.49 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 7.7 GFLOPS (128 runs) / F32 3.3 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 27.7 GFLOPS (128 runs) / F32 7.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 50.8 GFLOPS (128 runs) / F32 8.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 59.4 GFLOPS (128 runs) / F32 9.0 GFLOPS ( 34 runs)
ggml_mul_mat: 1024 x 1024: F16 51.5 GFLOPS ( 24 runs) / F32 8.4 GFLOPS ( 4 runs)
ggml_mul_mat: 2048 x 2048: F16 46.3 GFLOPS ( 3 runs) / F32 8.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 47.3 GFLOPS ( 3 runs) / F32 8.1 GFLOPS ( 3 runs)

CPU OS Config Model Th Load Enc. Commit
i3-7100t Ubuntu 22.04 AVX2 tiny 4 84 1125 c23588c
i3-7100t Ubuntu 22.04 AVX2 base 4 128 2616 c23588c
i3-7100t Ubuntu 22.04 AVX2 small 4 339 10127 c23588c
i3-7100t Ubuntu 22.04 AVX2 medium 4 991 39383 c23588c
i3-7100t Ubuntu 22.04 AVX2 large 4 2922 74488 c23588c
j1nx commented

Lenovo ThinkCentre M720q

Running memcpy benchmark with 1 thread

memcpy: 6.54 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 8.6 GFLOPS (128 runs) / F32 4.5 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 38.8 GFLOPS (128 runs) / F32 7.9 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 76.2 GFLOPS (128 runs) / F32 9.6 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 87.4 GFLOPS (128 runs) / F32 10.0 GFLOPS ( 38 runs)
ggml_mul_mat: 1024 x 1024: F16 89.7 GFLOPS ( 42 runs) / F32 10.1 GFLOPS ( 5 runs)
ggml_mul_mat: 2048 x 2048: F16 67.7 GFLOPS ( 4 runs) / F32 9.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 54.7 GFLOPS ( 3 runs) / F32 8.6 GFLOPS ( 3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
i5-8500T OpenVoiceOS AVX2 tiny.en 4 79 686 70567ef
i5-8500T OpenVoiceOS AVX2 base.en 4 121 1600 70567ef
i5-8500T OpenVoiceOS AVX2 small.en 4 320 6197 70567ef
i5-8500T OpenVoiceOS AVX2 medium.en 4 928 20276 70567ef

Running memcpy benchmark with 1 thread

memcpy: 7.16 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 6 threads

ggml_mul_mat: 64 x 64: F16 1.9 GFLOPS (128 runs) / F32 1.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 29.7 GFLOPS (128 runs) / F32 7.3 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 65.5 GFLOPS (128 runs) / F32 14.5 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 123.4 GFLOPS (128 runs) / F32 15.2 GFLOPS ( 57 runs)
ggml_mul_mat: 1024 x 1024: F16 127.5 GFLOPS ( 60 runs) / F32 14.7 GFLOPS ( 7 runs)
ggml_mul_mat: 2048 x 2048: F16 93.3 GFLOPS ( 6 runs) / F32 13.3 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 70.0 GFLOPS ( 3 runs) / F32 12.5 GFLOPS ( 3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
i5-8500T OpenVoiceOS AVX2 tiny.en 6 78 511 70567ef
i5-8500T OpenVoiceOS AVX2 base.en 6 118 1264 70567ef
i5-8500T OpenVoiceOS AVX2 small.en 6 320 4587 70567ef
i5-8500T OpenVoiceOS AVX2 medium.en 6 928 16303 70567ef

Yet another M1 Ultra, but look at the bottom for a comparison to the Const-me GPU version:
memcpy: 42.66 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 9.1 GFLOPS (128 runs) / F32 7.1 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 68.2 GFLOPS (128 runs) / F32 68.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 465.0 GFLOPS (128 runs) / F32 386.2 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1131.9 GFLOPS (128 runs) / F32 1437.0 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 2188.9 GFLOPS (128 runs) / F32 2519.6 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 2938.8 GFLOPS (128 runs) / F32 2996.5 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16 3074.7 GFLOPS ( 23 runs) / F32 3167.2 GFLOPS ( 24 runs)

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------- |
| M1 Ultra | Ventura 13.3.1 | NEON BLAS | large | 4 | 858 | 3649 | 70567ef |

Much more interesting, I find, is the comparison I did against a Win10 Core i9 9900K with an Nvidia A4000 using the Const-me version. I used a 10-minute portion of a "real" TV show (-l de, about 56k tokens known in the model). Note that the power consumption was actually measured, not just estimated.

Const-me Whisper GPU (~450-550 W real power consumption at 100% GPU utilisation; the CPU is mostly idle)
A4000 1x parallel 93s
A4000 2x parallel both finish at 180s
A4000 4x parallel 3 finish after 317s, 1 finishes at 453s

macOS, M1 Ultra (70-90 W real power consumption at 100% "CPU" utilisation)
whisper.cpp: default settings, 1 core, 4 threads
Macos 1x : 155 s
Macos 2x parallel: 196 s - all finish at same time
Macos 4x parallel: 274s - all finish at same time
Macos 6x parallel: 462s - all finish at same time

Also some other tests with different command-line params, on the M1 only, with 1 file (a full example invocation is sketched after the timings below):
-p8 (threads default 4) - system unresponsive while processing
120.3 seconds

-p 4 (default threads 4, ~80% cpu utilisation)
79.37545

-bs 2 -p 4
101.01730

-t 16 threads (processors default 1)
148.713

-p 8 -t 2
98.91152
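Putting those flags together, a hedged example invocation (file name and model choice are placeholders; -p is processors, -t threads, -bs beam size):

./main -m models/ggml-large.bin -f show-10min.wav -l de -p 4 -t 4 -bs 2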

We currently use the Const-me GPU version on an Nvidia A5000 because, on an Intel CPU, it delivers much faster results than this cpp version can. It also looks like the Const-me version is not going anywhere development-wise, while this repository is vibrant.

In conclusion, even though I hate it, we are buying this Mac because it delivers faster results and more throughput while consuming only about 20% of the power. It also distributes processing power much better between multiple parallel processes; I bet I can even use nice to set priorities, whereas on the GPU no priorities are possible at all.

At our usage level, that means the Mac (~4000 euros) pays for itself after 2-3 years of operation (due to lower power and A/C costs) compared to running on the Windows/GPU box, which we bought for about the same initial price. Even if I could now safely say we don't need an A5000, just some gamer card for 600 euros, looking at power costs these days I'd still prefer the Mac. (Thank god I don't need to put it into Active Directory or the like, so I can simply use it as a dedicated processing machine.)

It would be great if idle/peak watts could be posted, as I have been posting benches for RK3588 devices, which probably give the minimum usable results and even then are a tad slow.
In that price range I just posted an i3-7100T that was picked up for £64 off eBay, which is approx 8 watts idle / 30 peak.
I used to be a bit of an Apple hater in terms of bling tech, but bang for buck the M1 Mini is surprisingly good value, and with race-to-idle it could likely process quite a number of zones, especially because of diversification of use.

I am on disability, so even though it's relatively cheap, the £849.00 for the 16 GB version could probably be the basis of the ultimate home assistant, something similar to https://github.com/ggerganov/whisper.cpp/blob/master/examples/talk-llama/talk-llama.cpp
So likely I will continue posting in the £64 range :)

But what Apple/Arm provide per watt currently is pretty special, and for 24/365 operation in an energy-expensive world that is pretty important.
Dunno how many people could post idle & peak wattages too, but it would be really interesting, especially for CPU vs GPU rather than just outright speed.

Rock 5b

memcpy: 8.78 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.2 GFLOPS (128 runs) | Q4_1     7.6 GFLOPS (128 runs) | Q4_2     6.9 GFLOPS (128 runs)
  64 x   64: Q5_0     6.8 GFLOPS (128 runs) | Q5_1     7.0 GFLOPS (128 runs) | Q8_0     7.1 GFLOPS (128 runs)
  64 x   64: F16      8.6 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 128 x  128: Q4_0    22.8 GFLOPS (128 runs) | Q4_1    22.4 GFLOPS (128 runs) | Q4_2    19.6 GFLOPS (128 runs)
 128 x  128: Q5_0    19.5 GFLOPS (128 runs) | Q5_1    20.7 GFLOPS (128 runs) | Q8_0    22.7 GFLOPS (128 runs)
 128 x  128: F16     28.3 GFLOPS (128 runs) | F32     29.4 GFLOPS (128 runs)
 256 x  256: Q4_0    40.6 GFLOPS (128 runs) | Q4_1    37.6 GFLOPS (128 runs) | Q4_2    30.5 GFLOPS (128 runs)
 256 x  256: Q5_0    31.2 GFLOPS (128 runs) | Q5_1    31.9 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     51.8 GFLOPS (128 runs) | F32     36.9 GFLOPS (128 runs)
 512 x  512: Q4_0    52.0 GFLOPS (128 runs) | Q4_1    45.4 GFLOPS (128 runs) | Q4_2    35.7 GFLOPS (128 runs)
 512 x  512: Q5_0    37.4 GFLOPS (128 runs) | Q5_1    36.9 GFLOPS (128 runs) | Q8_0    64.9 GFLOPS (128 runs)
 512 x  512: F16     76.9 GFLOPS (128 runs) | F32     30.7 GFLOPS (115 runs)
1024 x 1024: Q4_0    56.6 GFLOPS ( 27 runs) | Q4_1    47.5 GFLOPS ( 23 runs) | Q4_2    37.5 GFLOPS ( 18 runs)
1024 x 1024: Q5_0    39.5 GFLOPS ( 19 runs) | Q5_1    37.7 GFLOPS ( 18 runs) | Q8_0    71.1 GFLOPS ( 34 runs)
1024 x 1024: F16     49.0 GFLOPS ( 23 runs) | F32     22.4 GFLOPS ( 11 runs)
2048 x 2048: Q4_0    54.2 GFLOPS (  4 runs) | Q4_1    44.6 GFLOPS (  3 runs) | Q4_2    38.5 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.4 GFLOPS (  3 runs) | Q5_1    35.5 GFLOPS (  3 runs) | Q8_0    61.0 GFLOPS (  4 runs)
2048 x 2048: F16     41.3 GFLOPS (  3 runs) | F32     19.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    56.2 GFLOPS (  3 runs) | Q4_1    45.4 GFLOPS (  3 runs) | Q4_2    38.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.7 GFLOPS (  3 runs) | Q5_1    37.3 GFLOPS (  3 runs) | Q8_0    63.2 GFLOPS (  3 runs)
4096 x 4096: F16     40.0 GFLOPS (  3 runs) | F32     17.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | tiny | 4 | 102 | 1191 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | base | 4 | 140 | 2861 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | small | 4 | 393 | 10576 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | medium | 4 | 10289 | 36042 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | large | 4 | 2099 | 70740 | be5911a |

How do you get these numbers, @StuartIanNaylor? 😲
Isn't the Rock 5b basically the same as the Orange Pi 5?

Orange Pi 5 8GB:

Running memcpy benchmark

memcpy: 10.14 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     4.7 GFLOPS (128 runs) | Q4_1     4.8 GFLOPS (128 runs) | Q4_2     4.6 GFLOPS (128 runs)
  64 x   64: Q5_0     4.2 GFLOPS (128 runs) | Q5_1     4.4 GFLOPS (128 runs) | Q8_0     4.4 GFLOPS (128 runs)
  64 x   64: F16      4.8 GFLOPS (128 runs) | F32      4.4 GFLOPS (128 runs)
 128 x  128: Q4_0     4.2 GFLOPS (128 runs) | Q4_1     9.8 GFLOPS (128 runs) | Q4_2    10.0 GFLOPS (128 runs)
 128 x  128: Q5_0     8.4 GFLOPS (128 runs) | Q5_1     8.2 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 128 x  128: F16     10.3 GFLOPS (128 runs) | F32     10.7 GFLOPS (128 runs)
 256 x  256: Q4_0    34.7 GFLOPS (128 runs) | Q4_1    34.9 GFLOPS (128 runs) | Q4_2    33.9 GFLOPS (128 runs)
 256 x  256: Q5_0    26.2 GFLOPS (128 runs) | Q5_1    24.9 GFLOPS (128 runs) | Q8_0    36.1 GFLOPS (128 runs)
 256 x  256: F16     36.4 GFLOPS (128 runs) | F32     38.4 GFLOPS (128 runs)
 512 x  512: Q4_0    22.2 GFLOPS ( 83 runs) | Q4_1    26.1 GFLOPS ( 98 runs) | Q4_2    35.5 GFLOPS (128 runs)
 512 x  512: Q5_0    42.4 GFLOPS (128 runs) | Q5_1    26.8 GFLOPS (100 runs) | Q8_0    35.8 GFLOPS (128 runs)
 512 x  512: F16     21.6 GFLOPS ( 81 runs) | F32     31.5 GFLOPS (118 runs)
1024 x 1024: Q4_0    32.4 GFLOPS ( 16 runs) | Q4_1    44.1 GFLOPS ( 21 runs) | Q4_2    39.7 GFLOPS ( 19 runs)
1024 x 1024: Q5_0    42.3 GFLOPS ( 20 runs) | Q5_1    40.4 GFLOPS ( 20 runs) | Q8_0    41.2 GFLOPS ( 20 runs)
1024 x 1024: F16     46.8 GFLOPS ( 22 runs) | F32     42.1 GFLOPS ( 20 runs)
2048 x 2048: Q4_0    50.9 GFLOPS (  4 runs) | Q4_1    48.6 GFLOPS (  3 runs) | Q4_2    48.0 GFLOPS (  3 runs)
2048 x 2048: Q5_0    46.7 GFLOPS (  3 runs) | Q5_1    47.8 GFLOPS (  3 runs) | Q8_0    46.4 GFLOPS (  3 runs)
2048 x 2048: F16     46.1 GFLOPS (  3 runs) | F32     44.8 GFLOPS (  3 runs)
4096 x 4096: Q4_0    42.2 GFLOPS (  3 runs) | Q4_1    36.7 GFLOPS (  3 runs) | Q4_2    33.0 GFLOPS (  3 runs)
4096 x 4096: Q5_0    38.5 GFLOPS (  3 runs) | Q5_1    44.7 GFLOPS (  3 runs) | Q8_0    44.7 GFLOPS (  3 runs)
4096 x 4096: F16     44.4 GFLOPS (  3 runs) | F32     44.5 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
RK3588S Armbian 11 - 5.10.110 NEON BLAS tiny 4 193 3748 be5911a
RK3588S Armbian 11 - 5.10.110 NEON BLAS tiny-q5_0 4 156 3341 be5911a
RK3588S Armbian 11 - 5.10.110 NEON BLAS base 4 253 7359 be5911a
RK3588S Armbian 11 - 5.10.110 NEON BLAS base-q5_0 4 178 7307 be5911a

[EDIT: a bit better without OpenBLAS although the GFLOPS are considerably lower O_o]

CPU OS Config Model Th Load Enc. Commit
RK3588S Armbian 11 - 5.10.110 NEON tiny 4 111 3170 be5911a
RK3588S Armbian 11 - 5.10.110 NEON tiny-q5_0 4 205 2817 be5911a
RK3588S Armbian 11 - 5.10.110 NEON base 4 248 6385 be5911a
RK3588S Armbian 11 - 5.10.110 NEON base-q5_0 4 140 6198 be5911a

[EDIT2: getting very unstable results right now 🤔 ]

CPU OS Config Model Th Load Enc. Commit
RK3588S Armbian 11 - 5.10.110 NEON tiny 4 269 1722 be5911a
RK3588S Armbian 11 - 5.10.110 NEON tiny-q5_0 4 104 2746 be5911a
RK3588S Armbian 11 - 5.10.110 NEON base 4 243 7063 be5911a
RK3588S Armbian 11 - 5.10.110 NEON base-q5_0 4 135 6516 be5911a

I don't use Armbian, though; I use the server image supplied by Radxa, and likewise the one from Orange Pi.
Generally I stay clear of Armbian due to a pet hate of their sprawling init scripts that replace standard installs and /etc and often blindside me.

I'll add some tips and tricks I gathered when Radxa did the community board bring-up.
I have changed my preference for the CPU governor and set it to performance, and, although I don't know why, using taskset to make sure it only uses the big cores gives a slight perf boost (commands further down).

So running again I get

memcpy: 8.56 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.3 GFLOPS (128 runs) | Q4_1     7.8 GFLOPS (128 runs) | Q4_2     6.9 GFLOPS (128 runs)
  64 x   64: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0     7.0 GFLOPS (128 runs)
  64 x   64: F16      2.4 GFLOPS (128 runs) | F32      8.5 GFLOPS (128 runs)
 128 x  128: Q4_0    23.2 GFLOPS (128 runs) | Q4_1    24.1 GFLOPS (128 runs) | Q4_2    19.9 GFLOPS (128 runs)
 128 x  128: Q5_0    15.4 GFLOPS (128 runs) | Q5_1    21.0 GFLOPS (128 runs) | Q8_0    26.6 GFLOPS (128 runs)
 128 x  128: F16     35.0 GFLOPS (128 runs) | F32     28.6 GFLOPS (128 runs)
 256 x  256: Q4_0    41.2 GFLOPS (128 runs) | Q4_1    38.7 GFLOPS (128 runs) | Q4_2    30.5 GFLOPS (128 runs)
 256 x  256: Q5_0    31.2 GFLOPS (128 runs) | Q5_1    31.9 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     65.0 GFLOPS (128 runs) | F32     43.5 GFLOPS (128 runs)
 512 x  512: Q4_0    52.0 GFLOPS (128 runs) | Q4_1    45.4 GFLOPS (128 runs) | Q4_2    35.3 GFLOPS (128 runs)
 512 x  512: Q5_0    37.4 GFLOPS (128 runs) | Q5_1    36.8 GFLOPS (128 runs) | Q8_0    64.9 GFLOPS (128 runs)
 512 x  512: F16     78.1 GFLOPS (128 runs) | F32     30.6 GFLOPS (114 runs)
1024 x 1024: Q4_0    56.4 GFLOPS ( 27 runs) | Q4_1    47.4 GFLOPS ( 23 runs) | Q4_2    37.5 GFLOPS ( 18 runs)
1024 x 1024: Q5_0    39.5 GFLOPS ( 19 runs) | Q5_1    37.7 GFLOPS ( 18 runs) | Q8_0    70.8 GFLOPS ( 33 runs)
1024 x 1024: F16     47.2 GFLOPS ( 22 runs) | F32     21.8 GFLOPS ( 11 runs)
2048 x 2048: Q4_0    54.4 GFLOPS (  4 runs) | Q4_1    45.3 GFLOPS (  3 runs) | Q4_2    38.6 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.4 GFLOPS (  3 runs) | Q5_1    35.6 GFLOPS (  3 runs) | Q8_0    59.8 GFLOPS (  4 runs)
2048 x 2048: F16     41.2 GFLOPS (  3 runs) | F32     20.6 GFLOPS (  3 runs)
4096 x 4096: Q4_0    56.9 GFLOPS (  3 runs) | Q4_1    46.6 GFLOPS (  3 runs) | Q4_2    38.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    41.1 GFLOPS (  3 runs) | Q5_1    37.4 GFLOPS (  3 runs) | Q8_0    62.9 GFLOPS (  3 runs)
4096 x 4096: F16     39.8 GFLOPS (  3 runs) | F32     17.6 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 96 | 1199 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 137 | 2875 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 343 | 10635 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 1013 | 35174 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 2019 | 71678 | be5911a |

If I run without first setting echo performance | tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor /sys/class/devfreq/dmc/governor (the rk3588[x] is a tri-cluster 4-2-2 part; I'm not sure the dmc governor matters, but it was something we were using at the time), the numbers drop.
Prefix (taskset -c 4-7) to further enforce not using the efficiency cores. Both steps are sketched below.
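For reference, a minimal sketch of the two steps as I run them (paths as on my rk3588 boards; cpu[046] hits the first core of each cluster, the dmc entry may not exist on every kernel, and the tee needs root):

# set the performance governor on the CPU clusters and the memory controller
echo performance | tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor /sys/class/devfreq/dmc/governor
# pin the benchmark to the big cores so the efficiency cores are left idle
taskset -c 4-7 ./extra/bench-all.sh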

The ondemand governor seems to load-balance, whereas for Whisper.cpp at least, a race-till-idle setup more like how Android is configured does seem to give a perf boost with little, if any, loss in efficiency.

Without those settings, bench gives

memcpy: 7.82 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.1 GFLOPS (128 runs) | Q4_1     2.8 GFLOPS (128 runs) | Q4_2     2.4 GFLOPS (128 runs)
  64 x   64: Q5_0     2.3 GFLOPS (128 runs) | Q5_1     2.2 GFLOPS (128 runs) | Q8_0     2.7 GFLOPS (128 runs)
  64 x   64: F16      3.1 GFLOPS (128 runs) | F32      2.6 GFLOPS (128 runs)
 128 x  128: Q4_0     7.1 GFLOPS (128 runs) | Q4_1     7.0 GFLOPS (128 runs) | Q4_2     6.2 GFLOPS (128 runs)
 128 x  128: Q5_0     5.4 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     7.2 GFLOPS (128 runs)
 128 x  128: F16      9.3 GFLOPS (128 runs) | F32      5.9 GFLOPS (128 runs)
 256 x  256: Q4_0    10.1 GFLOPS (128 runs) | Q4_1     9.5 GFLOPS (128 runs) | Q4_2     8.4 GFLOPS (128 runs)
 256 x  256: Q5_0     7.4 GFLOPS (128 runs) | Q5_1     6.9 GFLOPS (128 runs) | Q8_0    10.9 GFLOPS (128 runs)
 256 x  256: F16     13.4 GFLOPS (128 runs) | F32      7.9 GFLOPS (128 runs)
 512 x  512: Q4_0    10.9 GFLOPS ( 41 runs) | Q4_1    10.4 GFLOPS ( 39 runs) | Q4_2     8.5 GFLOPS ( 32 runs)
 512 x  512: Q5_0     8.9 GFLOPS ( 34 runs) | Q5_1     8.2 GFLOPS ( 31 runs) | Q8_0    12.1 GFLOPS ( 46 runs)
 512 x  512: F16     14.5 GFLOPS ( 54 runs) | F32      8.7 GFLOPS ( 33 runs)
1024 x 1024: Q4_0    26.9 GFLOPS ( 13 runs) | Q4_1    24.9 GFLOPS ( 12 runs) | Q4_2    21.7 GFLOPS ( 11 runs)
1024 x 1024: Q5_0    23.0 GFLOPS ( 11 runs) | Q5_1    22.0 GFLOPS ( 11 runs) | Q8_0    29.1 GFLOPS ( 14 runs)
1024 x 1024: F16     28.2 GFLOPS ( 14 runs) | F32     17.9 GFLOPS (  9 runs)
2048 x 2048: Q4_0    50.1 GFLOPS (  3 runs) | Q4_1    41.3 GFLOPS (  3 runs) | Q4_2    36.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    36.0 GFLOPS (  3 runs) | Q5_1    33.2 GFLOPS (  3 runs) | Q8_0    53.7 GFLOPS (  4 runs)
2048 x 2048: F16     37.5 GFLOPS (  3 runs) | F32     19.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    55.7 GFLOPS (  3 runs) | Q4_1    43.7 GFLOPS (  3 runs) | Q4_2    39.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.5 GFLOPS (  3 runs) | Q5_1    36.1 GFLOPS (  3 runs) | Q8_0    65.8 GFLOPS (  3 runs)
4096 x 4096: F16     36.8 GFLOPS (  3 runs) | F32     18.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 171 | 1817 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 255 | 3529 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 433 | 11208 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 1814 | 36829 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 36647 | 71393 | be5911a |

I will tack on the OPi 5 next, as I think it is a smidge faster.
So, without the governor/taskset tweaks again:

memcpy: 8.26 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.1 GFLOPS (128 runs) | Q4_1     3.3 GFLOPS (128 runs) | Q4_2     3.4 GFLOPS (128 runs)
  64 x   64: Q5_0     1.7 GFLOPS (128 runs) | Q5_1     3.1 GFLOPS (128 runs) | Q8_0     2.9 GFLOPS (128 runs)
  64 x   64: F16      4.0 GFLOPS (128 runs) | F32      3.5 GFLOPS (128 runs)
 128 x  128: Q4_0     7.8 GFLOPS (128 runs) | Q4_1     6.6 GFLOPS (128 runs) | Q4_2     6.7 GFLOPS (128 runs)
 128 x  128: Q5_0     5.6 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     8.7 GFLOPS (128 runs)
 128 x  128: F16     10.1 GFLOPS (128 runs) | F32      6.3 GFLOPS (128 runs)
 256 x  256: Q4_0    10.5 GFLOPS (128 runs) | Q4_1     9.1 GFLOPS (128 runs) | Q4_2     7.9 GFLOPS (128 runs)
 256 x  256: Q5_0     7.0 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0    12.6 GFLOPS (128 runs)
 256 x  256: F16     12.6 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 512 x  512: Q4_0    11.9 GFLOPS ( 45 runs) | Q4_1    10.8 GFLOPS ( 41 runs) | Q4_2    10.0 GFLOPS ( 38 runs)
 512 x  512: Q5_0     8.5 GFLOPS ( 32 runs) | Q5_1     7.9 GFLOPS ( 30 runs) | Q8_0    14.5 GFLOPS ( 54 runs)
 512 x  512: F16     14.2 GFLOPS ( 53 runs) | F32      8.3 GFLOPS ( 32 runs)
1024 x 1024: Q4_0    30.4 GFLOPS ( 15 runs) | Q4_1    28.9 GFLOPS ( 14 runs) | Q4_2    23.6 GFLOPS ( 11 runs)
1024 x 1024: Q5_0    23.0 GFLOPS ( 11 runs) | Q5_1    23.5 GFLOPS ( 12 runs) | Q8_0    37.4 GFLOPS ( 18 runs)
1024 x 1024: F16     33.9 GFLOPS ( 16 runs) | F32     18.0 GFLOPS (  9 runs)
2048 x 2048: Q4_0    51.4 GFLOPS (  4 runs) | Q4_1    42.5 GFLOPS (  3 runs) | Q4_2    36.5 GFLOPS (  3 runs)
2048 x 2048: Q5_0    36.0 GFLOPS (  3 runs) | Q5_1    32.7 GFLOPS (  3 runs) | Q8_0    59.0 GFLOPS (  4 runs)
2048 x 2048: F16     39.4 GFLOPS (  3 runs) | F32     17.5 GFLOPS (  3 runs)
4096 x 4096: Q4_0    58.8 GFLOPS (  3 runs) | Q4_1    47.0 GFLOPS (  3 runs) | Q4_2    39.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.8 GFLOPS (  3 runs) | Q5_1    37.3 GFLOPS (  3 runs) | Q8_0    65.1 GFLOPS (  3 runs)
4096 x 4096: F16     40.6 GFLOPS (  3 runs) | F32     18.6 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 133 | 1235 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 232 | 2941 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 470 | 10870 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 23195 | 36162 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 46511 | 90187 | be5911a |

Then, having set the performance governor via sudo orangepi-config (no dmc entry on this image), I ran:
taskset -c 4-7 ./extra/bench-all.sh

memcpy: 8.22 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     0.7 GFLOPS (128 runs) | Q4_1     1.6 GFLOPS (128 runs) | Q4_2     1.0 GFLOPS (128 runs)
  64 x   64: Q5_0     0.6 GFLOPS (128 runs) | Q5_1     0.8 GFLOPS (128 runs) | Q8_0     1.4 GFLOPS (128 runs)
  64 x   64: F16      1.9 GFLOPS (128 runs) | F32      0.8 GFLOPS (128 runs)
 128 x  128: Q4_0     8.9 GFLOPS (128 runs) | Q4_1     3.8 GFLOPS (128 runs) | Q4_2     3.1 GFLOPS (128 runs)
 128 x  128: Q5_0     5.8 GFLOPS (128 runs) | Q5_1     3.8 GFLOPS (128 runs) | Q8_0     7.8 GFLOPS (128 runs)
 128 x  128: F16      5.2 GFLOPS (128 runs) | F32      3.6 GFLOPS (128 runs)
 256 x  256: Q4_0    13.1 GFLOPS (128 runs) | Q4_1    12.1 GFLOPS (128 runs) | Q4_2    12.1 GFLOPS (128 runs)
 256 x  256: Q5_0    12.8 GFLOPS (128 runs) | Q5_1    13.4 GFLOPS (128 runs) | Q8_0    17.9 GFLOPS (128 runs)
 256 x  256: F16     17.6 GFLOPS (128 runs) | F32     11.0 GFLOPS (128 runs)
 512 x  512: Q4_0    33.3 GFLOPS (125 runs) | Q4_1    34.7 GFLOPS (128 runs) | Q4_2    21.9 GFLOPS ( 82 runs)
 512 x  512: Q5_0    21.4 GFLOPS ( 80 runs) | Q5_1    22.4 GFLOPS ( 84 runs) | Q8_0    35.2 GFLOPS (128 runs)
 512 x  512: F16     37.1 GFLOPS (128 runs) | F32     23.2 GFLOPS ( 87 runs)
1024 x 1024: Q4_0    54.9 GFLOPS ( 26 runs) | Q4_1    44.3 GFLOPS ( 21 runs) | Q4_2    31.4 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    35.7 GFLOPS ( 17 runs) | Q5_1    32.1 GFLOPS ( 15 runs) | Q8_0    66.5 GFLOPS ( 31 runs)
1024 x 1024: F16     45.0 GFLOPS ( 21 runs) | F32     19.6 GFLOPS ( 10 runs)
2048 x 2048: Q4_0    54.6 GFLOPS (  4 runs) | Q4_1    45.2 GFLOPS (  3 runs) | Q4_2    38.4 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.9 GFLOPS (  3 runs) | Q5_1    34.7 GFLOPS (  3 runs) | Q8_0    59.9 GFLOPS (  4 runs)
2048 x 2048: F16     40.5 GFLOPS (  3 runs) | F32     20.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    59.5 GFLOPS (  3 runs) | Q4_1    47.7 GFLOPS (  3 runs) | Q4_2    40.1 GFLOPS (  3 runs)
4096 x 4096: Q5_0    42.7 GFLOPS (  3 runs) | Q5_1    39.6 GFLOPS (  3 runs) | Q8_0    60.7 GFLOPS (  3 runs)
4096 x 4096: F16     35.5 GFLOPS (  3 runs) | F32     20.8 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 119 | 1178 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 168 | 2910 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 399 | 10784 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 23469 | 35952 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 47147 | 76405 | be5911a |

I ran that again, as I think transformers do bounce around a bit before ending up with the same tokens.

memcpy: 9.46 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.1 GFLOPS (128 runs) | Q4_1     7.6 GFLOPS (128 runs) | Q4_2     6.6 GFLOPS (128 runs)
  64 x   64: Q5_0     6.3 GFLOPS (128 runs) | Q5_1     6.9 GFLOPS (128 runs) | Q8_0     6.6 GFLOPS (128 runs)
  64 x   64: F16      7.8 GFLOPS (128 runs) | F32      7.3 GFLOPS (128 runs)
 128 x  128: Q4_0    23.8 GFLOPS (128 runs) | Q4_1    25.0 GFLOPS (128 runs) | Q4_2     8.5 GFLOPS (128 runs)
 128 x  128: Q5_0    19.1 GFLOPS (128 runs) | Q5_1    20.8 GFLOPS (128 runs) | Q8_0    26.4 GFLOPS (128 runs)
 128 x  128: F16     34.8 GFLOPS (128 runs) | F32     28.6 GFLOPS (128 runs)
 256 x  256: Q4_0    43.4 GFLOPS (128 runs) | Q4_1    42.0 GFLOPS (128 runs) | Q4_2    31.3 GFLOPS (128 runs)
 256 x  256: Q5_0    30.5 GFLOPS (128 runs) | Q5_1    32.0 GFLOPS (128 runs) | Q8_0    41.7 GFLOPS (128 runs)
 256 x  256: F16     60.0 GFLOPS (128 runs) | F32     42.9 GFLOPS (128 runs)
 512 x  512: Q4_0    56.5 GFLOPS (128 runs) | Q4_1    49.5 GFLOPS (128 runs) | Q4_2    36.6 GFLOPS (128 runs)
 512 x  512: Q5_0    36.7 GFLOPS (128 runs) | Q5_1    36.8 GFLOPS (128 runs) | Q8_0    69.9 GFLOPS (128 runs)
 512 x  512: F16     78.5 GFLOPS (128 runs) | F32     30.1 GFLOPS (113 runs)
1024 x 1024: Q4_0    62.7 GFLOPS ( 30 runs) | Q4_1    52.2 GFLOPS ( 25 runs) | Q4_2    38.9 GFLOPS ( 19 runs)
1024 x 1024: Q5_0    39.2 GFLOPS ( 19 runs) | Q5_1    38.2 GFLOPS ( 18 runs) | Q8_0    76.2 GFLOPS ( 36 runs)
1024 x 1024: F16     46.7 GFLOPS ( 22 runs) | F32     21.6 GFLOPS ( 11 runs)
2048 x 2048: Q4_0    60.4 GFLOPS (  4 runs) | Q4_1    50.3 GFLOPS (  3 runs) | Q4_2    39.6 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.9 GFLOPS (  3 runs) | Q5_1    35.4 GFLOPS (  3 runs) | Q8_0    66.5 GFLOPS (  4 runs)
2048 x 2048: F16     33.8 GFLOPS (  3 runs) | F32     15.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    64.2 GFLOPS (  3 runs) | Q4_1    51.2 GFLOPS (  3 runs) | Q4_2    40.2 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.7 GFLOPS (  3 runs) | Q5_1    37.2 GFLOPS (  3 runs) | Q8_0    71.5 GFLOPS (  3 runs)
4096 x 4096: F16     38.5 GFLOPS (  3 runs) | F32     20.3 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 103 | 1166 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 152 | 2888 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 379 | 10892 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 22649 | 35767 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 45427 | 73967 | be5911a |

But I don't seem to get that much variance; race-till-idle is just a preference.

Prefix (taskset -c 4-7) to further enforce not using the efficiency cores.

Tried that, played with the CPU settings (performance mode etc.), even added some better cooling, but it still keeps jumping all over the place, with the tiny model at ~2s (in the good runs) while 'htop' shows consistent 100% load on the performance cores. Q5 models are sometimes a few ms faster, sometimes slower.
When I do the same tests with the CTranslate2 Whisper version, results are pretty stable and always about twice as fast.

Dunno; just to show it, my next run is very consistent and considerably faster... ?

memcpy: 10.52 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     2.5 GFLOPS (128 runs) | Q4_1     2.5 GFLOPS (128 runs) | Q4_2     1.3 GFLOPS (128 runs)
  64 x   64: Q5_0     1.0 GFLOPS (128 runs) | Q5_1     0.6 GFLOPS (128 runs) | Q8_0     0.8 GFLOPS (128 runs)
  64 x   64: F16      1.0 GFLOPS (128 runs) | F32      1.8 GFLOPS (128 runs)
 128 x  128: Q4_0     2.8 GFLOPS (128 runs) | Q4_1     2.2 GFLOPS (128 runs) | Q4_2     6.7 GFLOPS (128 runs)
 128 x  128: Q5_0     3.2 GFLOPS (128 runs) | Q5_1     5.5 GFLOPS (128 runs) | Q8_0     3.0 GFLOPS (128 runs)
 128 x  128: F16     11.2 GFLOPS (128 runs) | F32      8.5 GFLOPS (128 runs)
 256 x  256: Q4_0    13.5 GFLOPS (128 runs) | Q4_1     8.8 GFLOPS (128 runs) | Q4_2     9.9 GFLOPS (128 runs)
 256 x  256: Q5_0    10.7 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0     7.3 GFLOPS (128 runs)
 256 x  256: F16     18.3 GFLOPS (128 runs) | F32     10.1 GFLOPS (128 runs)
 512 x  512: Q4_0    36.4 GFLOPS (128 runs) | Q4_1    31.2 GFLOPS (117 runs) | Q4_2    19.0 GFLOPS ( 71 runs)
 512 x  512: Q5_0    18.5 GFLOPS ( 69 runs) | Q5_1    20.4 GFLOPS ( 77 runs) | Q8_0    30.7 GFLOPS (115 runs)
 512 x  512: F16     33.8 GFLOPS (126 runs) | F32     20.7 GFLOPS ( 79 runs)
1024 x 1024: Q4_0    40.0 GFLOPS ( 19 runs) | Q4_1    36.4 GFLOPS ( 18 runs) | Q4_2    29.6 GFLOPS ( 14 runs)
1024 x 1024: Q5_0    32.9 GFLOPS ( 16 runs) | Q5_1    30.6 GFLOPS ( 15 runs) | Q8_0    54.2 GFLOPS ( 26 runs)
1024 x 1024: F16     44.1 GFLOPS ( 21 runs) | F32     20.0 GFLOPS ( 10 runs)
2048 x 2048: Q4_0    57.7 GFLOPS (  4 runs) | Q4_1    47.7 GFLOPS (  3 runs) | Q4_2    38.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.8 GFLOPS (  3 runs) | Q5_1    35.1 GFLOPS (  3 runs) | Q8_0    63.6 GFLOPS (  4 runs)
2048 x 2048: F16     33.6 GFLOPS (  3 runs) | F32     14.8 GFLOPS (  3 runs)
4096 x 4096: Q4_0    61.9 GFLOPS (  3 runs) | Q4_1    50.2 GFLOPS (  3 runs) | Q4_2    38.8 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.6 GFLOPS (  3 runs) | Q5_1    37.9 GFLOPS (  3 runs) | Q8_0    70.4 GFLOPS (  3 runs)
4096 x 4096: F16     38.0 GFLOPS (  3 runs) | F32     20.8 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 134 | 1176 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 179 | 2964 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 416 | 11037 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 23462 | 36469 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 47286 | 77494 | be5911a |

System76 Pangolin (pang12) w/ Ryzen 7 6800U (8c16t) @ 2.7GHz + 32GB DDR5 at 6400MT/s
Models stored on a Samsung 970 Evo Plus

Running memcpy benchmark with 1 thread

memcpy: 11.18 GB/s
sum:    error -536870997.000000

Running ggml_mul_mat benchmark with 16 threads

ggml_mul_mat:   64 x   64: Q4_0     0.9 GFLOPS (128 runs) / Q4_1     0.4 GFLOPS (128 runs) / F16     1.2 GFLOPS (128 runs) / F32     1.2 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: Q4_0     6.1 GFLOPS (128 runs) / Q4_1     7.5 GFLOPS (128 runs) / F16     4.6 GFLOPS (128 runs) / F32    10.0 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: Q4_0    26.2 GFLOPS (128 runs) / Q4_1    42.3 GFLOPS (128 runs) / F16    19.9 GFLOPS (128 runs) / F32    47.9 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: Q4_0    66.6 GFLOPS (128 runs) / Q4_1    98.6 GFLOPS (128 runs) / F16    90.1 GFLOPS (128 runs) / F32   110.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: Q4_0    97.8 GFLOPS ( 46 runs) / Q4_1   154.3 GFLOPS ( 72 runs) / F16   158.7 GFLOPS ( 74 runs) / F32   132.2 GFLOPS ( 62 runs)
ggml_mul_mat: 2048 x 2048: Q4_0   126.7 GFLOPS (  8 runs) / Q4_1   164.8 GFLOPS ( 10 runs) / F16   164.1 GFLOPS ( 10 runs) / F32    96.4 GFLOPS (  6 runs)
ggml_mul_mat: 4096 x 4096: Q4_0   138.6 GFLOPS (  3 runs) / Q4_1   166.9 GFLOPS (  3 runs) / F16   136.0 GFLOPS (  3 runs) / F32    62.8 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ryzen 7 6800U Arch Linux AVX2 tiny 16 37 510 9c61f5f
Ryzen 7 6800U Arch Linux AVX2 base 16 51 1222 9c61f5f
Ryzen 7 6800U Arch Linux AVX2 small 16 123 4283 9c61f5f
Ryzen 7 6800U Arch Linux AVX2 medium 16 341 14178 9c61f5f
Ryzen 7 6800U Arch Linux AVX2 large 16 650 25801 9c61f5f

MacBook Air M2 24GB 2022 (CoreML model)

It is interesting to note that when the model is converted to CoreML and executed, even a MacBook Air M2 reaches a processing speed close to that of a high-spec Mac, perhaps because the Neural Engine specifications are the same within a given generation of Apple Silicon.
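For reference, this is roughly how the CoreML encoder was produced and built (a sketch from memory of the repo's CoreML support; check the script name and the WHISPER_COREML flag against the README for your checkout):

# generate the CoreML encoder for a model (example: base)
./models/generate-coreml-model.sh base
# rebuild with CoreML support enabled
make clean
WHISPER_COREML=1 make -j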

./extra/bench-all.sh 4
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 34.33 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat: 64 x 64: F16 11.4 GFLOPS (128 runs) / F32 10.5 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 89.0 GFLOPS (128 runs) / F32 74.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 422.6 GFLOPS (128 runs) / F32 419.4 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 793.4 GFLOPS (128 runs) / F32 801.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 827.0 GFLOPS (128 runs) / F32 849.3 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 821.8 GFLOPS ( 48 runs) / F32 773.4 GFLOPS ( 46 runs)
ggml_mul_mat: 4096 x 4096: F16 765.2 GFLOPS ( 6 runs) / F32 743.6 GFLOPS ( 6 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
NEON BLAS COREML tiny 4 c23588c
NEON BLAS COREML base 4 c23588c
M2 13.3.1 (a)(22E772610a) NEON BLAS COREML small 4 153 199 c23588c
M2 13.3.1 (a)(22E772610a) NEON BLAS COREML medium 4 450 746 c23588c
M2 13.3.1 (a)(22E772610a) NEON BLAS COREML large 4 1053 1439 c23588c
CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 4 2GB  Bullseye 6.1.21-v8+  OPENBLAS tiny.en 4 393  7882  14bee39
Raspberry Pi 4 2GB  Bullseye 6.1.21-v8+  OPENBLAS tiny.en-q5 4 265  8564  14bee39
Raspberry Pi 4 2GB  Bullseye 6.1.21-v8+  OPENBLAS base.en 4 571  16328  14bee39
Raspberry Pi 4 2GB  Bullseye 6.1.21-v8+  OPENBLAS base.en-q5 4 306  16169  14bee39

Tests performed using Raspberry Pi OS libopenblas-dev package (version 0.3.13+ds-3).
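For anyone reproducing this, the OpenBLAS build amounts to something like the following (a sketch; assumes the stock Makefile flag):

# install the distro OpenBLAS package and build the bench tool against it
sudo apt install libopenblas-dev
WHISPER_OPENBLAS=1 make -j bench
./extra/bench-all.sh 4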

Ryzen 3 2200GE (Lenovo M715q)

Running memcpy benchmark

memcpy: 12.14 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     5.3 GFLOPS (128 runs) | Q4_1     1.6 GFLOPS (128 runs) | Q4_2     5.2 GFLOPS (128 runs)
  64 x   64: Q5_0     5.5 GFLOPS (128 runs) | Q5_1     1.7 GFLOPS (128 runs) | Q8_0     1.7 GFLOPS (128 runs)
  64 x   64: F16      1.1 GFLOPS (128 runs) | F32      2.0 GFLOPS (128 runs)
 128 x  128: Q4_0     9.9 GFLOPS (128 runs) | Q4_1    10.8 GFLOPS (128 runs) | Q4_2     9.8 GFLOPS (128 runs)
 128 x  128: Q5_0    16.7 GFLOPS (128 runs) | Q5_1    19.0 GFLOPS (128 runs) | Q8_0    20.6 GFLOPS (128 runs)
 128 x  128: F16      9.4 GFLOPS (128 runs) | F32     29.8 GFLOPS (128 runs)
 256 x  256: Q4_0    26.1 GFLOPS (128 runs) | Q4_1    29.4 GFLOPS (128 runs) | Q4_2    31.2 GFLOPS (128 runs)
 256 x  256: Q5_0    28.4 GFLOPS (128 runs) | Q5_1    31.0 GFLOPS (128 runs) | Q8_0    32.5 GFLOPS (128 runs)
 256 x  256: F16     21.5 GFLOPS (128 runs) | F32     41.6 GFLOPS (128 runs)
 512 x  512: Q4_0    41.4 GFLOPS (128 runs) | Q4_1    42.7 GFLOPS (128 runs) | Q4_2    43.2 GFLOPS (128 runs)
 512 x  512: Q5_0    39.2 GFLOPS (128 runs) | Q5_1    37.2 GFLOPS (128 runs) | Q8_0    56.7 GFLOPS (128 runs)
 512 x  512: F16     29.3 GFLOPS (110 runs) | F32     56.0 GFLOPS (128 runs)
1024 x 1024: Q4_0    52.5 GFLOPS ( 25 runs) | Q4_1    51.6 GFLOPS ( 25 runs) | Q4_2    48.3 GFLOPS ( 23 runs)
1024 x 1024: Q5_0    44.1 GFLOPS ( 21 runs) | Q5_1    41.9 GFLOPS ( 20 runs) | Q8_0    71.4 GFLOPS ( 34 runs)
1024 x 1024: F16     30.4 GFLOPS ( 15 runs) | F32     35.5 GFLOPS ( 17 runs)
2048 x 2048: Q4_0    54.6 GFLOPS (  4 runs) | Q4_1    50.6 GFLOPS (  3 runs) | Q4_2    49.8 GFLOPS (  3 runs)
2048 x 2048: Q5_0    44.8 GFLOPS (  3 runs) | Q5_1    40.8 GFLOPS (  3 runs) | Q8_0    67.1 GFLOPS (  4 runs)
2048 x 2048: F16     29.1 GFLOPS (  3 runs) | F32     20.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    54.3 GFLOPS (  3 runs) | Q4_1    50.0 GFLOPS (  3 runs) | Q4_2    49.5 GFLOPS (  3 runs)
4096 x 4096: Q5_0    44.7 GFLOPS (  3 runs) | Q5_1    40.2 GFLOPS (  3 runs) | Q8_0    64.0 GFLOPS (  3 runs)
4096 x 4096: F16     28.3 GFLOPS (  3 runs) | F32     19.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | tiny | 4 | 68 | 1676 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | base | 4 | 96 | 3850 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | small | 4 | 235 | 14734 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | medium | 4 | 660 | 49288 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | large | 4 | 1302 | 105757 | 2b6a074 |

This is what I get with clblast on an AMD RX6700XT:

Running memcpy benchmark

memcpy: 11.94 GB/s (1 thread)
sum: -536869898.000000

Running ggml_mul_mat benchmark with 16 threads

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx1031
64 x 64: Q4_0 0.8 GFLOPS (128 runs) | Q4_1 0.8 GFLOPS (128 runs)
64 x 64: Q5_0 0.8 GFLOPS (128 runs) | Q5_1 0.8 GFLOPS (128 runs) | Q8_0 0.8 GFLOPS (128 runs)
64 x 64: F16 0.8 GFLOPS (128 runs) | F32 0.8 GFLOPS (128 runs)
128 x 128: Q4_0 5.6 GFLOPS (128 runs) | Q4_1 5.6 GFLOPS (128 runs)
128 x 128: Q5_0 6.1 GFLOPS (128 runs) | Q5_1 5.7 GFLOPS (128 runs) | Q8_0 6.1 GFLOPS (128 runs)
128 x 128: F16 5.8 GFLOPS (128 runs) | F32 6.0 GFLOPS (128 runs)
256 x 256: Q4_0 43.4 GFLOPS (128 runs) | Q4_1 40.3 GFLOPS (128 runs)
256 x 256: Q5_0 38.2 GFLOPS (128 runs) | Q5_1 39.2 GFLOPS (128 runs) | Q8_0 39.0 GFLOPS (128 runs)
256 x 256: F16 38.3 GFLOPS (128 runs) | F32 38.6 GFLOPS (128 runs)
512 x 512: Q4_0 210.9 GFLOPS (128 runs) | Q4_1 212.8 GFLOPS (128 runs)
512 x 512: Q5_0 212.0 GFLOPS (128 runs) | Q5_1 213.2 GFLOPS (128 runs) | Q8_0 210.2 GFLOPS (128 runs)
512 x 512: F16 195.5 GFLOPS (128 runs) | F32 208.7 GFLOPS (128 runs)
1024 x 1024: Q4_0 1280.6 GFLOPS (128 runs) | Q4_1 1289.0 GFLOPS (128 runs)
1024 x 1024: Q5_0 1292.2 GFLOPS (128 runs) | Q5_1 1287.4 GFLOPS (128 runs) | Q8_0 1271.0 GFLOPS (128 runs)
1024 x 1024: F16 1025.9 GFLOPS (128 runs) | F32 1227.8 GFLOPS (128 runs)
2048 x 2048: Q4_0 3423.2 GFLOPS (128 runs) | Q4_1 3414.1 GFLOPS (128 runs)
2048 x 2048: Q5_0 3393.6 GFLOPS (128 runs) | Q5_1 3385.8 GFLOPS (128 runs) | Q8_0 3385.2 GFLOPS (128 runs)
2048 x 2048: F16 2434.4 GFLOPS (128 runs) | F32 3045.8 GFLOPS (128 runs)
4096 x 4096: Q4_0 4187.6 GFLOPS ( 31 runs) | Q4_1 4193.6 GFLOPS ( 31 runs)
4096 x 4096: Q5_0 4204.3 GFLOPS ( 31 runs) | Q5_1 4187.1 GFLOPS ( 31 runs) | Q8_0 4135.0 GFLOPS ( 31 runs)
4096 x 4096: F16 3491.1 GFLOPS ( 26 runs) | F32 3911.3 GFLOPS ( 29 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Ryzen 5950X / RX6700XT Arch AVX2 BLAS tiny 16 382 603 95b02d7
Ryzen 5950X / RX6700XT Arch AVX2 BLAS base 16 371 717 95b02d7
Ryzen 5950X / RX6700XT Arch AVX2 BLAS small 16 427 1271 95b02d7
Ryzen 5950X / RX6700XT Arch AVX2 BLAS medium 16 636 2784 95b02d7
Ryzen 5950X / RX6700XT Arch AVX2 BLAS large 16 868 4308 95b02d7

Thinkpad T480, Core i7 8550U

Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 12.67 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.1 GFLOPS (128 runs) | Q4_1     6.4 GFLOPS (128 runs)
  64 x   64: Q5_0     6.6 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0     6.3 GFLOPS (128 runs)
  64 x   64: F16      7.8 GFLOPS (128 runs) | F32      5.4 GFLOPS (128 runs)
 128 x  128: Q4_0    25.3 GFLOPS (128 runs) | Q4_1    25.5 GFLOPS (128 runs)
 128 x  128: Q5_0    29.6 GFLOPS (128 runs) | Q5_1    26.9 GFLOPS (128 runs) | Q8_0    31.7 GFLOPS (128 runs)
 128 x  128: F16     34.8 GFLOPS (128 runs) | F32     13.8 GFLOPS (128 runs)
 256 x  256: Q4_0    49.9 GFLOPS (128 runs) | Q4_1    43.3 GFLOPS (128 runs)
 256 x  256: Q5_0    46.6 GFLOPS (128 runs) | Q5_1    45.4 GFLOPS (128 runs) | Q8_0    64.0 GFLOPS (128 runs)
 256 x  256: F16     61.2 GFLOPS (128 runs) | F32     18.7 GFLOPS (128 runs)
 512 x  512: Q4_0    66.7 GFLOPS (128 runs) | Q4_1    54.7 GFLOPS (128 runs)
 512 x  512: Q5_0    53.5 GFLOPS (128 runs) | Q5_1    57.9 GFLOPS (128 runs) | Q8_0    80.6 GFLOPS (128 runs)
 512 x  512: F16     65.5 GFLOPS (128 runs) | F32     22.2 GFLOPS ( 83 runs)
1024 x 1024: Q4_0    77.7 GFLOPS ( 37 runs) | Q4_1    66.9 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    66.3 GFLOPS ( 31 runs) | Q5_1    60.2 GFLOPS ( 29 runs) | Q8_0    91.6 GFLOPS ( 44 runs)
1024 x 1024: F16     63.8 GFLOPS ( 30 runs) | F32     21.2 GFLOPS ( 10 runs)
2048 x 2048: Q4_0    74.3 GFLOPS (  5 runs) | Q4_1    71.1 GFLOPS (  5 runs)
2048 x 2048: Q5_0    59.5 GFLOPS (  4 runs) | Q5_1    56.4 GFLOPS (  4 runs) | Q8_0    90.2 GFLOPS (  6 runs)
2048 x 2048: F16     49.9 GFLOPS (  3 runs) | F32     15.9 GFLOPS (  3 runs)
4096 x 4096: Q4_0    61.1 GFLOPS (  3 runs) | Q4_1    54.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    48.4 GFLOPS (  3 runs) | Q5_1    45.1 GFLOPS (  3 runs) | Q8_0    62.7 GFLOPS (  3 runs)
4096 x 4096: F16     38.4 GFLOPS (  3 runs) | F32     12.9 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |

I don't know why it stopped when it wanted to run the benchmark for all models. I have ggml-base.en.bin, and I have for-tests-ggml*.bin.

@randomshinichi That is what it does when the non-.en models are not available.
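A minimal sketch of fetching the multilingual models so the full benchmark can run (the download script ships in models/; the list of sizes here is just an example):

# fetch the non-.en models used by bench-all.sh, then rerun the benchmark
for m in tiny base small medium large; do ./models/download-ggml-model.sh $m; done
./extra/bench-all.sh 4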

Jetson Orin Nano (Developer Kit) - Unoptimised install (no CLBlast, CUBLAS etc)

Running memcpy benchmark

memcpy: 6.28 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     4.1 GFLOPS (128 runs) | Q4_1     4.2 GFLOPS (128 runs)
  64 x   64: Q5_0     4.2 GFLOPS (128 runs) | Q5_1     4.1 GFLOPS (128 runs) | Q8_0     4.6 GFLOPS (128 runs)
  64 x   64: F16      4.0 GFLOPS (128 runs) | F32      5.2 GFLOPS (128 runs)
 128 x  128: Q4_0    12.9 GFLOPS (128 runs) | Q4_1    13.2 GFLOPS (128 runs)
 128 x  128: Q5_0    12.7 GFLOPS (128 runs) | Q5_1    12.5 GFLOPS (128 runs) | Q8_0    14.1 GFLOPS (128 runs)
 128 x  128: F16      9.3 GFLOPS (128 runs) | F32     20.9 GFLOPS (128 runs)
 256 x  256: Q4_0    17.9 GFLOPS (128 runs) | Q4_1    17.5 GFLOPS (128 runs)
 256 x  256: Q5_0    17.8 GFLOPS (128 runs) | Q5_1    16.2 GFLOPS (128 runs) | Q8_0    20.3 GFLOPS (128 runs)
 256 x  256: F16     10.4 GFLOPS (128 runs) | F32     28.8 GFLOPS (128 runs)
 512 x  512: Q4_0    21.1 GFLOPS ( 79 runs) | Q4_1    20.0 GFLOPS ( 75 runs)
 512 x  512: Q5_0    18.6 GFLOPS ( 70 runs) | Q5_1    19.1 GFLOPS ( 72 runs) | Q8_0    22.0 GFLOPS ( 83 runs)
 512 x  512: F16     10.5 GFLOPS ( 40 runs) | F32     25.7 GFLOPS ( 97 runs)
1024 x 1024: Q4_0    20.6 GFLOPS ( 10 runs) | Q4_1    20.4 GFLOPS ( 10 runs)
1024 x 1024: Q5_0    20.2 GFLOPS ( 10 runs) | Q5_1    18.7 GFLOPS (  9 runs) | Q8_0    23.2 GFLOPS ( 11 runs)
1024 x 1024: F16     11.4 GFLOPS (  6 runs) | F32     16.6 GFLOPS (  8 runs)
2048 x 2048: Q4_0    22.3 GFLOPS (  3 runs) | Q4_1    22.4 GFLOPS (  3 runs)
2048 x 2048: Q5_0    22.0 GFLOPS (  3 runs) | Q5_1    20.9 GFLOPS (  3 runs) | Q8_0    25.8 GFLOPS (  3 runs)
2048 x 2048: F16     11.9 GFLOPS (  3 runs) | F32     11.5 GFLOPS (  3 runs)
4096 x 4096: Q4_0    22.7 GFLOPS (  3 runs) | Q4_1    22.6 GFLOPS (  3 runs)
4096 x 4096: Q5_0    22.2 GFLOPS (  3 runs) | Q5_1    21.0 GFLOPS (  3 runs) | Q8_0    26.2 GFLOPS (  3 runs)
4096 x 4096: F16     12.0 GFLOPS (  3 runs) | F32     13.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON tiny 4 117 3631 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON base 4 153 8603 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON small 4 323 33605 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON medium 4 1059 111404 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON large 4 3187 222130 5e2b340

@mark-beeby Are you sure everything is correct with your distro? Your results are really bad compared to what I was expecting, as I've been looking forward to seeing what an Orin Nano can do.

Check out the rk3588 results in #89 (comment), as that is a 4x A76 with DDR4, not DDR5...

Also interested in what you get with cuBLAS, or with CLBlast: https://github.com/ggerganov/whisper.cpp#opencl-gpu-support-via-clblast

Jetson Orin Nano (Developer Kit) - CUBLAS

Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 6.26 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     1.0 GFLOPS (128 runs) | Q4_1     0.9 GFLOPS (128 runs)
  64 x   64: Q5_0     0.7 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     1.0 GFLOPS (128 runs)
  64 x   64: F16      1.0 GFLOPS (128 runs) | F32      0.9 GFLOPS (128 runs)
 128 x  128: Q4_0     6.8 GFLOPS (128 runs) | Q4_1     7.3 GFLOPS (128 runs)
 128 x  128: Q5_0     7.8 GFLOPS (128 runs) | Q5_1     7.8 GFLOPS (128 runs) | Q8_0     7.8 GFLOPS (128 runs)
 128 x  128: F16      8.0 GFLOPS (128 runs) | F32      7.7 GFLOPS (128 runs)
 256 x  256: Q4_0    57.1 GFLOPS (128 runs) | Q4_1    62.5 GFLOPS (128 runs)
 256 x  256: Q5_0    62.3 GFLOPS (128 runs) | Q5_1    62.8 GFLOPS (128 runs) | Q8_0    64.6 GFLOPS (128 runs)
 256 x  256: F16     38.7 GFLOPS (128 runs) | F32     38.6 GFLOPS (128 runs)
 512 x  512: Q4_0   248.6 GFLOPS (128 runs) | Q4_1   250.9 GFLOPS (128 runs)
 512 x  512: Q5_0   250.2 GFLOPS (128 runs) | Q5_1   248.7 GFLOPS (128 runs) | Q8_0   247.8 GFLOPS (128 runs)
 512 x  512: F16    215.2 GFLOPS (128 runs) | F32    210.5 GFLOPS (128 runs)
1024 x 1024: Q4_0   884.6 GFLOPS (128 runs) | Q4_1   882.7 GFLOPS (128 runs)
1024 x 1024: Q5_0   879.2 GFLOPS (128 runs) | Q5_1   872.7 GFLOPS (128 runs) | Q8_0   632.0 GFLOPS (128 runs)
1024 x 1024: F16    651.2 GFLOPS (128 runs) | F32    627.2 GFLOPS (128 runs)
2048 x 2048: Q4_0  1349.9 GFLOPS ( 79 runs) | Q4_1  1337.1 GFLOPS ( 78 runs)
2048 x 2048: Q5_0  1332.3 GFLOPS ( 78 runs) | Q5_1  1327.7 GFLOPS ( 78 runs) | Q8_0  1304.8 GFLOPS ( 76 runs)
2048 x 2048: F16   1401.6 GFLOPS ( 82 runs) | F32   1140.0 GFLOPS ( 67 runs)
4096 x 4096: Q4_0  1967.6 GFLOPS ( 15 runs) | Q4_1  1962.9 GFLOPS ( 15 runs)
4096 x 4096: Q5_0  1956.3 GFLOPS ( 15 runs) | Q5_1  1952.7 GFLOPS ( 15 runs) | Q8_0  1929.9 GFLOPS ( 15 runs)
4096 x 4096: F16   2603.2 GFLOPS ( 19 runs) | F32   1742.4 GFLOPS ( 13 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS tiny 4 1296 544 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS base 4 1350 1015 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS small 4 1557 2901 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS medium 4 2303 7977 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS large 4 6716 12913 5e2b340

@StuartIanNaylor I struggled to get CLBlast installed and moved back to a CUDA install; after a few hiccups and setting export CUDA_VISIBLE_DEVICES=0 I got the much more favourable results above. Hope that helps!
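For reference, the CUDA build amounts to something like this (a sketch; WHISPER_CUBLAS is the Makefile flag as I remember it, so double-check the README for your checkout):

# rebuild with cuBLAS and point the runtime at the first GPU
make clean
WHISPER_CUBLAS=1 make -j bench
export CUDA_VISIBLE_DEVICES=0
./extra/bench-all.sh 4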

New desktop I built - CPU i7-13700K (turbo overclock +200MHz base), DDR5 @ 5600MT/s, GPU Intel Arc A770 LE

I tried different thread counts before settling on 20; anything past 20 resulted in a drop in performance, which is to be expected.
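The thread count is simply the first argument to the bench script, so the sweep looked roughly like this (the exact counts shown are illustrative):

# try a few thread counts and compare the Enc. times
./extra/bench-all.sh 16
./extra/bench-all.sh 20
./extra/bench-all.sh 24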

Running memcpy benchmark

memcpy: 23.16 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 20 threads


Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: Intel(R) OpenCL HD Graphics Device: Intel(R) Arc(TM) A770 Graphics
  64 x   64: Q4_0     0.9 GFLOPS (128 runs) | Q4_1     1.0 GFLOPS (128 runs)
  64 x   64: Q5_0     1.0 GFLOPS (128 runs) | Q5_1     1.0 GFLOPS (128 runs) | Q8_0     1.0 GFLOPS (128 runs)
  64 x   64: F16      1.0 GFLOPS (128 runs) | F32      1.0 GFLOPS (128 runs)
 128 x  128: Q4_0     5.6 GFLOPS (128 runs) | Q4_1     5.8 GFLOPS (128 runs)
 128 x  128: Q5_0     5.7 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     5.0 GFLOPS (128 runs)
 128 x  128: F16      5.6 GFLOPS (128 runs) | F32      5.5 GFLOPS (128 runs)
 256 x  256: Q4_0    40.4 GFLOPS (128 runs) | Q4_1    38.9 GFLOPS (128 runs)
 256 x  256: Q5_0    40.7 GFLOPS (128 runs) | Q5_1    40.3 GFLOPS (128 runs) | Q8_0    38.5 GFLOPS (128 runs)
 256 x  256: F16     40.8 GFLOPS (128 runs) | F32     40.8 GFLOPS (128 runs)
 512 x  512: Q4_0   260.5 GFLOPS (128 runs) | Q4_1   264.6 GFLOPS (128 runs)
 512 x  512: Q5_0   234.3 GFLOPS (128 runs) | Q5_1   254.8 GFLOPS (128 runs) | Q8_0   260.2 GFLOPS (128 runs)
 512 x  512: F16    223.7 GFLOPS (128 runs) | F32    261.0 GFLOPS (128 runs)
1024 x 1024: Q4_0  1158.0 GFLOPS (128 runs) | Q4_1  1158.2 GFLOPS (128 runs)
1024 x 1024: Q5_0  1119.2 GFLOPS (128 runs) | Q5_1  1157.4 GFLOPS (128 runs) | Q8_0  1125.5 GFLOPS (128 runs)
1024 x 1024: F16    871.3 GFLOPS (128 runs) | F32   1029.7 GFLOPS (128 runs)
2048 x 2048: Q4_0  2847.7 GFLOPS (128 runs) | Q4_1  2749.8 GFLOPS (128 runs)
2048 x 2048: Q5_0  2752.3 GFLOPS (128 runs) | Q5_1  2879.4 GFLOPS (128 runs) | Q8_0  2770.3 GFLOPS (128 runs)
2048 x 2048: F16   2061.0 GFLOPS (120 runs) | F32   2504.5 GFLOPS (128 runs)
4096 x 4096: Q4_0  4681.2 GFLOPS ( 35 runs) | Q4_1  4637.2 GFLOPS ( 34 runs)
4096 x 4096: Q5_0  4646.7 GFLOPS ( 34 runs) | Q5_1  4586.6 GFLOPS ( 34 runs) | Q8_0  4589.7 GFLOPS ( 34 runs)
4096 x 4096: F16   3444.7 GFLOPS ( 26 runs) | F32   4128.2 GFLOPS ( 31 runs)
CPU OS Config Model Th Load Enc. Commit
Intel Core i7-13700K Arch Linux AVX2 BLAS tiny 20 145 417 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS base 20 161 560 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS small 20 281 1072 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS medium 20 606 2771 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS large 20 1116 4105 5e2b340

CPU power draw during these last tests averaged 140 watts, peaking at 141. GPU metrics are currently not exposed in Linux for Arc, so I'm unable to check what that was drawing.