How to parallelize the SGEMM example across many threads?
FabianSchuetze opened this issue · 1 comments
I am benchmarking the reference NEON SGEMM example. It seems that the operation runs at ~40-50 GFLOPS:
benchmark_neon_sgemm --iterations=10 --example_args=2048,2048,2048
Version = arm_compute_version=v24.06 Build options: {'toolchain_prefix': 'aarch64-linux-android33-', 'neon': '1', 'opencl': '0', 'arch': 'armv8.6-a', 'build': 'cross_compile', 'os': 'android', 'benchmark_tests': '1', 'embed_kernels': '0', 'validation_tests': '1', 'benchmark_examples': '1'} Git hash=b'505adb91d40e05b3f80a075a4467a78a253395e1'
CommandLine = ./benchmark_neon_sgemm --iterations=10 --example_args=2048,2048,2048
Iterations = 10
Running [0] 'Examples/benchmark_neon_sgemm'
Wall clock/Wall clock time: AVG=206853.5556 us, STDDEV=1.80 %, MIN=200307.0000 us, MAX=212690.0000 us, MEDIAN=207170.0000 us
There are 2048**3 multiply-add operations involved in the calculation and the operation takes ~200 ms, which makes for ~42 GFLOPS (counting each multiply-add as one operation). Smaller problem sizes run faster, but never cross the 50 GFLOPS mark.
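The arithmetic behind that estimate can be written down as a small sketch. The function names below are hypothetical and are not part of the Compute Library. The `ops_per_mac` parameter is an assumption added to make the counting convention explicit: 1 matches the counting used in this post, while 2 counts the multiply and the add separately, as many GEMM benchmarks do.

```cpp
#include <cstdint>

// Number of multiply-accumulate steps in an (M x K) * (K x N) matrix product.
constexpr double gemm_macs(std::uint64_t m, std::uint64_t n, std::uint64_t k)
{
    return static_cast<double>(m) * static_cast<double>(n) * static_cast<double>(k);
}

// Throughput in units of 1e9 operations per second.
// ops_per_mac = 1 counts each multiply-accumulate once (as in this post);
// ops_per_mac = 2 counts the multiply and the add as separate operations.
constexpr double giga_ops_per_second(double macs, double microseconds, double ops_per_mac)
{
    return macs * ops_per_mac / (microseconds * 1e-6) / 1e9;
}
```

Calling `giga_ops_per_second(gemm_macs(2048, 2048, 2048), 206853.5556, 1.0)` with the average wall clock time from the log gives roughly 41.5, and twice that with `ops_per_mac = 2.0`.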
That seems a bit slow to me. Across how many cores and how many sockets is the work split? How can I influence the parallelism of the work? In the examples/neon_sgemm.cpp file, I do not see any options to parallelize the work.
Edit
I see that the work in fact runs on only one core:
[CORE][03-07-2024 09:45:50][INFO] "Set CPPScheduler to Linear mode, with 1 threads to use\n"
I have rebuilt the entire library with cppthreads=1 or openmp=1, but the work is still parallelized across only one thread. How can I extend the parallelization? The test system has several cores, and I would like to use them all. The topology (the cache sizes seem to be off) is:
Machine (7221MB total)
L3 L#0 (0KB)
NUMANode L#0 (P#0 7221MB)
Package L#0
L2 L#0 (0KB) + L1d L#0 (0KB) + L1i L#0 (0KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (0KB)
L1d L#1 (0KB) + L1i L#1 (0KB) + Core L#1 + PU L#1 (P#1)
L1d L#2 (0KB) + L1i L#2 (0KB) + Core L#2 + PU L#2 (P#2)
Package L#1
L2 L#2 (0KB) + L1d L#3 (0KB) + L1i L#3 (0KB) + Core L#3 + PU L#3 (P#3)
L2 L#3 (0KB) + L1d L#4 (0KB) + L1i L#4 (0KB) + Core L#4 + PU L#4 (P#4)
L2 L#4 (0KB) + L1d L#5 (0KB) + L1i L#5 (0KB) + Core L#5 + PU L#5 (P#5)
L2 L#5 (0KB) + L1d L#6 (0KB) + L1i L#6 (0KB) + Core L#6 + PU L#6 (P#6)
Package L#2 + L2 L#6 (0KB) + L1d L#7 (0KB) + L1i L#7 (0KB) + Core L#7 + PU L#7 (P#7)
I think I figured it out. The Scheduler class offers a static function, Scheduler::get(), that returns the active scheduler, whose thread count can then be set. In the do_setup function, one can write:
// Fetch the scheduler the library is currently using.
IScheduler &scheduler = Scheduler::get();
// Use all 8 cores of the test system.
scheduler.set_num_threads(8);
which is then confirmed by the logs:
[CORE][03-07-2024 10:31:35][INFO] "Set CPPScheduler to Linear mode, with 8 threads to use\n"
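The hard-coded 8 matches this particular machine. As a minimal sketch, the count could instead be derived from what the C++ runtime reports. The helper `choose_num_threads` below is hypothetical and not part of the Compute Library; only `Scheduler::get()` and `set_num_threads()` in the trailing comment are the library calls used above.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: pick a thread count for the scheduler.
// requested == 0 means "use every hardware thread the runtime reports".
// `detected` defaults to std::thread::hardware_concurrency() and can be
// overridden, which also makes the helper easy to test.
unsigned int choose_num_threads(unsigned int requested,
                                unsigned int detected = std::thread::hardware_concurrency())
{
    // hardware_concurrency() may return 0 when the count cannot be determined.
    const unsigned int available = std::max(detected, 1u);
    if (requested == 0)
    {
        return available;
    }
    // Never ask for more threads than the hardware offers.
    return std::min(requested, available);
}

// Inside do_setup(), with "arm_compute/runtime/Scheduler.h" included:
//     arm_compute::Scheduler::get().set_num_threads(choose_num_threads(0));
```

Whether using every core is actually fastest on a system with mixed core types, like the three packages in the topology above, is something to confirm by benchmarking a few values.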