ARM-software/ComputeLibrary

How to parallelize the SGEMM example across many threads?

FabianSchuetze opened this issue · 1 comments

I am benchmarking the reference Neon SGEMM implementation. The operation appears to run at ~40-50 GFLOPS:

benchmark_neon_sgemm --iterations=10 --example_args=2048,2048,2048
Version = arm_compute_version=v24.06 Build options: {'toolchain_prefix': 'aarch64-linux-android33-', 'neon': '1', 'opencl': '0', 'arch': 'armv8.6-a', 'build': 'cross_compile', 'os': 'android', 'benchmark_tests': '1', 'embed_kernels': '0', 'validation_tests': '1', 'benchmark_examples': '1'} Git hash=b'505adb91d40e05b3f80a075a4467a78a253395e1'
CommandLine = ./benchmark_neon_sgemm --iterations=10 --example_args=2048,2048,2048 
Iterations = 10
Running [0] 'Examples/benchmark_neon_sgemm'
Wall clock/Wall clock time:    AVG=206853.5556 us, STDDEV=1.80 %, MIN=200307.0000 us, MAX=212690.0000 us, MEDIAN=207170.0000 us

There are 2048**3 multiply-accumulate operations involved in the calculation (counting each as one flop), and the operation takes ~200 ms, which makes for ~42 GFLOPS. Smaller kernels run faster, but never cross the 50 GFLOPS mark.
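The estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the helper names `gemm_macs` and `giga_ops_per_second` are hypothetical and not part of ComputeLibrary. It counts one operation per multiply-accumulate, as in the estimate above; the more common convention of two flops per multiply-accumulate (one multiply, one add) doubles the figure.

```cpp
#include <cstdint>

// Hypothetical helper: number of multiply-accumulate (MAC) operations
// in an M x N x K matrix product.
constexpr double gemm_macs(std::uint64_t m, std::uint64_t n, std::uint64_t k)
{
    return static_cast<double>(m) * static_cast<double>(n) * static_cast<double>(k);
}

// Hypothetical helper: throughput in giga-operations per second, given an
// operation count and a wall-clock time in microseconds.
constexpr double giga_ops_per_second(double ops, double time_us)
{
    return ops / (time_us * 1e-6) / 1e9;
}
```

With the reported average of 206853.5556 us, `giga_ops_per_second(gemm_macs(2048, 2048, 2048), 206853.5556)` gives roughly 41.5, and passing twice the MAC count gives roughly 83.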

That seems a bit slow to me. Across how many cores and how many sockets is the work split, and how can I influence the parallelism? In the examples/neon_sgemm.cpp file, I do not see any options to parallelize the work.

Edit
I see that the work in fact runs on only one core:

 [CORE][03-07-2024 09:45:50][INFO]  "Set CPPScheduler to Linear mode, with 1 threads to use\n"

I have rebuilt the entire library with cppthreads=1 or openmp=1, but the work still runs on only one thread. How can I extend the parallelization? The test system has several cores, and I would like to use them all. The topology (the cache sizes seem to be off) is:

Machine (7221MB total)
  L3 L#0 (0KB)
    NUMANode L#0 (P#0 7221MB)
    Package L#0
      L2 L#0 (0KB) + L1d L#0 (0KB) + L1i L#0 (0KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (0KB)
        L1d L#1 (0KB) + L1i L#1 (0KB) + Core L#1 + PU L#1 (P#1)
        L1d L#2 (0KB) + L1i L#2 (0KB) + Core L#2 + PU L#2 (P#2)
    Package L#1
      L2 L#2 (0KB) + L1d L#3 (0KB) + L1i L#3 (0KB) + Core L#3 + PU L#3 (P#3)
      L2 L#3 (0KB) + L1d L#4 (0KB) + L1i L#4 (0KB) + Core L#4 + PU L#4 (P#4)
      L2 L#4 (0KB) + L1d L#5 (0KB) + L1i L#5 (0KB) + Core L#5 + PU L#5 (P#5)
      L2 L#5 (0KB) + L1d L#6 (0KB) + L1i L#6 (0KB) + Core L#6 + PU L#6 (P#6)
    Package L#2 + L2 L#6 (0KB) + L1d L#7 (0KB) + L1i L#7 (0KB) + Core L#7 + PU L#7 (P#7)

I think I figured it out. The Scheduler class offers a static function to get the active scheduler, on which the number of threads can be set. In the do_setup function, one can write:

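        // Currently active scheduler type (not needed to set the thread count).
        Scheduler::Type scheduler_t = Scheduler::get_type();
        // Get the active scheduler and request 8 worker threads.
        IScheduler& scheduler = Scheduler::get();
        scheduler.set_num_threads(8);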

which is then confirmed by the logs:

[CORE][03-07-2024 10:31:35][INFO]  "Set CPPScheduler to Linear mode, with 8 threads to use\n"
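Rather than hard-coding 8, the thread count could be derived from the machine. The following is a minimal sketch using only the C++ standard library; `pick_num_threads` is a hypothetical helper, not a ComputeLibrary function. Its result would be passed to `scheduler.set_num_threads(...)` as in the snippet above. Note that `std::thread::hardware_concurrency()` counts every logical core, so on a heterogeneous SoC like the one shown in the topology, using all of them may not be the fastest choice; the optional cap allows experimenting with fewer threads.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: choose a worker-thread count from the hardware,
// optionally capped by max_threads (0 means "no cap").
unsigned int pick_num_threads(unsigned int max_threads = 0)
{
    // hardware_concurrency() may return 0 when the value cannot be determined.
    unsigned int hw = std::thread::hardware_concurrency();
    if (hw == 0)
    {
        hw = 1; // fall back to a single thread
    }
    return (max_threads == 0) ? hw : std::min(hw, max_threads);
}
```

For example, `scheduler.set_num_threads(pick_num_threads());` would request one thread per logical core, while `pick_num_threads(4)` would request at most four.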