How can I benchmark GEMMs with `arm_compute_benchmark`?
FabianSchuetze opened this issue · 1 comments
Thanks for the wonderful library. Apologies if this seems to be a silly question:
How can I benchmark GEMMs on an Android target?
Following the docs for the tests, I ran the following on my target (output included):
gts9:/data/local/tmp $ LD_LIBRARY_PATH=$PWD ./arm_compute_benchmark --mode=precommit
Version = arm_compute_version=v24.06 Build options: {'toolchain_prefix': 'aarch64-linux-android33-', 'opencl': '0', 'arch': 'armv8.6-a', 'build': 'cross_compile', 'os': 'android', 'benchmark_tests': '1', 'embed_kernels': '1'} Git hash=b'505adb91d40e05b3f80a075a4467a78a253395e1'
CommandLine = ./arm_compute_benchmark
Seed = 1148538226
cpu_has_sve = false
cpu_has_sve2 = false
cpu_has_svef32mm = false
cpu_has_svei8mm = false
cpu_has_svebf16 = false
cpu_has_sme = false
cpu_has_sme2 = false
cpu_has_fp16 = true
cpu_has_bf16 = true
cpu_has_dotprod = true
cpu_has_i8mm = true
CPU0 = A510
CPU1 = A510
CPU2 = A510
CPU3 = GENERIC
CPU4 = GENERIC
CPU5 = GENERIC
CPU6 = GENERIC
CPU7 = GENERIC
Iterations = 1
Threads = 1
Dataset mode = PRECOMMIT
Running [0] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1892.0000 us
Running [1] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1894.0000 us
Running [2] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1939.0000 us
Running [3] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3446.0000 us
Running [4] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3439.0000 us
Running [5] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3836.0000 us
Running [6] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1172.0000 us
Running [7] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1121.0000 us
Running [8] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1133.0000 us
Running [9] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3530.0000 us
Running [10] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3548.0000 us
Running [11] 'NEON/Scale/RunSmall@Shape=640,480:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=2594.0000 us
Running [12] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1113.0000 us
Running [13] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1114.0000 us
Running [14] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1279.0000 us
Running [15] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=2387.0000 us
Running [16] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=2317.0000 us
Running [17] 'NEON/Scale/RunSmall@Shape=640,480:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1500.0000 us
Running [18] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1644.0000 us
Running [19] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1622.0000 us
Running [20] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1622.0000 us
Running [21] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3305.0000 us
Running [22] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3277.0000 us
Running [23] 'NEON/Scale/RunSmall@Shape=800,600:DataType=U8:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3040.0000 us
Running [24] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1417.0000 us
Running [25] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1349.0000 us
Running [26] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1379.0000 us
Running [27] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3388.0000 us
Running [28] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3346.0000 us
Running [29] 'NEON/Scale/RunSmall@Shape=800,600:DataType=S16:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=2433.0000 us
Running [30] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1365.0000 us
Running [31] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1358.0000 us
Running [32] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1365.0000 us
Running [33] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3183.0000 us
Running [34] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=3251.0000 us
Running [35] 'NEON/Scale/RunSmall@Shape=800,600:DataType=F32:DataLayout=NCHW:InterpolationPolicy=BILINEAR:BorderMode=REPLICATE:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=2341.0000 us
Executed 36 test(s) (36 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 0 second(s)
gts9:/data/local/tmp $
However, only `Scale` benchmarks seem to be run.
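To check quickly which suites a run actually covered, the log can be summarized with a small script. The following is a minimal sketch that assumes the log format shown above (`Running [n] '<suite>@<params>'` followed by a `Wall clock/Wall clock time: AVG=... us` line); the function name `summarize` is made up for illustration and is not part of the library.

```python
import re
from collections import defaultdict

# Matches e.g.: Running [0] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:...'
RUNNING_RE = re.compile(r"Running \[\d+\] '([^@']+)")
# Matches e.g.: Wall clock/Wall clock time: AVG=1892.0000 us
TIME_RE = re.compile(r"Wall clock/Wall clock time:\s+AVG=([0-9.]+) us")


def summarize(log_text):
    """Return {suite_name: [avg_us, ...]} for every benchmark in log_text."""
    suites = defaultdict(list)
    current = None  # suite name of the most recent "Running" line
    for line in log_text.splitlines():
        running = RUNNING_RE.search(line)
        if running:
            current = running.group(1)
            continue
        timing = TIME_RE.search(line)
        if timing and current is not None:
            suites[current].append(float(timing.group(1)))
            current = None
    return dict(suites)


# Two entries copied from the log above.
sample = """\
Running [0] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=UNDEFINED:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1892.0000 us
Running [1] 'NEON/Scale/RunSmall@Shape=640,480:DataType=U8:DataLayout=NCHW:InterpolationPolicy=NEAREST_NEIGHBOR:BorderMode=CONSTANT:SamplingPolicy=CENTER'
Wall clock/Wall clock time: AVG=1894.0000 us
"""

for suite, times in summarize(sample).items():
    print(f"{suite}: {len(times)} case(s), mean {sum(times) / len(times):.1f} us")
```

Running it over the full log would list a single suite, `NEON/Scale/RunSmall`, which confirms that no GEMM benchmark was executed.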
I am interested in running Int8 GEMMs (with an Int32 accumulator) and obtaining the throughput (GOPS) my target supports. I would like to use all cores on my system. Ideally, I would like to test the `SMMLA` (`UMMLA`) instructions.
I built the `arm_compute_benchmark` binary with the following command:
CC=clang CXX=clang++ scons -j8 toolchain_prefix=aarch64-linux-android33- opencl=0 arch=armv8.6-a build=cross_compile os=android benchmark_tests=1 embed_kernels=1 neon=1
I also had to slightly modify the `SConstruct` file; the patch is below (a bit hacky, but I'm only interested in cross-compilation):
diff --git a/SConstruct b/SConstruct
index bad85e503d..5282a8d537 100644
--- a/SConstruct
+++ b/SConstruct
@@ -418,9 +418,10 @@ if env['os'] == 'windows':
     env['AR'] = "llvm-lib"
     env['RANLIB'] = "llvm-ranlib"
 else:
-    env['AR'] = toolchain_prefix + "ar"
+    # env['AR'] = toolchain_prefix + "clang++"
+    env['AR'] = "llvm-ar"
 
-env['RANLIB'] = toolchain_prefix + "ranlib"
+env['RANLIB'] = "llvm-ranlib"
 
 print("Using compilers:")
 print("CC", env['CC'])
I finally figured it out.
The library needs to be built with the additional option `benchmark_examples`, and the test is then run on the device with:
./benchmark_neon_sgemm --iterations=100 --example_args=2048,2048,2048
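The reported average time can then be converted into a throughput figure. The sketch below assumes that the three `--example_args` values are the GEMM dimensions M, N and K, that the benchmark reports an average wall-clock time in microseconds as in the log above, and that one GEMM costs 2·M·N·K operations (one multiply and one add per inner-product term). The helper `gemm_gops` and the timing value are hypothetical, not measured results.

```python
def gemm_gops(m, n, k, avg_time_us):
    """Throughput in giga-operations per second for one M x N x K GEMM."""
    ops = 2 * m * n * k               # one multiply + one add per inner-product term
    seconds = avg_time_us * 1e-6      # microseconds -> seconds
    return ops / seconds / 1e9


# Hypothetical timing for illustration only; substitute the AVG value
# printed by the benchmark on the device.
example_time_us = 100_000.0
print(f"{gemm_gops(2048, 2048, 2048, example_time_us):.2f} GOPS")
```

For a floating-point GEMM such as `sgemm` the same number reads as GFLOPS; for an Int8 GEMM it is an integer operation count, so GOPS is the more accurate label.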