/arm-deinterleaving-loads

Benchmarking Arm SIMD de-interleaving loads against scalar instructions.

Primary language: Assembly. License: GNU Lesser General Public License v3.0 (LGPL-3.0).

Benchmarking Arm SIMD de-interleaving loads

About

This repository implements minimal code examples that measure the performance of Arm SIMD (NEON and SVE) de-interleaving load instructions on 64-bit floating-point data: ld2X, ld3X and ld4X. The assembly code is instrumented to measure the elapsed cycles of each routine. We also use the nanobench library to check the stability of the benchmarks and to retrieve additional performance counters.
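For readers unfamiliar with the term, here is a minimal C++ sketch of what a 2-element de-interleaving load does. It is an illustration only, not the repository's assembly: the names `Deinterleaved2`, `deinterleave2_scalar` and `deinterleave2_neon` are hypothetical, and the NEON path assumes an AArch64 toolchain that provides `arm_neon.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical container (not from the repository): the two streams obtained
// from an interleaved buffer {x0, y0, x1, y1, ...}.
struct Deinterleaved2 {
    std::vector<double> xs;
    std::vector<double> ys;
};

// Scalar reference: copies one (x, y) pair per iteration.
inline Deinterleaved2 deinterleave2_scalar(const std::vector<double>& src) {
    assert(src.size() % 2 == 0);
    const std::size_t n = src.size() / 2;
    Deinterleaved2 out{std::vector<double>(n), std::vector<double>(n)};
    for (std::size_t i = 0; i < n; ++i) {
        out.xs[i] = src[2 * i];
        out.ys[i] = src[2 * i + 1];
    }
    return out;
}

#if defined(__aarch64__) && defined(__ARM_NEON)
// NEON variant: vld2q_f64 corresponds to an ld2 load. It reads four doubles
// and puts the even-indexed ones in val[0] and the odd-indexed ones in val[1].
inline Deinterleaved2 deinterleave2_neon(const std::vector<double>& src) {
    assert(src.size() % 2 == 0);
    const std::size_t n = src.size() / 2;
    Deinterleaved2 out{std::vector<double>(n), std::vector<double>(n)};
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        const float64x2x2_t v = vld2q_f64(src.data() + 2 * i);
        vst1q_f64(out.xs.data() + i, v.val[0]);
        vst1q_f64(out.ys.data() + i, v.val[1]);
    }
    for (; i < n; ++i) {  // scalar tail when the number of pairs is odd
        out.xs[i] = src[2 * i];
        out.ys[i] = src[2 * i + 1];
    }
    return out;
}
#endif
```

Calling `deinterleave2_scalar` on `{1, 10, 2, 20, 3, 30}` yields `xs = {1, 2, 3}` and `ys = {10, 20, 30}`. The 3- and 4-element cases follow the same pattern with a stride of 3 or 4.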

Usage

Pre-requisites

  • C++17 conforming compiler
  • CMake 3.16+

Build

cmake -S . -B build
cmake --build build

Run

./build/src/bench

Results

| 2-element de-interleaving | ns/op | op/s | err% | ins/op | cyc/op | cyc/op (instrument) | IPC | total runtime (s) |
|---|---|---|---|---|---|---|---|---|
| Scalar AArch64 | 106,791.62 | 9,364.03 | 0.6% | 917,516.11 | 269,140.54 | 270,364 | 3.409 | 0.37 |
| NEON | 675,631.32 | 1,480.10 | 3.9% | 3,145,746.11 | 1,736,985.47 | 1,987,030 | 1.811 | 2.33 |
| SVE | 642,586.97 | 1,556.21 | 3.4% | 1,835,023.13 | 1,646,635.45 | 1,541,501 | 1.114 | 2.19 |

| 3-element de-interleaving | ns/op | op/s | err% | ins/op | cyc/op | cyc/op (instrument) | IPC | total runtime (s) |
|---|---|---|---|---|---|---|---|---|
| Scalar AArch64 | 156,505.43 | 6,389.55 | 0.2% | 1,179,661.13 | 398,217.41 | 396,931 | 2.962 | 0.54 |
| NEON | 1,228,179.69 | 814.21 | 1.9% | 3,670,036.13 | 3,158,617.08 | 3,244,919 | 1.162 | 4.19 |
| SVE | 1,215,011.03 | 823.04 | 2.3% | 3,670,036.13 | 3,124,119.23 | 3,895,886 | 1.175 | 4.16 |

| 4-element de-interleaving | ns/op | op/s | err% | ins/op | cyc/op | cyc/op (instrument) | IPC | total runtime (s) |
|---|---|---|---|---|---|---|---|---|
| Scalar AArch64 | 267,079.80 | 3,744.20 | 0.3% | 1,441,807.13 | 685,236.43 | 687,992 | 2.104 | 0.91 |
| NEON | 2,100,981.67 | 475.97 | 1.3% | 4,194,326.13 | 5,407,170.85 | 5,731,691 | 0.776 | 7.20 |
| SVE | 2,086,051.44 | 479.37 | 1.6% | 4,194,326.13 | 5,372,054.24 | 5,181,326 | 0.781 | 7.14 |

In every tested case, the scalar implementation is faster than both SIMD variants: NEON and SVE take roughly 6 to 8 times as many cycles per operation (cyc/op).
Different buffer sizes may impact performance differently, so test as you wish. :)
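
To make the derived columns easier to check, the following sketch recomputes IPC and the SIMD-to-scalar cycle ratio from the 2-element results. The `Row` struct and the helper names are hypothetical and exist only for this illustration; the numbers are copied from the ins/op and cyc/op columns above.

```cpp
#include <cassert>

// Hypothetical record holding two columns of one results row.
struct Row {
    double ins_per_op;  // instructions per operation (ins/op)
    double cyc_per_op;  // cycles per operation (cyc/op)
};

// IPC is instructions retired divided by cycles elapsed.
inline double ipc(const Row& r) {
    assert(r.cyc_per_op > 0.0);
    return r.ins_per_op / r.cyc_per_op;
}

// How many times more cycles `simd` needs than `scalar` for the same work.
inline double slowdown(const Row& scalar, const Row& simd) {
    assert(scalar.cyc_per_op > 0.0);
    return simd.cyc_per_op / scalar.cyc_per_op;
}

// Values copied from the 2-element de-interleaving results.
constexpr Row kScalar2{917516.11, 269140.54};
constexpr Row kNeon2{3145746.11, 1736985.47};
constexpr Row kSve2{1835023.13, 1646635.45};
```

`ipc(kScalar2)` evaluates to about 3.409 and `ipc(kNeon2)` to about 1.811, matching the IPC column. `slowdown(kScalar2, kNeon2)` is a little under 6.5, so the SIMD routines both retire more instructions and sustain a lower IPC than the scalar loop.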