Berenger Bramas - Inria - berenger.bramas@inria.fr
Functions to sorts arrays of integers or doubles, or arrays of keys/values of integers (in the format of two arrays of integers or one array of pairs of integers) using the ARM SVE intrinsics. The code provides sort functions to small arrays (<= 16 SVE vectors), for larger arrays, for partitioning arrays, or to sort in parallel. The code also have two versions, one is following the SVE phylosophy by working for any vector size (known at runtime), whereas the second version (postfix "512") works only if the vectors are of size 512bits. This 512bits version is a pure copy of the AVX512 sort (code: https://gitlab.inria.fr/bramas/avx-512-sort paper: http://thesai.org/Publications/ViewPaper?Volume=8&Issue=10&Code=IJACSA&SerialNo=44).
The paper that describes the current ARM SVE sort is available as a preprint at: https://hal.inria.fr/hal-03227631
CPUs with ARM SVE feature are rare. Therefore, if you would like to compile a code with SVE intrinsics you can use the scalar and portable implementation of SVE https://gitlab.inria.fr/bramas/farm-sve .
In the current code, you will have to rename farm_sve.h
into arm_sve.h
(or create a link) and include the directory (usually -I.
if the file is located in the current directory).
Of course, the performance will mean nothing, but that will allow to test the code or update it without having the right CPU.
The CI file .gitlab-ci.yml
uses this mechanism.
The headers have no dependencies but needs OpenMP to provide the parallel sorting functions.
A script named build-script.sh
can be used to generate the binaries.
Building by hand can be done with:
# Set correct arch and compiler
MARCH="armv8.2-a"
CXX=armclang++
# Build
$CXX -DNDEBUG -O3 -march=$MARCH+sve -fopenmp sortSVEperf.cpp -o sortSVEperf.exe
$CXX -DNDEBUG -O3 -march=$MARCH+sve -fopenmp sortSVEtest.cpp -o sortSVEtest.exe
- arm/20.3 (armclang++): works
- gcc/11-20210321 (g++): cannot compile (the backend complains about unknown instructions and crashes)
- fujitsu-compiler/4.3.1 (FCC): cannot compile (internal compiler error with old code version, and unsupported tuple management with new code version)
- cce-sve/10.0.1 (CC): cannot compile "Unable to find cray-libsci/20.09.1.1 libraries compatible with cce/10.0.1." but cce is there! Note that the arch selection is different (-hcpu=native)
For namespace SortSVE
(sortSVE.hpp) and SortSVE512
(sortSVE512.hpp):
- SortSVE::Sort(); to sort an array
- SortSVE::SortOmpPartition(); to sort in parallel
- SortSVE::SortOmpMerge(); to sort in parallel
- SortSVE::SortOmpMergeDeps(); to sort in parallel (advised version)
- SortSVE::SortOmpParMerge(); to sort in parallel
- SortSVE::PartitionSVE(); to partition
- SortSVE::SmallSort16V(); to sort a small array (should be less than 16 AVX512 vectors)
Each function supports
int
anddouble
.
For namespace SortSVEkv
(sortSVEkv.hpp) and SortSVEkv512
(sortSVEkv512.hpp):
- SortSVE::Sort(); to sort an array
- SortSVE::PartitionSVE(); to partition
- SortSVE::SmallSort16V(); to sort a small array (should be less than 16 AVX512 vectors)
Each function supports
int
for both the keys and the values. Note thatSortSVEkv
supports both two arrays for keys/values or one array of pairs.
Refer to the preprint https://hal.inria.fr/hal-03227631:
@unpublished{bramas:hal-03227631,
TITLE = {{A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)}},
AUTHOR = {Bramas, B{\'e}renger},
URL = {https://hal.inria.fr/hal-03227631},
NOTE = {working paper or preprint},
YEAR = {2021},
MONTH = May,
PDF = {https://hal.inria.fr/hal-03227631/file/svesort.pdf},
HAL_ID = {hal-03227631},
HAL_VERSION = {v1},
}