👍 Dynamic CPU detection and JIT scalar/sse/avx2 switching
100% C (C++ headers), usage as simple as memcpy
Byte Transpose
Fastest byte transpose
🆕 (2019.11) 2D,3D,4D transpose
Nibble Transpose
nearly as fast as byte transpose
more efficient and up to 10 times faster than Bitshuffle
🆕 better compression (w/ lz77) and 10 times faster than SPDP, one of the best floating-point compressors
can compress/decompress (w/ lz77) better and faster than other domain-specific floating-point compressors
Scalar and SIMD Transform
Delta encoding for sorted lists
Zigzag encoding for unsorted lists
Xor encoding
🆕 lossy floating point compression with user-defined error
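The integer transforms listed above can be illustrated with a minimal scalar sketch. The function names below (`delta_enc32`, `delta_dec32`, `zigzag_enc32`, `zigzag_dec32`, `xor_enc32`) are hypothetical and are not the library's API; the sketch only shows what each transform computes, assuming 32-bit elements and separate input and output buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only, not TurboTranspose functions. */

/* Delta: store the difference to the previous value (small for sorted lists). */
void delta_enc32(const uint32_t *in, size_t n, uint32_t *out) {
    assert(in && out);
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] - prev;      /* unsigned wraparound is well defined */
        prev = in[i];
    }
}

/* Inverse of delta_enc32: running sum of the differences. */
void delta_dec32(const uint32_t *in, size_t n, uint32_t *out) {
    assert(in && out);
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += in[i];
        out[i] = prev;
    }
}

/* Zigzag: map signed to unsigned so that 0,-1,1,-2,... become 0,1,2,3,...
   Useful for deltas of unsorted lists, which may be negative. */
uint32_t zigzag_enc32(int32_t x) {
    uint32_t u = (uint32_t)x;
    return (u << 1) ^ (0u - (u >> 31));   /* mask is all ones if x < 0 */
}

int32_t zigzag_dec32(uint32_t u) {
    return (int32_t)((u >> 1) ^ (0u - (u & 1u)));
}

/* Xor: store each value xor-ed with its predecessor
   (neighbouring values that share high bits yield leading zeros). */
void xor_enc32(const uint32_t *in, size_t n, uint32_t *out) {
    assert(in && out);
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] ^ prev;
        prev = in[i];
    }
}
```

After such a transform most output values have zero high bytes, which is the case where a subsequent byte or nibble transpose groups long runs of zeros together for the lz77 stage.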
Transpose Benchmark:
Benchmark Intel CPU: Skylake i7-6700 3.4GHz gcc 9.2 single thread
Benchmark ARM: ARMv8 A73-ODROID-N2 1.8GHz
- Speed test
Benchmark w/ 16k buffer
BOLD = pareto frontier.
E:Encode, D:Decode
./tpbench -s# file -B16K (# = 8,4,2)

| E cycles/byte | D cycles/byte | Transpose 64 bits AVX2 |
|------:|------:|------|
| .199 | .134 | TurboTranspose Byte |
| .326 | .201 | Blosc byteshuffle |
| .394 | .260 | TurboTranspose Nibble |
| .848 | .478 | Bitshuffle 8 |


| E cycles/byte | D cycles/byte | Transpose 32 bits AVX2 |
|------:|------:|------|
| .121 | .102 | TurboTranspose Byte |
| .451 | .139 | Blosc byteshuffle |
| .345 | .229 | TurboTranspose Nibble |
| .773 | .476 | Bitshuffle |


| E cycles/byte | D cycles/byte | Transpose 16 bits AVX2 |
|------:|------:|------|
| .095 | .071 | TurboTranspose Byte |
| .640 | .108 | Blosc byteshuffle |
| .329 | .198 | TurboTranspose Nibble |
| .758 | 1.177 | Bitshuffle 2 |
| .067 | .067 | memcpy |


| E MB/s | D MB/s | 16 bits ARM 2019.11 |
|------:|------:|------|
| 8192 | 16384 | TurboTranspose Byte |
| 8192 | 8192 | blosc byteshuffle |
| 1638 | 2341 | TurboTranspose Nibble |
| 356 | 287 | blosc bitshuffle |
| 16384 | 16384 | memcpy |


| E MB/s | D MB/s | 32 bits ARM 2019.11 |
|------:|------:|------|
| 8192 | 8192 | TurboTranspose Byte |
| 8192 | 8192 | blosc byteshuffle |
| 1820 | 2341 | TurboTranspose Nibble |
| 372 | 252 | blosc bitshuffle |


| E MB/s | D MB/s | 64 bits ARM 2019.11 |
|------:|------:|------|
| 4096 | 8192 | TurboTranspose Byte |
| 5461 | 5461 | blosc byteshuffle |
| 1490 | 1490 | TurboTranspose Nibble |
| 372 | 260 | blosc bitshuffle |

Transpose/Shuffle benchmark w/ large files (100MB).
MB/s: 1 MB = 1,000,000 bytes
./tpbench -s# file (# = 8,4,2)

| E MB/s | D MB/s | Transpose 16 bits AVX2 2019.11 |
|------:|------:|------|
| 9208 | 9795 | TurboTranspose Byte |
| 8382 | 7689 | Blosc byteshuffle |
| 9377 | 9584 | TurboTranspose Nibble |
| 2750 | 2530 | Blosc bitshuffle |
| 13725 | 13900 | memcpy |


| E MB/s | D MB/s | Transpose 32 bits AVX2 2019.11 |
|------:|------:|------|
| 9718 | 9713 | TurboTranspose Byte |
| 9181 | 9030 | Blosc byteshuffle |
| 8750 | 9472 | TurboTranspose Nibble |
| 2767 | 2942 | Blosc bitshuffle 4 |


| E MB/s | D MB/s | Transpose 64 bits AVX2 2019.11 |
|------:|------:|------|
| 8998 | 9573 | TurboTranspose Byte |
| 8721 | 8586 | Blosc byteshuffle 2 |
| 8252 | 9222 | TurboTranspose Nibble |
| 2711 | 2053 | Blosc bitshuffle 2 |


| E MB/s | D MB/s | 16 bits ARM 2019.11 |
|------:|------:|------|
| 872 | 3998 | TurboTranspose Byte |
| 678 | 3852 | blosc byteshuffle |
| 1365 | 2195 | TurboTranspose Nibble |
| 357 | 280 | blosc bitshuffle |
| 3921 | 3913 | memcpy |


| E MB/s | D MB/s | 32 bits ARM 2019.11 |
|------:|------:|------|
| 1828 | 3768 | TurboTranspose Byte |
| 1769 | 3713 | blosc byteshuffle |
| 1456 | 2299 | TurboTranspose Nibble |
| 374 | 243 | blosc bitshuffle |


| E MB/s | D MB/s | 64 bits ARM 2019.11 |
|------:|------:|------|
| 1793 | 3572 | TurboTranspose Byte |
| 1784 | 3544 | blosc byteshuffle |
| 1176 | 1267 | TurboTranspose Nibble |
| 331 | 203 | blosc bitshuffle |

- Compression test (transpose/shuffle+lz4)
🆕 Download IcApp, a new benchmark for TurboPFor+TurboTranspose,
for testing almost all integer and floating-point file types.
Note: Lossy compression benchmark with icapp only.
eTp4Lzt = lossy compression with allowed error = 0.0001
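As a rough illustration of what "lossy compression with allowed error" can mean, the sketch below uses plain uniform quantization with an absolute error bound. This is a generic technique chosen for illustration and is an assumption: the method actually used by icapp is not described here and may differ (for example, it may bound relative rather than absolute error). The names `quantize_abs` and `dequantize_abs` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only, not the icapp/TurboTranspose API.
   Each value is replaced by the index of the nearest multiple of 2*max_err,
   so the reconstruction error is at most max_err (up to rounding).
   Assumption: in[i] / (2*max_err) fits comfortably in a long long. */
void quantize_abs(const double *in, size_t n, long long *codes, double max_err) {
    assert(max_err > 0.0);
    double step = 2.0 * max_err;
    for (size_t i = 0; i < n; i++) {
        double q = in[i] / step;
        /* round half away from zero without needing libm */
        codes[i] = (long long)(q >= 0.0 ? q + 0.5 : q - 0.5);
    }
}

/* Reconstruct approximate values from the integer codes. */
void dequantize_abs(const long long *codes, size_t n, double *out, double max_err) {
    assert(max_err > 0.0);
    double step = 2.0 * max_err;
    for (size_t i = 0; i < n; i++)
        out[i] = (double)codes[i] * step;
}
```

The integer codes are typically much more regular than the original floating-point bit patterns, so they are good candidates for delta or zigzag encoding followed by a transpose and lz77.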
Compile:
git clone https://github.com/powturbo/TurboTranspose.git
cd TurboTranspose
Linux + Windows MingW
make
or
make AVX2=1
Windows Visual C++
nmake /f makefile.vs
or
nmake AVX2=1 /f makefile.vs
Benchmark with other libraries
Download or clone bitshuffle or blosc, then type
make AVX2=1 BLOSC=1
or
make AVX2=1 BITSHUFFLE=1
Testing:
benchmark "transpose" functions
./tpbench [-s#] [-z] file
-s# = element size in bytes, # = 2,4,8,16,... (default 4)
-z = only lz77 compression benchmark (bitshuffle package mandatory)
Function usage:
Byte transpose:
void tpenc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
void tpdec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
in : input buffer
n : number of bytes
out : output buffer
esize : element size in bytes (2,4,8,...)
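To make the byte transpose concrete, here is a minimal scalar reference sketch that follows the same argument order as `tpenc`/`tpdec`. The names `ref_tpenc` and `ref_tpdec` are hypothetical, and this is not the library's implementation (which uses SIMD code paths); the handling of trailing bytes that do not fill a whole element is also an assumption.

```c
#include <assert.h>

/* Hypothetical scalar reference, for illustration only.
   Byte b of element i moves to plane b, position i:
   all first bytes come first, then all second bytes, and so on. */
void ref_tpenc(const unsigned char *in, unsigned n,
               unsigned char *out, unsigned esize) {
    assert(esize > 0);
    unsigned count = n / esize;              /* number of whole elements */
    for (unsigned i = 0; i < count; i++)
        for (unsigned b = 0; b < esize; b++)
            out[b * count + i] = in[i * esize + b];
    /* Assumption: leftover bytes (n % esize) are copied unchanged. */
    for (unsigned k = count * esize; k < n; k++)
        out[k] = in[k];
}

/* Inverse of ref_tpenc: gather byte b of element i back from plane b. */
void ref_tpdec(const unsigned char *in, unsigned n,
               unsigned char *out, unsigned esize) {
    assert(esize > 0);
    unsigned count = n / esize;
    for (unsigned i = 0; i < count; i++)
        for (unsigned b = 0; b < esize; b++)
            out[i * esize + b] = in[b * count + i];
    for (unsigned k = count * esize; k < n; k++)
        out[k] = in[k];
}
```

With `esize` = 4 and the input bytes 1..8, the encoded buffer is 1,5,2,6,3,7,4,8. The input and output buffers must not overlap, and the output must hold at least `n` bytes.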
Nibble transpose:
void tp4enc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
void tp4dec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
in : input buffer
n : number of bytes
out : output buffer
esize : element size in bytes (2,4,8,...)
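The idea behind the nibble transpose can be sketched in the same way: each element is split into 4-bit nibbles, and nibbles with the same position are grouped into their own plane. The layout below (low nibble plane before high nibble plane, two nibbles packed per output byte) is an assumption chosen for illustration and may not match the byte order produced by `tp4enc`. The names `ref_tp4enc` and `ref_tp4dec` are hypothetical, and for simplicity the sketch requires a whole, even number of elements.

```c
#include <assert.h>

/* Hypothetical scalar reference, for illustration only.
   Nibble plane p holds nibble p of every element:
   p/2 selects the byte, p%2 selects low (0) or high (1) nibble. */
static unsigned char get_nibble(const unsigned char *buf, unsigned i,
                                unsigned esize, unsigned p) {
    unsigned char byte = buf[i * esize + p / 2];
    return (unsigned char)((p & 1) ? (byte >> 4) : (byte & 0x0F));
}

void ref_tp4enc(const unsigned char *in, unsigned n,
                unsigned char *out, unsigned esize) {
    assert(esize > 0 && n % esize == 0 && (n / esize) % 2 == 0);
    unsigned half = (n / esize) / 2;         /* output bytes per plane */
    for (unsigned p = 0; p < 2 * esize; p++)
        for (unsigned j = 0; j < half; j++) {
            unsigned char lo = get_nibble(in, 2 * j, esize, p);
            unsigned char hi = get_nibble(in, 2 * j + 1, esize, p);
            out[p * half + j] = (unsigned char)(lo | (hi << 4));
        }
}

void ref_tp4dec(const unsigned char *in, unsigned n,
                unsigned char *out, unsigned esize) {
    assert(esize > 0 && n % esize == 0 && (n / esize) % 2 == 0);
    unsigned half = (n / esize) / 2;
    for (unsigned k = 0; k < n; k++)
        out[k] = 0;                          /* nibbles are OR-ed in below */
    for (unsigned p = 0; p < 2 * esize; p++)
        for (unsigned j = 0; j < half; j++) {
            unsigned char packed = in[p * half + j];
            unsigned shift = (p & 1) ? 4 : 0;
            out[(2 * j) * esize + p / 2]     |= (unsigned char)((packed & 0x0F) << shift);
            out[(2 * j + 1) * esize + p / 2] |= (unsigned char)((packed >> 4) << shift);
        }
}
```

When the values use only a few significant bits, whole nibble planes become zero, which is finer-grained than what a byte transpose can expose to the lz77 stage.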
Environment:
OS/Compiler (64 bits):
Linux: GNU GCC (>=4.6)
Linux: Clang (>=3.2)
Windows: MinGW-w64 makefile
Windows: Visual C++ (>=VS2008) - makefile.vs (for nmake)
Windows: Visual Studio project file - vs/vs2017 - Thanks to PavelP