Turbo-Transpose

Transpose: SIMD Integer+Floating Point Compression Filter

Integer + Floating Point Compression Filter

  • Fastest transpose/shuffle
    • 🆕 (2019.11) All TurboTranspose functions are now available on 64-bit ARMv8, including NEON SIMD.
    • Byte/Nibble transpose/shuffle for improving compression of binary data (ex. floating point data)
    • Scalar/SIMD Transpose/Shuffle 8,16,32,64,... bits
    • 👍 Dynamic CPU detection and JIT scalar/sse/avx2 switching
    • 100% C (C++ headers), usage as simple as memcpy
  • Byte Transpose
    • Fastest byte transpose
    • 🆕 (2019.11) 2D,3D,4D transpose
  • Nibble Transpose
    • nearly as fast as byte transpose
    • more efficient and up to 10 times faster than Bitshuffle
    • 🆕 better compression (w/ lz77) and 10 times faster than SPDP, one of the best floating-point compressors
    • compresses and decompresses (w/ lz77) better and faster than other domain-specific floating-point compressors
  • Scalar and SIMD Transform
    • Delta encoding for sorted lists
    • Zigzag encoding for unsorted lists
    • Xor encoding
    • 🆕 lossy floating point compression with user-defined error
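
The delta, zigzag and xor transforms listed above can be illustrated with a minimal scalar sketch. The function names below (`delta_enc32`, `zigzag_enc32`, `xor_enc32` and their decoders) are hypothetical and chosen for this example only; they are not the library's API and say nothing about how its SIMD code is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper names for illustration; not the TurboTranspose API. */

/* Delta: replace each value by its difference from the previous one.
   On a sorted list the differences are small non-negative numbers. */
void delta_enc32(uint32_t *v, size_t n) {
    assert(v != NULL || n == 0);
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cur = v[i];
        v[i] = cur - prev;
        prev = cur;
    }
}

void delta_dec32(uint32_t *v, size_t n) {
    assert(v != NULL || n == 0);
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += v[i];          /* running sum restores the original value */
        v[i] = prev;
    }
}

/* Zigzag: map signed values to unsigned so that small magnitudes get
   small codes (0,-1,1,-2,2 -> 0,1,2,3,4). Useful for the signed
   differences of an unsorted list. */
uint32_t zigzag_enc32(int32_t x) {
    uint32_t u = (uint32_t)x;
    return (u << 1) ^ (0u - (u >> 31));
}

int32_t zigzag_dec32(uint32_t u) {
    return (int32_t)((u >> 1) ^ (0u - (u & 1u)));
}

/* Xor: replace each value by its xor with the previous one.
   Neighbouring values that share their high bits leave mostly zero bits. */
void xor_enc32(uint32_t *v, size_t n) {
    assert(v != NULL || n == 0);
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cur = v[i];
        v[i] = cur ^ prev;
        prev = cur;
    }
}

void xor_dec32(uint32_t *v, size_t n) {
    assert(v != NULL || n == 0);
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev ^= v[i];
        v[i] = prev;
    }
}
```

Each transform is lossless and runs in place; the aim is to produce many small values or zero bytes that a following transpose and lz77 stage can compress well.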

Transpose Benchmark:

  • Benchmark Intel CPU: Skylake i7-6700 3.4GHz gcc 9.2 single thread
  • Benchmark ARM: ARMv8 A73-ODROID-N2 1.8GHz

- Speed test

Benchmark w/ 16k buffer

BOLD = pareto frontier.
E:Encode, D:Decode

    ./tpbench -s# file -B16K   (# = 8,4,2)
E cycles/byte D cycles/byte Transpose 64 bits AVX2
.199 .134 TurboTranspose Byte
.326 .201 Blosc byteshuffle
.394 .260 TurboTranspose Nibble
.848 .478 Bitshuffle 8
E cycles/byte D cycles/byte Transpose 32 bits AVX2
.121 .102 TurboTranspose Byte
.451 .139 Blosc byteshuffle
.345 .229 TurboTranspose Nibble
.773 .476 Bitshuffle
E cycles/byte D cycles/byte Transpose 16 bits AVX2
.095 .071 TurboTranspose Byte
.640 .108 Blosc byteshuffle
.329 .198 TurboTranspose Nibble
.758 1.177 Bitshuffle 2
.067 .067 memcpy

E MB/s D MB/s 16 bits ARM 2019.11
8192 16384 TurboTranspose Byte
8192 8192 blosc byteshuffle
1638 2341 TurboTranspose Nibble
356 287 blosc bitshuffle
16384 16384 memcpy
E MB/s D MB/s 32 bits ARM 2019.11
8192 8192 TurboTranspose Byte
8192 8192 blosc byteshuffle
1820 2341 TurboTranspose Nibble
372 252 blosc bitshuffle
E MB/s D MB/s 64 bits ARM 2019.11
4096 8192 TurboTranspose Byte
5461 5461 blosc byteshuffle
1490 1490 TurboTranspose Nibble
372 260 blosc bitshuffle

Transpose/Shuffle benchmark w/ large files (100MB).

MB/s: 1,000,000 bytes/second

    ./tpbench -s# file  (# = 8,4,2)
E MB/s D MB/s Transpose 16 bits AVX2 2019.11
9208 9795 TurboTranspose Byte
8382 7689 Blosc byteshuffle
9377 9584 TurboTranspose Nibble
2750 2530 Blosc bitshuffle
13725 13900 memcpy
E MB/s D MB/s Transpose 32 bits AVX2 2019.11
9718 9713 TurboTranspose Byte
9181 9030 Blosc byteshuffle
8750 9472 TurboTranspose Nibble
2767 2942 Blosc bitshuffle 4
E MB/s D MB/s Transpose 64 bits AVX2 2019.11
8998 9573 TurboTranspose Byte
8721 8586 Blosc byteshuffle 2
8252 9222 TurboTranspose Nibble
2711 2053 Blosc bitshuffle 2

E MB/s D MB/s 16 bits ARM 2019.11
872 3998 TurboTranspose Byte
678 3852 blosc byteshuffle
1365 2195 TurboTranspose Nibble
357 280 blosc bitshuffle
3921 3913 memcpy
E MB/s D MB/s 32 bits ARM 2019.11
1828 3768 TurboTranspose Byte
1769 3713 blosc byteshuffle
1456 2299 TurboTranspose Nibble
374 243 blosc bitshuffle
E MB/s D MB/s 64 bits ARM 2019.11
1793 3572 TurboTranspose Byte
1784 3544 blosc byteshuffle
1176 1267 TurboTranspose Nibble
331 203 blosc bitshuffle

- Compression test (transpose/shuffle+lz4)

🆕 Download IcApp, a new benchmark for TurboPFor+TurboTranspose,
for testing almost all integer and floating point file types.
Note: the lossy compression benchmark is available with icapp only.

- Speed test (file msg_sweep3d)
    C size  ratio %  C MB/s  D MB/s  Name AVX2
11,348,554     18.1    2276    4425  TurboTranspose Nibble+lz
22,489,691     35.8    1670    3881  TurboTranspose Byte+lz
43,471,376     69.2     348     402  SPDP
44,626,407     71.0    1065    2101  bitshuffle+lz
62,865,612    100.0   13300   13300  memcpy
    ./tpbench -s4 -z *.sp
File         File size    lz %   Tp8lz   Tp4lz    BSlz   spdp1   spdp9  Tp4lzt eTp4lzt
msg_bt       133194716    94.3    70.4    66.4    73.9    70.0    67.4    54.7    32.4
msg_lu        97059484   100.4    77.1    70.4    75.4    76.8    74.0    61.0    42.2
msg_sppm     139497932    11.7    11.6    12.6    15.4    14.4    13.7     9.0     5.6
msg_sp       145052928   100.3    68.8    63.7    68.1    67.9    65.3    52.6    24.9
msg_sweep3d   62865612    98.7    35.8    18.1    71.0    69.6    13.7     9.8     3.8
num_brain     70920000   100.4    76.5    71.1    77.4    79.1    73.9    63.4    32.6
num_comet     53673984    92.4    79.0    77.6    82.1    84.5    84.6    70.1    41.7
num_control   79752372    99.4    89.5    90.7    88.1    98.3    98.5    81.4    51.2
num_plasma    17544800   100.4     0.7     0.7    75.5    30.7     2.9     0.3     0.2
obs_error     31080408    89.2    73.1    70.0    76.9    78.3    49.4    20.5    12.2
obs_info       9465264    93.6    70.2    61.9    72.9    62.4    43.8    27.3    15.1
obs_spitzer   99090432    98.3    90.4    95.6    93.6   100.1   100.7    80.2    52.3
obs_temp      19967136   100.4    89.5    92.4    91.0    99.4   100.1    84.0    55.8

Tp8 = Byte transpose, Tp4 = Nibble transpose, lz = lz4, lzt = lzturbo,39
eTp4lzt = lossy compression with lzturbo and allowed error = 0.0001 (1e-4)
Slow but best compression: SPDP9 and lzt

File         File size    lz %   Tp8lz   Tp4lz    BSlz   spdp1   spdp9  Tp4lzt eTp4lzt
msg_bt       266389432    94.5    77.2    76.5    81.6    77.9    75.4    69.9    16.0
msg_lu       194118968   100.4    82.7    81.0    83.7    83.3    79.6    75.5    21.0
msg_sppm     278995864    18.9    14.5    14.9    19.5    21.5    19.8    11.2     2.8
msg_sp       290105856   100.4    79.2    77.5    80.2    78.8    77.1    71.3    12.4
msg_sweep3d  125731224    98.7    50.7    36.7    80.4    76.2    33.2    27.3     1.9
num_brain    141840000   100.4    82.6    81.1    84.5    87.8    83.3    77.0    16.3
num_comet    107347968    92.8    83.3    78.8    76.3    86.5    86.0    69.8    21.2
num_control  159504744    99.6    92.2    90.9    89.4    97.6    98.9    85.5    25.8
num_plasma    35089600    75.2     0.7     0.7    84.5    77.3     3.0     0.3     0.1
obs_error     62160816    78.7    81.0    77.5    84.4    87.9    62.3    23.4     6.3
obs_info      18930528    92.3    75.4    70.6    82.4    81.7    51.2    33.1     7.7
obs_spitzer  198180864    95.4    93.2    93.7    86.4   100.1   102.4    78.0    26.9
obs_temp      39934272   100.4    93.1    93.8    91.7    98.0    97.4    88.2    28.8

eTp4lzt = lossy compression with allowed error = 0.0001

Compile:

    git clone git://github.com/powturbo/TurboTranspose.git
    cd TurboTranspose
Linux + Windows MinGW:

    make
    or
    make AVX2=1

Windows Visual C++:

    nmake /f makefile.vs
    or
    nmake AVX2=1 /f makefile.vs

  • benchmark with other libraries:
    download or clone bitshuffle or blosc, then type

      make AVX2=1 BLOSC=1
      or
      make AVX2=1 BITSHUFFLE=1
    

Testing:

  • benchmark "transpose" functions

    ./tpbench [-s#] [-z] file
    -s# = element size in bytes, # = 2,4,8,16,... (default 4)
    -z  = lz77 compression benchmark only (bitshuffle package mandatory)
    

Function usage:

Byte transpose:

void tpenc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
void tpdec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);

in : input buffer
n : number of bytes
out : output buffer
esize : element size in bytes (2,4,8,...)
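
To make the data layout concrete, here is a minimal scalar sketch of a byte transpose and its inverse. It is a reference illustration only: `ref_tpenc` and `ref_tpdec` are hypothetical names, the parameters mirror the signatures above, and copying the trailing bytes unchanged when `n` is not a multiple of `esize` is an assumption of this sketch, not documented library behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scalar reference; not the library's SIMD implementation. */

/* Gather byte j of every element into plane j:
   in : e0b0 e0b1 .. | e1b0 e1b1 .. | ..   ->   out: e0b0 e1b0 .. | e0b1 e1b1 .. */
void ref_tpenc(const unsigned char *in, size_t n, unsigned char *out, size_t esize) {
    assert(esize > 0);
    size_t cnt = n / esize;                 /* number of whole elements */
    for (size_t i = 0; i < cnt; i++)
        for (size_t j = 0; j < esize; j++)
            out[j * cnt + i] = in[i * esize + j];
    /* Assumption of this sketch: leftover bytes are copied unchanged. */
    memcpy(out + cnt * esize, in + cnt * esize, n - cnt * esize);
}

/* Inverse: scatter each plane back into interleaved elements. */
void ref_tpdec(const unsigned char *in, size_t n, unsigned char *out, size_t esize) {
    assert(esize > 0);
    size_t cnt = n / esize;
    for (size_t i = 0; i < cnt; i++)
        for (size_t j = 0; j < esize; j++)
            out[i * esize + j] = in[j * cnt + i];
    memcpy(out + cnt * esize, in + cnt * esize, n - cnt * esize);
}
```

With 4-byte elements, for example, all most-significant bytes end up adjacent. In numeric data these often repeat, which is what gives a following lz77 stage longer matches.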

Nibble transpose:

void tp4enc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
void tp4dec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);

in : input buffer
n : number of bytes
out : output buffer
esize : element size in bytes (2,4,8,...)
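
The idea behind a nibble transpose can be sketched the same way: each byte plane is split further into a plane of high nibbles and a plane of low nibbles. The names `ref_tp4enc` and `ref_tp4dec`, the plane order (high before low) and the requirement that the element count be even are all assumptions of this sketch; the library's actual output layout may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference; layout choices are assumptions of this sketch. */

/* For each byte position j, pack the high nibbles of two consecutive
   elements into one byte of the "hi" plane and their low nibbles into
   one byte of the "lo" plane. Output: hi0 | lo0 | hi1 | lo1 | ... */
void ref_tp4enc(const unsigned char *in, size_t n, unsigned char *out, size_t esize) {
    assert(esize > 0 && n % (2 * esize) == 0);   /* even number of whole elements */
    size_t half = (n / esize) / 2;               /* bytes per nibble plane */
    for (size_t j = 0; j < esize; j++) {
        unsigned char *hi = out + (2 * j) * half;
        unsigned char *lo = hi + half;
        for (size_t i = 0; i < half; i++) {
            unsigned char a = in[(2 * i) * esize + j];
            unsigned char b = in[(2 * i + 1) * esize + j];
            hi[i] = (unsigned char)((a & 0xF0) | (b >> 4));
            lo[i] = (unsigned char)((a << 4) | (b & 0x0F));
        }
    }
}

/* Inverse: recombine the high and low nibble planes into bytes. */
void ref_tp4dec(const unsigned char *in, size_t n, unsigned char *out, size_t esize) {
    assert(esize > 0 && n % (2 * esize) == 0);
    size_t half = (n / esize) / 2;
    for (size_t j = 0; j < esize; j++) {
        const unsigned char *hi = in + (2 * j) * half;
        const unsigned char *lo = hi + half;
        for (size_t i = 0; i < half; i++) {
            unsigned char h = hi[i], l = lo[i];
            out[(2 * i) * esize + j]     = (unsigned char)((h & 0xF0) | (l >> 4));
            out[(2 * i + 1) * esize + j] = (unsigned char)((h << 4) | (l & 0x0F));
        }
    }
}
```

Splitting at nibble granularity groups bits of similar significance more tightly than a byte transpose does, at the cost of extra shifting and masking.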

Environment:

OS/Compiler (64 bits):
  • Linux: GNU GCC (>=4.6)
  • Linux: Clang (>=3.2)
  • Windows: MinGW-w64 makefile
  • Windows: Visual C++ (>=VS2008) - makefile.vs (for nmake)
  • Windows: Visual Studio project file - vs/vs2017 - Thanks to PavelP
  • Linux ARM: 64 bits aarch64 ARMv8: gcc (>=6.3)
  • Linux ARM: 64 bits aarch64 ARMv8: clang
Multithreading:
  • All TurboTranspose functions are thread safe

Last update: 25 Oct 2019