powturbo/TurboPFor-Integer-Compression

Benchmark: Time Series - TurboPFor, TurboFloat, TurboFloat LzX, TurboGorilla,...

powturbo opened this issue · 0 comments

Most of time series databases (ex. DuckDB) are storing floating point data as 64 bits. They are reporting some extraordinary compression ratio by using a gorilla/chimp like algorithm.

However as shown in this benchmark, lot of time series data (ex. temparature, climate data, stocks,...) don't have more than 1 or 2 fixed decimal digits and can be stored losslessly in 16/32 bits integers. Integer compression algorithms can then be used, which results in significant compression ratio and several times faster as the gorilla like algorithms.
Gorilla or chimp based algos, cannot exceed 1GB/s in decompression, wherea TurboPFor is in the order of 15Gb/s, TurboBitByte is >100Gb/s.
We can also have 50% instant reduction by storing these values in 32 bits floats instead of 64 bits floats.
Note that outside time series with small changes between successive values the gorilla algorithm is of little use.

Not conviced? Verify yourself all these claims by downloading icapp and make your own benchmark with your data.

Datasets

1 - Floating point mode (32 bits floats)

icapp city_temperature-fixed.csv -Ftf -K3 -I15 -J16 -e64,12,63,62
E MB/s     size    ratio   D MB/s function floating point size=32 bits (lz=lz4,1) 
  593    6858573  59.01%    1262  64:fpcenc32         TurboFloat LzX bitio        
 1541    8323945  71.61%    9502  12:p4nzenc256v32    TurboPFor256 zigzag
  690    8395129  72.23%     757  63:fphenc32         Chimp  algo    bitio
  591    8741963  75.21%     826  62:fpgenc32         TurboGorilla   bitio

2 - Integer mode: Convert floating point to 16 bits integer (multiplied by the number of decimal digits=1)
and use Integer -Compression.  'city_temperature' is now compressed to 2998946 vs the 6858573 above.

icapp city_temperature-fixed.csv -Fts.1 -K3 -I15 -J15 -v0 -e11,39,51
E MB/s      size     ratio     D MB/s function integer size=16 bits (lz=lz4,1)
  653.64    2998946  51.60%    3683.83  11:p4nzenc128v16    TurboPFor128 zigzag
  366.69    3130714  53.87%    1123.45  39:vszenc16         TurboVSimple zigzag
 4596.49    3221225  55.43%    6979.32  51:v8nzenc128v16    TByte+TPackV zigzag

3 - Quasi lossless mode (32 bits float): each input value is preprocessed by TurboRazor fprazor32 function using a
pointwise relative error (PWE) according to the decimal digits of the value.
(applicable only for text,csv files : option -FcA or -FtA ).
Ex. For the value 72.9, we use the PWE 0.5 rounding this value with TurboRazor to get 72.8906. 
For the value 72.326 a PWE of 0.005 is used.

icapp city_temperature-fixed.csv -FtfA -K3 -I15 -J15 -e62,65,66,67,68,77,64
E MB/s     size    ratio   D MB/s function floating point size=32 bits (lz=lz4,1)
  979    5118429  44.03%    1632  62:fpgenc32         TurboGorilla   bitio
 1543    5589023  48.08%    7822  65:fpxenc32         TurboFloat XOR
 1207    5757626  49.53%    1818  66:fpfcmenc32       TurboFloat FCM
 1286    5844696  50.28%    1561  67:fpdfcmenc32      TurboFloat DFCM
 1276    5936571  51.07%    1559  68:fp2dfcmenc32     TurboFloat DFCM 2D
  609    7935767  68.27%    1338  64:fpcenc32         TurboFloat LzX bitio

icapp Stocks-Germany-sample.txt -Ftf -K3 -I15 -J16  -e64,12,63,65,62
 E MB/s     size     ratio     D MB/s function floating point size=32 bits (lz=lz4,16)
  621    6310939  31.55%    1251  64:fpcenc32         TurboFloat LzX bitio
 1729    8582243  42.91%   10033  12:p4nzenc256v32    TurboPFor256 zigzag
  684   10113105  50.57%     906  63:fphenc32         Chimp  algo    bitio
 1469   10421070  52.11%    7783  65:fpxenc32         TurboFloat XOR
 120    11048273  55.24%    2470  62:fpgenc32         TurboGorilla   bitio

icapp SSD_HDD_benchmarks.csv -Ftf -K3 -I15 -J16 -e64,64,62,12,63
  815      15699  43.96%    2160  64:fpcenc32         TurboFloat LzX bitio
 1121      20296  56.84%    6037  65:fpxenc32         TurboFloat XOR 
 1700      20752  58.12%    2551  62:fpgenc32         TurboGorilla   bitio
 1325      21013  58.85%    9146  12:p4nzenc256v32    TurboPFor256 zigzag
 1764      21249  59.51%    2106  63:fphenc32         Chimp  algo    bitio

icapp influxdb2/bird-migration-data/bird-migration.csv -Ftf -H4 -K7 -I15 -J16 -e64,63,62
  763      30819  42.89%    2050  64:fpcenc32         TurboFloat LzX bitio
 1455      36170  50.34%    1777  63:fphenc32         Chimp  algo    bitio
 1422      36250  50.45%    3190  62:fpgenc32         TurboGorilla   bitio
  • TurboFloat LzX: TurboPFor's lz77 based algorithm for floating point data (can work with 16/32/64 bits floats).
    TurboFloat LzX compress better and decompress faster than chimp128.
  • TurboGorilla : Improved compression and speed of the Gorilla algorithm in TurboPFor
  • Chimp : Faster implementation in c of the chimp (Gorilla based) algorithm in TurboPFor
  • The chimp128 idea was previously described in TsXor