Benchmark: Time Series - TurboPFor, TurboFloat, TurboFloat LzX, TurboGorilla,...
powturbo opened this issue · 0 comments
Most of time series databases (ex. DuckDB) are storing floating point data as 64 bits. They are reporting some extraordinary compression ratio by using a gorilla/chimp like algorithm.
However as shown in this benchmark, lot of time series data (ex. temparature, climate data, stocks,...) don't have more than 1 or 2 fixed decimal digits and can be stored losslessly in 16/32 bits integers. Integer compression algorithms can then be used, which results in significant compression ratio and several times faster as the gorilla like algorithms.
Gorilla or chimp based algos, cannot exceed 1GB/s in decompression, wherea TurboPFor is in the order of 15Gb/s, TurboBitByte is >100Gb/s.
We can also have 50% instant reduction by storing these values in 32 bits floats instead of 64 bits floats.
Note that outside time series with small changes between successive values the gorilla algorithm is of little use.
Not conviced? Verify yourself all these claims by downloading icapp and make your own benchmark with your data.
Datasets
1 - Floating point mode (32 bits floats) icapp city_temperature-fixed.csv -Ftf -K3 -I15 -J16 -e64,12,63,62 E MB/s size ratio D MB/s function floating point size=32 bits (lz=lz4,1) 593 6858573 59.01% 1262 64:fpcenc32 TurboFloat LzX bitio 1541 8323945 71.61% 9502 12:p4nzenc256v32 TurboPFor256 zigzag 690 8395129 72.23% 757 63:fphenc32 Chimp algo bitio 591 8741963 75.21% 826 62:fpgenc32 TurboGorilla bitio 2 - Integer mode: Convert floating point to 16 bits integer (multiplied by the number of decimal digits=1) and use Integer -Compression. 'city_temperature' is now compressed to 2998946 vs the 6858573 above. icapp city_temperature-fixed.csv -Fts.1 -K3 -I15 -J15 -v0 -e11,39,51 E MB/s size ratio D MB/s function integer size=16 bits (lz=lz4,1) 653.64 2998946 51.60% 3683.83 11:p4nzenc128v16 TurboPFor128 zigzag 366.69 3130714 53.87% 1123.45 39:vszenc16 TurboVSimple zigzag 4596.49 3221225 55.43% 6979.32 51:v8nzenc128v16 TByte+TPackV zigzag 3 - Quasi lossless mode (32 bits float): each input value is preprocessed by TurboRazor fprazor32 function using a pointwise relative error (PWE) according to the decimal digits of the value. (applicable only for text,csv files : option -FcA or -FtA ). Ex. For the value 72.9, we use the PWE 0.5 rounding this value with TurboRazor to get 72.8906. For the value 72.326 a PWE of 0.005 is used. icapp city_temperature-fixed.csv -FtfA -K3 -I15 -J15 -e62,65,66,67,68,77,64 E MB/s size ratio D MB/s function floating point size=32 bits (lz=lz4,1) 979 5118429 44.03% 1632 62:fpgenc32 TurboGorilla bitio 1543 5589023 48.08% 7822 65:fpxenc32 TurboFloat XOR 1207 5757626 49.53% 1818 66:fpfcmenc32 TurboFloat FCM 1286 5844696 50.28% 1561 67:fpdfcmenc32 TurboFloat DFCM 1276 5936571 51.07% 1559 68:fp2dfcmenc32 TurboFloat DFCM 2D 609 7935767 68.27% 1338 64:fpcenc32 TurboFloat LzX bitio
icapp Stocks-Germany-sample.txt -Ftf -K3 -I15 -J16 -e64,12,63,65,62 E MB/s size ratio D MB/s function floating point size=32 bits (lz=lz4,16) 621 6310939 31.55% 1251 64:fpcenc32 TurboFloat LzX bitio 1729 8582243 42.91% 10033 12:p4nzenc256v32 TurboPFor256 zigzag 684 10113105 50.57% 906 63:fphenc32 Chimp algo bitio 1469 10421070 52.11% 7783 65:fpxenc32 TurboFloat XOR 120 11048273 55.24% 2470 62:fpgenc32 TurboGorilla bitio icapp SSD_HDD_benchmarks.csv -Ftf -K3 -I15 -J16 -e64,64,62,12,63 815 15699 43.96% 2160 64:fpcenc32 TurboFloat LzX bitio 1121 20296 56.84% 6037 65:fpxenc32 TurboFloat XOR 1700 20752 58.12% 2551 62:fpgenc32 TurboGorilla bitio 1325 21013 58.85% 9146 12:p4nzenc256v32 TurboPFor256 zigzag 1764 21249 59.51% 2106 63:fphenc32 Chimp algo bitio icapp influxdb2/bird-migration-data/bird-migration.csv -Ftf -H4 -K7 -I15 -J16 -e64,63,62 763 30819 42.89% 2050 64:fpcenc32 TurboFloat LzX bitio 1455 36170 50.34% 1777 63:fphenc32 Chimp algo bitio 1422 36250 50.45% 3190 62:fpgenc32 TurboGorilla bitio
- TurboFloat LzX: TurboPFor's lz77 based algorithm for floating point data (can work with 16/32/64 bits floats).
TurboFloat LzX compress better and decompress faster than chimp128. - TurboGorilla : Improved compression and speed of the Gorilla algorithm in TurboPFor
- Chimp : Faster implementation in c of the chimp (Gorilla based) algorithm in TurboPFor
- The chimp128 idea was previously described in TsXor