Floating point compression
Closed this issue · 1 comments
Tree-buf requires the integration of one or more high-performance floating-point compression libraries. The currently integrated Gorilla compressor panics at runtime, and the initial investigation indicates that both faster and higher compression ratio techniques are available.
The task is to research algorithms and libraries, evaluate their trade-offs using multiple real-world data sets, and integrate one or more of the best into Tree-Buf.
This is a huge undertaking. It is likely that this issue will be split into multiple issues but the first step is probably to understand the space and make a realistic evaluation of what algorithms are candidates.
Mentoring
I'll be glad to answer any questions about Tree-Buf, it's architecture, internal APIs, and benchmarks. For SIMD and compression, you're on your own.
Candidate algorithms and resources
- ZFP & FPZIP
- Akamuli
- Gorilla
- FPC
- http://blog.omega-prime.co.uk/2016/01/25/compression-of-floating-point-timeseries/
- LCPFPV
- https://userweb.cs.txstate.edu/~mb92/papers/dcc06.pdf
- Others?
In the event that some algorithms perform better on different data sets, it is not necessary that just one floating-point compression algorithm be selected. Tree-Buf does a pass on a sample of the data set when selecting what algorithm to use and may support options that allow the user to select between speed or high compression ratios. There are also options to enable lossy compression within a given tolerance.
Requirements
- Utilize SIMD, but have fallbacks on platforms that do not support it.
- Implemented in pure Rust in order to not require complicated build steps or installation of dependencies.
- Unsafe is ok when necessary
- Able to compile to WebAssembly, at least eventually.
Much of the work need not exist in the Tree-Buf repository. If you want to put the actual compression algorithms in a separate crate that is not only ok but encouraged.
What's been tried already?
- Gorilla, via the gibbon crate. It's panicking at runtime when compiled in debug.
- Gorilla, via the tsz crate. The API does not separate the timestamps from the data points.
- zfp, via ndarray-zfp-rs. It fails to compile on Windows, but likely would run into problems on WebAssembly later. The zfp family of algorithms is a likely candidate in the long term.
The Gorilla situation has significantly improved, closing for now.