Pcodec (or pco, pronounced "pico") losslessly compresses and decompresses numerical sequences with a high compression ratio and high speed.
Use cases include:
- columnar data
- long-term time series data
- serving numerical data to web clients
- low-bandwidth communication
Data types: u32, u64, i32, i64, f32, f64
It is also possible to implement your own data type via NumberLike and (if necessary) UnsignedLike and FloatLike.
For timestamps or smaller integers, it is probably best to simply cast to one
of the natively supported data types.
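As a minimal sketch of that casting advice, the helpers below widen 16-bit integers to u32 and turn timestamps into i64 milliseconds since the Unix epoch. The function names and the choice of milliseconds are assumptions made for illustration; none of this is part of pco's API.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Widen 16-bit integers to u32, one of the natively supported types.
/// (Hypothetical helper, not part of pco.)
fn widen_u16(values: &[u16]) -> Vec<u32> {
    values.iter().map(|&v| u32::from(v)).collect()
}

/// Narrow back after decompression. Returns None if any value
/// does not fit in a u16.
fn narrow_u16(values: &[u32]) -> Option<Vec<u16>> {
    values.iter().map(|&v| u16::try_from(v).ok()).collect()
}

/// Represent timestamps as i64 milliseconds since the Unix epoch.
/// Times before the epoch become negative. The millisecond unit is
/// an arbitrary choice for this sketch.
fn timestamps_to_millis(times: &[SystemTime]) -> Vec<i64> {
    times
        .iter()
        .map(|t| match t.duration_since(UNIX_EPOCH) {
            Ok(d) => d.as_millis() as i64,
            Err(e) => -(e.duration().as_millis() as i64),
        })
        .collect()
}
```

Widening is lossless, so the narrowing step after decompression should only fail if the data was not originally 16-bit.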
See the benchmarks for instructions on running the benchmark suite and for its results.
The core idea of pco is to represent numbers as approximate, entropy-coded bins paired with exact offsets into those bins. Depending on the mode, there may be up to 2 streams (latent variables) of these bin-offset pairings.
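The bin-plus-offset idea can be illustrated with a small sketch. It assumes each bin covers a power-of-two-sized range starting at a lower bound; the `Bin` struct, its fields, and the `encode`/`decode` functions are hypothetical and do not reflect pco's actual internals or file format. In a real codec the bin index would be entropy-coded, while the offset would be written exactly in `offset_bits` bits.

```rust
/// A bin covering the range [lower, lower + 2^offset_bits - 1].
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bin {
    lower: u32,
    offset_bits: u32,
}

/// Split a number into (bin index, exact offset into that bin).
/// Returns None if no bin contains the number.
fn encode(bins: &[Bin], x: u32) -> Option<(usize, u32)> {
    bins.iter().enumerate().find_map(|(i, b)| {
        // A number below the bin's lower bound cannot belong to it.
        let offset = x.checked_sub(b.lower)?;
        if u64::from(offset) < (1u64 << b.offset_bits) {
            Some((i, offset))
        } else {
            None
        }
    })
}

/// Recover the original number from its bin index and offset.
fn decode(bins: &[Bin], bin_idx: usize, offset: u32) -> u32 {
    bins[bin_idx].lower + offset
}
```

The bin index only approximates the value (it says which range the number falls in), and the offset pins it down exactly, which is what keeps the scheme lossless.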
Pco is mainly meant to be wrapped into another format for production use cases. It has a hierarchy of multiple batches per page; multiple pages per chunk; and multiple chunks per file.
| | unit of ___ | size for good compression |
|---|---|---|
| chunk | compression | >20k numbers |
| page | interleaving w/ wrapping format | >1k numbers |
| batch | decompression | 256 numbers (fixed) |
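To make the hierarchy in the table concrete, here is a rough sketch of how a wrapping format might slice one chunk's numbers into pages and batches. The `split_chunk` function and its `page_size` parameter are hypothetical, and the sketch assumes the last batch of a page may hold fewer than 256 numbers when the page length is not a multiple of 256.

```rust
/// Batch size from the table above.
const BATCH_SIZE: usize = 256;

/// Split one chunk's numbers into pages of `page_size`, and each page
/// into batches of up to BATCH_SIZE numbers.
/// (Hypothetical helper; panics if `page_size` is 0.)
fn split_chunk(chunk: &[u32], page_size: usize) -> Vec<Vec<&[u32]>> {
    chunk
        .chunks(page_size)
        .map(|page| page.chunks(BATCH_SIZE).collect())
        .collect()
}
```

For example, a chunk of 20,000 numbers with a page size of 1,000 would yield 20 pages, each holding three full batches and one batch of 232 numbers.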
The standalone format is a minimal implementation of a wrapped format. It supports batched decompression only; no nullability, multiple columns, random access, seeking, or other niceties. It is mainly useful for quick proofs of concept and benchmarking.