
Lossless compressor and decompressor for numerical data using quantiles

Primary language: Rust. License: Apache-2.0.

Pcodec

[Figure: bar charts showing better compression for pco than zstd.parquet]

Pcodec (or pco, pronounced "pico") losslessly compresses and decompresses numerical sequences with a high compression ratio and at high speed.

Use cases include:

  • columnar data
  • long-term time series data
  • serving numerical data to web clients
  • low-bandwidth communication

Data types: u32, u64, i32, i64, f32, f64

It is also possible to implement your own data type via NumberLike and (if necessary) UnsignedLike and FloatLike. For timestamps or smaller integers, it is probably best to simply cast to one of the natively supported data types.
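As a rough illustration of the casting advice above, the sketch below widens 16-bit integers to u32 before compression and narrows them back afterwards. The function names widen_u16 and narrow_u16 are hypothetical helpers written for this example, not part of pco, and the compression step itself is omitted.

```rust
use std::convert::TryFrom;

/// Widen 16-bit values to u32, one of the natively supported data types.
/// Hypothetical helper for illustration; not part of the pco API.
fn widen_u16(nums: &[u16]) -> Vec<u32> {
    nums.iter().map(|&x| u32::from(x)).collect()
}

/// Narrow decompressed u32 values back to u16.
/// Returns None if any value does not fit, which would indicate
/// that the data did not originate from u16 values.
fn narrow_u16(nums: &[u32]) -> Option<Vec<u16>> {
    nums.iter().map(|&x| u16::try_from(x).ok()).collect()
}
```

Because widening never changes a value, the round trip through the wider type is lossless; the checked narrowing only guards against data that was never 16-bit to begin with.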

Get Started

Use the CLI

Use the Rust API

Performance and Compression Ratio

See the benchmarks for instructions on running the benchmark suite and for its results.

File Format

[Figure: pco wrapped format diagram]

The core idea of pco is to represent numbers as approximate, entropy-coded bins paired with exact offsets into those bins. Depending on the mode, there may be up to 2 streams (latent variables) of these bin-offset pairings.
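The bin-plus-offset idea can be sketched in a few lines. This is a simplified illustration, not pco's actual implementation: the Bin struct, the linear search, and the names split and join are assumptions made for this example, and the entropy coding of the bin index is left out.

```rust
/// A bin covering the inclusive range [lower, upper].
/// Hypothetical type for illustration; pco's real bins differ.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bin {
    lower: u32,
    upper: u32,
}

/// Split a number into (bin index, exact offset within that bin).
/// The bin index is the approximate part that would be entropy coded;
/// the offset is the exact remainder. Returns None if no bin contains x.
fn split(bins: &[Bin], x: u32) -> Option<(usize, u32)> {
    let idx = bins.iter().position(|b| b.lower <= x && x <= b.upper)?;
    Some((idx, x - bins[idx].lower))
}

/// Reconstruct the original number from its bin index and offset.
/// Panics if idx is out of range.
fn join(bins: &[Bin], idx: usize, offset: u32) -> u32 {
    bins[idx].lower + offset
}
```

With bins covering 0 to 9, 10 to 99, and 100 to 1000, the number 42 splits into bin index 1 and offset 32, and joining those two values recovers 42 exactly.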

[Figure: pco compression and decompression steps]

Pco is mainly meant to be wrapped into another format for production use cases. It has a hierarchy of multiple batches per page; multiple pages per chunk; and multiple chunks per file.

         unit of ___                        size for good compression
chunk    compression                        >20k numbers
page     interleaving w/ wrapping format    >1k numbers
batch    decompression                      256 numbers (fixed)
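To make the hierarchy concrete, here is a minimal sketch of how one chunk's numbers could be partitioned into pages and then into fixed-size batches. The function name layout and the page_size parameter are hypothetical; pco itself decides these boundaries through its own API, and only the batch size of 256 comes from the table above.

```rust
/// Batch size, fixed at 256 numbers per the table above.
const BATCH_SIZE: usize = 256;

/// Partition one chunk's numbers into pages of at most `page_size`
/// numbers, and each page into batches of at most BATCH_SIZE numbers.
/// Hypothetical helper for illustration; panics if page_size is 0.
fn layout(nums: &[i64], page_size: usize) -> Vec<Vec<&[i64]>> {
    nums.chunks(page_size)
        .map(|page| page.chunks(BATCH_SIZE).collect())
        .collect()
}
```

Only the last batch of each page can be shorter than 256 numbers, so choosing a page size that is a multiple of 256 keeps every batch full except possibly the very last one.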

The standalone format is a minimal implementation of a wrapped format. It supports batched decompression only, with no nullability, multiple columns, random access, seeking, or other niceties. It is mainly useful for quick proofs of concept and benchmarking.

Contributing

See CONTRIBUTING.md.

Extra

join the Discord

terminology

Quantile Compression