Pcodec

[Figure: bar charts showing better compression for pco than zstd.parquet]

Pcodec (or pco, pronounced "pico") losslessly compresses and decompresses numerical sequences, achieving high compression ratios at high speed.

Use cases include:

columnar data
long-term time series data
serving numerical data to web clients
low-bandwidth communication

Data types: u32, u64, i32, i64, f32, f64

It is also possible to implement your own data type via NumberLike and (if necessary) UnsignedLike and FloatLike. For timestamps or smaller integers, it is probably best to simply cast to one of the natively supported data types.
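As a minimal sketch of the casting approach, the helpers below widen `u16` values to `u32` before compression and narrow them back afterwards. The function names are hypothetical and not part of pco's API; only standard Rust conversions are used.

```rust
use std::convert::TryFrom;

/// Widen 16-bit integers to u32, one of the natively supported types.
/// (Hypothetical helper for illustration; not part of pco.)
fn widen_u16(nums: &[u16]) -> Vec<u32> {
    nums.iter().map(|&x| u32::from(x)).collect()
}

/// Narrow u32 values back to u16 after decompression.
/// Returns None if any value does not fit in 16 bits.
fn narrow_to_u16(nums: &[u32]) -> Option<Vec<u16>> {
    nums.iter().map(|&x| u16::try_from(x).ok()).collect()
}
```

Because widening is exact, narrowing the decompressed values recovers the original data unchanged. Timestamps stored as integer counts (for example, nanoseconds since an epoch) could be handled the same way with `i64`.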

Get Started

Use the CLI

Use the Rust API

Performance and Compression Ratio

See the benchmarks for instructions on running the benchmark suite and for its results.

File Format

The core idea of pco is to represent numbers as approximate, entropy-coded bins paired with exact offsets into those bins. Depending on the mode, there may be up to 2 streams (latent variables) of these bin-offset pairings.
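The bin-and-offset idea can be illustrated with a toy sketch. The `Bin` struct, its fields, and both functions are assumptions made for illustration; they do not reflect pco's actual bin layout or API. Each number is split into an approximate part (the index of the bin containing it) and an exact part (its offset from the bin's lower bound).

```rust
/// A toy bin covering the range [lower, lower + 2^offset_bits - 1].
/// (Illustrative only; not pco's actual bin representation.)
struct Bin {
    lower: u32,
    offset_bits: u32,
}

/// Split `x` into (bin index, exact offset within that bin).
/// Returns None if no bin contains `x`.
fn to_bin_offset(bins: &[Bin], x: u32) -> Option<(usize, u32)> {
    bins.iter().enumerate().find_map(|(i, bin)| {
        // Use u64 arithmetic so that a 32-bit-wide bin cannot overflow.
        let width = 1u64 << bin.offset_bits;
        let lower = u64::from(bin.lower);
        let x64 = u64::from(x);
        if x64 >= lower && x64 < lower + width {
            Some((i, x - bin.lower))
        } else {
            None
        }
    })
}

/// Reconstruct the exact number from its bin index and offset.
fn from_bin_offset(bins: &[Bin], idx: usize, offset: u32) -> u32 {
    bins[idx].lower + offset
}
```

In this sketch, the bin index is the part that would be entropy-coded, so frequently used bins cost few bits, while the offset needs exactly `offset_bits` bits. Reconstruction is exact, which is what keeps the scheme lossless.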

Pco is mainly meant to be wrapped into another format for production use cases. It has a hierarchy of multiple batches per page, multiple pages per chunk, and multiple chunks per file.

level	unit of ___	size for good compression
chunk	compression	>20k numbers
page	interleaving with wrapping format	>1k numbers
batch	decompression	256 numbers (fixed)
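To make the batch size concrete, here is a small sketch of how one page's numbers divide into batches of 256. The helper names are hypothetical, and the sketch assumes that the final batch of a page may be shorter than 256; it does not use or mirror pco's API.

```rust
/// Fixed batch size from the table above.
const BATCH_SIZE: usize = 256;

/// Number of batches needed to hold `n` numbers (the last may be partial).
fn batch_count(n: usize) -> usize {
    (n + BATCH_SIZE - 1) / BATCH_SIZE
}

/// Split one page's numbers into batches of up to 256 numbers each.
fn split_into_batches(page: &[u32]) -> Vec<&[u32]> {
    page.chunks(BATCH_SIZE).collect()
}
```

For example, a page of 1,000 numbers splits into three full batches and one batch of 232.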

The standalone format is a minimal implementation of a wrapped format. It supports batched decompression only, with no support for nullability, multiple columns, random access, seeking, or other niceties. It is mainly useful for quick proofs of concept and benchmarking.

Contributing

See CONTRIBUTING.md.

Extra

join the Discord

terminology