/density

Superfast compression library

Primary LanguageRustApache License 2.0Apache-2.0

Density

Superfast compression library

Density is a free, pure rust (formerly C99) open-source compression library.

It is focused on high-speed compression, at the best ratio possible. Density's algorithms are currently at the pareto frontier of compression speed vs ratio (cf. here for an independent benchmark).

Density features a simple API to enable quick integration in any project.

Why is it so fast ?

One of the biggest assets of density is its work unit: unlike most compression algorithms, it is not a byte but a group of 4 bytes.

When other libraries consume one byte of data to apply their algorithmic processing, density consumes 4 bytes.

That's why density's algorithms have been designed from scratch. They cater for 4-byte work units and still provide interesting compression ratios.

Speed pedigree traits

  • 4-byte work units
  • heavy use of registers as opposed to memory
  • avoidance of or use of minimal branching when possible
  • use of low memory data structures to favor processor cache Lx storage
  • library wide inlining

A "blowup protection" is provided, dramatically increasing the processing speed of incompressible input data. The aim is to never exceed original data size, even for incompressible inputs.

Benchmarks

Quick benchmark

Density features an integrated single-core in-memory benchmark.

Just use cargo bench to assess the performance of the library on your platform, as well as lz4's and snappy's, using the dickens file from the renowned silesia corpus.

It is also possible to run the benchmark with your own files, using the following command: FILE=... cargo bench (replace ... with a file path). Popular compression test files include the silesia corpus files or enwik8.

>> Link to a sample run log

Third-party benchmarks

Other benchmarks featuring density (non-exhaustive list) :

  • squash is an abstraction layer for compression algorithms, and has an extremely exhaustive set of benchmark results, including density's, available here.

  • lzbench is an in-memory benchmark of open-source LZ77/LZSS/LZMA compressors.

Build

Density can be built on a number of rust-compatible platforms. First use rustup to install rust.

a) get the source code:

    git clone https://github.com/g1mv/density.git
    cd density

b) build and test:

    cargo build
    cargo test

c) run benchmarks with or without your own files:

    cargo bench
    FILE=... cargo bench

About the algorithms

Density's algorithms are general purpose, very fast algorithms. They empirically exhibit strong performance (pareto frontier speed/ratio) on voluminous datasets and text-based datasets, and are less performant on very small datasets. Their 4-byte work unit approach makes them "slower learners" than Lempel-Ziv-based algorithms, albeit being much faster theoretically.

Algorithm Speed rank Ratio rank Dictionary unit(s) Prediction unit(s) Sig. size (bytes)
chameleon 1st 3rd 1 0 8
cheetah 2nd 2nd 2 1 8
lion 3rd 1st 2 5 6

Chameleon is a dictionary lookup based compression algorithm. It is designed for absolute speed (GB/s order) both for compression and decompression.

Cheetah, developed with inputs from Piotr Tarsa, is derived from chameleon and uses swapped dual dictionary lookups with a single prediction unit.

Lion is derived from chameleon/cheetah. It uses different data structures, dual dictionary lookups and 5 prediction units, giving it a compression ratio advantage on moderate to highly compressible data.

Quick start

Using density in your rust code is a straightforward process.

Include the required dependency in the Cargo.toml file:

[dependencies]
density = "0.16.0"

Use the API:

use std::fs::read;
use density::algorithms::chameleon::chameleon::Chameleon;

fn main() {
    let file_mem = read("file_path").unwrap();

    let mut encoded_mem = vec![0_u8; file_mem.len() << 1];
    let encoded_size = Chameleon::encode(&file_mem, &mut encoded_mem).unwrap();

    let mut decoded_mem = vec![0_u8; file_mem.len() << 1];
    let decoded_size = Chameleon::decode(&encoded_mem[0..encoded_size], &mut decoded_mem).unwrap();
}

And that's it! We've done a compression/decompression round trip with a few lines!