Columnix is a columnar storage format, similar to Parquet and ORC.
The goal of the experiment was to beat Parquet's read performance in Spark for flat schemas, while reducing the disk footprint by using newer compression algorithms such as lz4 and zstd.
Columnix supports:
1. row groups
2. indexes (at both the row group and file level)
3. vectorized reads
4. predicate pushdown
5. lazy reads
6. AVX2 and AVX512 predicate matching (sketched below)
7. memory-mapped IO
Spark's Parquet reader supports features 1-4, but has no lazy reads, only limited SIMD support (whatever the JVM provides), and IO goes through HDFS.
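To make the SIMD feature concrete, here is a minimal sketch (not the library's code) of how an AVX2 greater-than predicate over a chunk of a 64-bit column can produce a result bitmap. The helper name `match_i64_gt` and its signature are illustrative assumptions:

```c
/* Sketch only: AVX2 predicate match of a 64-bit column chunk against a
 * constant. Not the columnix API. Compile with -mavx2. */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Set bit i of mask when values[i] > cmp. The caller must zero-initialize
 * mask and size it to (count + 63) / 64 words. */
static void match_i64_gt(const int64_t *values, size_t count, int64_t cmp,
                         uint64_t *mask)
{
    const __m256i needle = _mm256_set1_epi64x(cmp);
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(values + i));
        __m256i gt = _mm256_cmpgt_epi64(v, needle);
        /* collapse the 4 lane-wide comparison results into a 4-bit mask */
        int bits = _mm256_movemask_pd(_mm256_castsi256_pd(gt));
        mask[i / 64] |= (uint64_t)bits << (i % 64);
    }
    for (; i < count; i++)  /* scalar tail */
        if (values[i] > cmp)
            mask[i / 64] |= UINT64_C(1) << (i % 64);
}
```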
Support for complex schemas was not a goal of the project. The format has no support for Parquet's Dremel-style definition & repetition levels or ORC's compound types (struct, list, map, union).
The library does not currently support encoding data prior to (or instead of) compression, for example run-length or dictionary encoding, despite placeholders in the code alluding to it. It was next on the TODO list, but I'd like to explore alternative approaches such as github.com/chriso/treecomp.
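As a rough illustration of the kind of pre-compression encoding meant here, a run-length encoder collapses consecutive repeated values into (value, count) pairs before the result is handed to lz4/zstd. The sketch below is generic C and does not correspond to anything in the library:

```c
/* Sketch only: trivial run-length encoding of a 32-bit column. */
#include <stdint.h>
#include <stddef.h>

struct run { int32_t value; uint32_t length; };

/* Collapse consecutive repeats into (value, length) pairs.
 * Returns the number of runs written to out (out must hold count entries). */
static size_t rle_encode_i32(const int32_t *in, size_t count, struct run *out)
{
    size_t runs = 0;
    for (size_t i = 0; i < count; i++) {
        if (runs && out[runs - 1].value == in[i])
            out[runs - 1].length++;
        else
            out[runs++] = (struct run){ in[i], 1 };
    }
    return runs;
}
```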
The following bindings are provided:
- Python (ctypes): ./contrib/columnix.py
- Spark (JNI): chriso/columnix-spark
One major caveat: the library uses mmap for reads. There is no HDFS compatibility, so real-world use is limited for the time being.
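For context, the read path boils down to mapping a local file into memory, which is why a remote filesystem like HDFS cannot be plugged in directly. A minimal sketch using only POSIX calls (the `map_file` helper is illustrative, not the library's API):

```c
/* Sketch only: memory-map a file read-only and return a pointer to it. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stddef.h>

static const void *map_file(const char *path, size_t *size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    void *ptr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid; release it with munmap(ptr, *size) */
    if (ptr == MAP_FAILED)
        return NULL;
    *size = (size_t)st.st_size;
    return ptr;
}
```

Because mmap needs a local file descriptor, supporting HDFS would require either mounting it locally (e.g. via FUSE) or adding a separate non-mmap read path.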