Columnix is a columnar storage format, similar to Parquet and ORC.
The goal of the experiment was to beat Parquet's read performance in Spark for flat schemas, while reducing the disk footprint by using newer compression algorithms such as lz4 and zstd.
Columnix supports:
1. row groups
2. indexes (at both the row group and file level)
3. vectorized reads
4. predicate pushdown
5. lazy reads
6. AVX2 and AVX512 predicate matching (sketched below)
7. memory-mapped IO
Spark's Parquet reader supports features 1-4, but has no lazy reads, only limited SIMD support (whatever the JVM provides), and IO goes through HDFS.
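To make the SIMD feature concrete, here is a minimal sketch (not the library's code) of how an AVX2 greater-than predicate over a chunk of a 64-bit column can produce a result bitmap. The helper name `match_i64_gt` and its signature are illustrative assumptions:

```c
/* Sketch only: AVX2 predicate match of a 64-bit column chunk against a
 * constant. Not the columnix API. Compile with -mavx2. */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Set bit i of mask when values[i] > cmp. The caller must zero-initialize
 * mask and size it to (count + 63) / 64 words. */
static void match_i64_gt(const int64_t *values, size_t count, int64_t cmp,
                         uint64_t *mask)
{
    const __m256i needle = _mm256_set1_epi64x(cmp);
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(values + i));
        __m256i gt = _mm256_cmpgt_epi64(v, needle);
        /* collapse the 4 lane-wide comparison results into a 4-bit mask */
        int bits = _mm256_movemask_pd(_mm256_castsi256_pd(gt));
        mask[i / 64] |= (uint64_t)bits << (i % 64);
    }
    for (; i < count; i++)  /* scalar tail */
        if (values[i] > cmp)
            mask[i / 64] |= UINT64_C(1) << (i % 64);
}
```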
Support for complex schemas was not a goal of the project. The format has no support for Parquet's Dremel-style definition & repetition levels or ORC's compound types (struct, list, map, union).
The library does not currently support encoding data prior to (or instead of) compression, for example run-length or dictionary encoding, despite placeholders in the code alluding to it. It was next on the TODO list, but I'd like to explore alternative approaches such as github.com/chriso/treecomp.
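As a rough illustration of the kind of pre-compression encoding meant here, a run-length encoder collapses consecutive repeated values into (value, count) pairs before the result is handed to lz4/zstd. The sketch below is generic C and does not correspond to anything in the library:

```c
/* Sketch only: trivial run-length encoding of a 32-bit column. */
#include <stdint.h>
#include <stddef.h>

struct run { int32_t value; uint32_t length; };

/* Collapse consecutive repeats into (value, length) pairs.
 * Returns the number of runs written to out (out must hold count entries). */
static size_t rle_encode_i32(const int32_t *in, size_t count, struct run *out)
{
    size_t runs = 0;
    for (size_t i = 0; i < count; i++) {
        if (runs && out[runs - 1].value == in[i])
            out[runs - 1].length++;
        else
            out[runs++] = (struct run){ in[i], 1 };
    }
    return runs;
}
```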
The following bindings are provided:
- Python (ctypes): ./contrib/columnix.py
- Spark (JNI): chriso/columnix-spark
One major caveat: the library uses mmap for reads. There is no HDFS compatibility, so real-world use is limited for the time being.
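For context, the read path boils down to mapping a local file into memory, which is why a remote filesystem like HDFS cannot be plugged in directly. A minimal sketch using only POSIX calls (the `map_file` helper is illustrative, not the library's API):

```c
/* Sketch only: memory-map a file read-only and return a pointer to it. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stddef.h>

static const void *map_file(const char *path, size_t *size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    void *ptr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid; release it with munmap(ptr, *size) */
    if (ptr == MAP_FAILED)
        return NULL;
    *size = (size_t)st.st_size;
    return ptr;
}
```

Because mmap needs a local file descriptor, supporting HDFS would require either mounting it locally (e.g. via FUSE) or adding a separate non-mmap read path.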