Primary language: Rust. License: Apache License 2.0 (Apache-2.0).

strawboat

A native storage format based on Apache Arrow.

strawboat is similar to Arrow IPC and is aimed primarily at optimizing the storage layer. We hope to use it in Databend as another storage_format. It is currently at a very early stage.

Differences from Parquet

  • No RowGroup; each column is stored as multiple row-based pages.

We think that multiple RowGroups per file are not useful in a columnar database. To keep things simple, we did not implement RowGroup, which is equivalent to having a single RowGroup per file.

Each column is split into pages with a fixed number of rows, as configured by WriteOptions. As in Parquet, a page is the smallest unit of compression.
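The page splitting described above can be sketched in a few lines of Rust. This is only an illustration, not strawboat's actual code: `PageOptions`, `max_page_rows`, and `split_into_pages` are hypothetical names standing in for the page-size setting in WriteOptions and for the writer's internal logic.

```rust
/// Hypothetical stand-in for the page-size setting carried by WriteOptions.
pub struct PageOptions {
    /// Number of rows per page; every page except possibly the last has exactly this many.
    pub max_page_rows: usize,
}

/// Split one column into consecutive pages of at most `max_page_rows` rows.
/// The pages borrow from the column, so no values are copied.
pub fn split_into_pages<'a, T>(column: &'a [T], opts: &PageOptions) -> Vec<&'a [T]> {
    // `chunks` panics on a chunk size of 0, so reject that up front.
    assert!(opts.max_page_rows > 0, "max_page_rows must be positive");
    column.chunks(opts.max_page_rows).collect()
}
```

With a 10-row column and `max_page_rows` set to 4, this yields three pages of 4, 4, and 2 rows. Each page would then be compressed independently, since the page is the smallest unit of compression.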

  • Zero-overhead reading and writing.

  • No encoding/decoding.

Encoding and decoding in Parquet can add overhead when reading from storage. We want the in-memory format of the data to convert to and from the storage format easily and efficiently, just as with IPC. Encoding can still be combined with compression, as long as you implement a compression algorithm that does similar work.
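To make the "no encoding" idea concrete, here is a minimal sketch of writing a page of primitive values as raw bytes and reading it back. It is an assumption-laden illustration rather than strawboat's implementation: `page_to_bytes` and `page_from_bytes` are hypothetical helpers, and the sketch relies only on the fact that Arrow lays out primitive values as fixed-width little-endian buffers by default.

```rust
/// Serialize a page of i32 values as raw little-endian bytes.
/// There is no value-level encoding: the output has the same layout as the
/// in-memory values buffer on a little-endian machine.
pub fn page_to_bytes(values: &[i32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 4);
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Rebuild the page from raw bytes.
/// Returns `None` if the byte length is not a multiple of 4.
pub fn page_from_bytes(bytes: &[u8]) -> Option<Vec<i32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    let values = bytes
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Some(values)
}
```

In a real writer, a general-purpose compressor would be applied to each page's byte buffer after this step. That step is omitted here to keep the sketch free of external crates.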

Storage Format Layout

TODO

DataTypes

  • Primitive
  • Binary/Utf8
  • Null
  • List
  • Fixed-size binary
  • Fixed-size list
  • Struct
  • Dictionary
  • Union
  • Map

Performance comparison with Parquet

TODO

Examples

// You need a simple Parquet file at /tmp/input.st

// Generate the strawboat file from it
cargo run --example strawboat_file_write --release /tmp/input.st

// Read the strawboat file
cargo run --example strawboat_file_read --release /tmp/input.st

// Read with the Parquet reader for comparison
cargo run --example parquet_read --release /tmp/input.st