
playing with rust dataframe support

Primary language: Rust. License: MIT.

parq

Examples of using Parquet, Arrow, Polars, and other dataframe readers in Rust. main.rs serves as a "sandbox", and each iteration is then moved into examples/.

There's really only one "live" (recently updated) project under the dataframe keyword, which is Polars. Peroxide has a dataframe implementation, but it only supports CSV and NetCDF and has no Parquet/Arrow reader.

The other major live project right now with Arrow support is DataFusion, which feels similar to Spark, with a ctx object. However, this project seems to be aimed very much at SQL querying, and its DataFrame API lags behind the Polars API. It doesn't appear to be possible yet to do the simple "apply a udf on a column to create a new column". So although Ballista, a distributed project, extends DataFusion to multiple machines, it is not yet a replacement for Spark for processing (in contrast to querying).
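To make the "apply a udf on a column to create a new column" operation concrete, here is a minimal sketch in plain Rust with no dataframe crate at all. It is not the Polars or DataFusion API: the Frame and Column types, the with_column_from_str method, and the sample data are all hypothetical and exist only to show the shape of the operation (read one column, run a function on each non-null value, append the result as a new column).

```rust
/// A toy column; `None` models a null value.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Str(Vec<Option<String>>),
    U32(Vec<Option<u32>>),
}

/// A toy dataframe: named columns kept in insertion order.
#[derive(Debug)]
struct Frame {
    columns: Vec<(String, Column)>,
}

impl Frame {
    /// Look up a column by name.
    fn column(&self, name: &str) -> Option<&Column> {
        self.columns.iter().find(|(n, _)| n == name).map(|(_, c)| c)
    }

    /// Apply `udf` to every non-null value of the string column `src`
    /// and append the results as a new u32 column named `dst`.
    /// Nulls in `src` stay null in `dst`.
    fn with_column_from_str<F>(mut self, src: &str, dst: &str, udf: F) -> Result<Self, String>
    where
        F: Fn(&str) -> u32,
    {
        let new_col = match self.column(src) {
            Some(Column::Str(values)) => Column::U32(
                values
                    .iter()
                    .map(|v| v.as_deref().map(|s| udf(s)))
                    .collect(),
            ),
            Some(_) => return Err(format!("column `{}` is not a string column", src)),
            None => return Err(format!("no column named `{}`", src)),
        };
        self.columns.push((dst.to_string(), new_col));
        Ok(self)
    }
}

/// Made-up sample data, not read from any Parquet file.
fn example_frame() -> Frame {
    Frame {
        columns: vec![(
            "two".to_string(),
            Column::Str(vec![Some("f".to_string()), None, Some("dd".to_string())]),
        )],
    }
}

fn main() {
    // The "udf" here is just the character count of each string.
    let df = example_frame()
        .with_column_from_str("two", "len", |s| s.chars().count() as u32)
        .expect("column `two` exists and holds strings");
    println!("{:?}", df.column("len"));
}
```

A real dataframe library does the same thing over Arrow arrays rather than Vec<Option<_>>, and in a lazy API the step is only recorded in a query plan until the result is collected.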

how to run

alex-XPS-13-9350:~/git/parq/ ⌂ cargo run --example polars_w_col_lazy data/example.parquet
   Compiling parq v0.1.0 (/home/alex/git/parq)
    Finished dev [unoptimized + debuginfo] target(s) in 23.65s
     Running `target/debug/examples/polars_w_col_lazy data/example.parquet`
took: 4 ms
[examples/polars_w_col_lazy.rs:21] &df = shape: (3, 4)
┌──────┬─────┬───────────────────┬─────┐
│ one  ┆ two ┆ __index_level_0__ ┆ len │
│ ---  ┆ --- ┆ ---               ┆ --- │
│ f64  ┆ str ┆ str               ┆ u32 │
╞══════╪═════╪═══════════════════╪═════╡
│ -1.0 ┆ f   ┆ a                 ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ b   ┆ b                 ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.4  ┆ d   ┆ c                 ┆ 1   │
└──────┴─────┴───────────────────┴─────┘

associated links