pacman82/odbc2parquet

Parquet Column Encoding

Closed this issue · 4 comments

bkief commented

Is there a way to define column encoding for the parquet format? For example, DELTA_BINARY_PACKED

For reference: https://docs.rs/parquet/4.1.0/parquet/encoding/index.html
It can be specified in the writer properties builder: https://docs.rs/parquet/4.1.0/parquet/file/properties/struct.WriterPropertiesBuilder.html

Right now, there is no command line option to switch the encoding. There is no deeper reason for that other than that nobody has implemented it yet. At first glance, it seems to be only a matter of forwarding command line arguments to the writer properties builder.

Do you want to change the encoding for a particular column, or the default encoding for all columns of a certain type?

bkief commented

I think it should be on a per-column basis, because many of the encoding approaches are best done with some column ordering. Perhaps something like --par_col_encoding col1name=3,col2name=4,col3name=5, where the ints are from the parquet encoding enum (https://github.com/sunchao/parquet-format-rs/blob/master/parquet.thrift).
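
To make the proposed syntax concrete, here is a minimal sketch of how an argument value like col1name=3,col2name=4,col3name=5 could be parsed. It uses only the Rust standard library. The names `ColumnEncoding`, `ParseError` and `parse_col_encodings` are hypothetical and made up for this illustration; they are not part of odbc2parquet or the parquet crate. The integer ids follow the `Encoding` enum in parquet.thrift. A real implementation would presumably map the result onto the parquet crate's own encoding type and pass it to the writer properties builder.

```rust
/// Hypothetical mirror of the `Encoding` enum in parquet.thrift.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnEncoding {
    Plain,
    PlainDictionary,
    Rle,
    BitPacked,
    DeltaBinaryPacked,
    DeltaLengthByteArray,
    DeltaByteArray,
    RleDictionary,
    ByteStreamSplit,
}

impl ColumnEncoding {
    /// Map an integer id from parquet.thrift to an encoding.
    /// Id 1 is not assigned to a usable encoding, so it is rejected.
    fn from_thrift_id(id: i32) -> Option<Self> {
        match id {
            0 => Some(Self::Plain),
            2 => Some(Self::PlainDictionary),
            3 => Some(Self::Rle),
            4 => Some(Self::BitPacked),
            5 => Some(Self::DeltaBinaryPacked),
            6 => Some(Self::DeltaLengthByteArray),
            7 => Some(Self::DeltaByteArray),
            8 => Some(Self::RleDictionary),
            9 => Some(Self::ByteStreamSplit),
            _ => None,
        }
    }
}

/// Hypothetical error type for malformed argument values.
#[derive(Debug, PartialEq, Eq)]
enum ParseError {
    MissingEquals(String),
    EmptyColumnName(String),
    InvalidId(String),
    UnknownEncoding(i32),
}

/// Parse `name=id,name=id,...` into (column name, encoding) pairs.
fn parse_col_encodings(arg: &str) -> Result<Vec<(String, ColumnEncoding)>, ParseError> {
    arg.split(',')
        .map(|pair| {
            // Split each entry at the first '=' into column name and id.
            let (name, id) = pair
                .split_once('=')
                .ok_or_else(|| ParseError::MissingEquals(pair.to_string()))?;
            let name = name.trim();
            if name.is_empty() {
                return Err(ParseError::EmptyColumnName(pair.to_string()));
            }
            let id: i32 = id
                .trim()
                .parse()
                .map_err(|_| ParseError::InvalidId(pair.to_string()))?;
            let encoding =
                ColumnEncoding::from_thrift_id(id).ok_or(ParseError::UnknownEncoding(id))?;
            Ok((name.to_string(), encoding))
        })
        .collect()
}
```

With these ids, the example value above would select RLE for col1name, BIT_PACKED for col2name and DELTA_BINARY_PACKED for col3name. Accepting encoding names instead of raw integers might be easier to read on the command line; the integer form is used here only because it matches the suggestion.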

Please let me know if version 0.6.6 solves your use case. Thanks.