Enable parquet compression options
It appears the Rust parquet library supports compression, but it isn't used during query export. It would be useful to be able to specify the parquet compression type and options via the CLI. I was experimenting with using this to export a large amount of data, and the parquet files were many times larger than doing the same export "manually" (e.g. in Python) with compression enabled.
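For comparison, the "manual" Python export amounts to something like this with pyarrow (the table contents and file name are placeholders, not the actual export):

import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for the data that would come out of the query.
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# write_table compresses the whole file with the given codec (zstd here).
pq.write_table(table, "export.parquet", compression="zstd")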
Seems like a good idea to me. I'll probably look into this during the weekend (no promises though). Could be an easy win if it's just a matter of passing a command line argument to the parquet writer.
New version 0.6.1 published with support for a --column-compression-default command line option. The new default is gzip. I only did the minimal thing here. There are a lot more options that could be forwarded, like encoding, or the ability to control encoding/compression for an individual column (see the Python sketch below for what per-column control looks like).
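For reference, pyarrow already exposes that kind of per-column control. A minimal sketch (the table and column names are made up for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# Example table; the column names are placeholders.
table = pa.table({"id": [1, 2, 3], "payload": ["x", "y", "z"]})

# A dict maps each column to its own codec instead of one global setting.
pq.write_table(
    table,
    "per_column.parquet",
    compression={"id": "snappy", "payload": "zstd"},
)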
Let me know if this is already good enough for your use case, or if more is needed (or at least nice to have).
Cheers, Markus
Awesome, I'll give it a try! Appreciate the update.
If anyone runs into a similar issue or needs custom options and has Python available, here is a short script to re-compress an existing parquet file:
import pyarrow.parquet as pq
import sys

# Open the source file and copy it row group by row group,
# re-encoding everything with ZSTD compression.
pq_file = pq.ParquetFile(sys.argv[1])
with pq.ParquetWriter(sys.argv[2], pq_file.schema_arrow, compression='ZSTD') as writer:
    for ri in range(pq_file.num_row_groups):
        table = pq_file.read_row_group(ri)
        writer.write_table(table)
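Assuming the script is saved as recompress.py (name made up), it takes the input and output paths as arguments:

python recompress.py input.parquet output.parquet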
Thank you for your feedback and the script!