Enable parquet compression options
It appears the Rust parquet library supports compression, but it isn't used during query export. It would be useful to be able to specify the parquet compression type and options via the CLI. I was experimenting with using this to export a large amount of data, and the parquet files were many times larger than doing the same export "manually" (e.g. in Python) with compression enabled.
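For comparison, the "manual" Python export amounts to something like this with pyarrow (the table contents and file name are placeholders, not the actual export):

import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for the data that would come out of the query.
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# write_table compresses the whole file with the given codec (zstd here).
pq.write_table(table, "export.parquet", compression="zstd")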
Seems like a good idea to me. I'll probably look into this during the weekend (no promises though). Could be an easy win if it's just a matter of passing a command line argument to the parquet writer.
New version 0.6.1 published with support for a --column-compression-default command line option. The new default is gzip. I only did the minimal thing here. There are a lot more options that could be forwarded, like encoding, or the ability to control encoding/compression for an individual column (see the Python sketch below for what per-column control looks like).
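For reference, pyarrow already exposes that kind of per-column control. A minimal sketch (the table and column names are made up for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# Example table; the column names are placeholders.
table = pa.table({"id": [1, 2, 3], "payload": ["x", "y", "z"]})

# A dict maps each column to its own codec instead of one global setting.
pq.write_table(
    table,
    "per_column.parquet",
    compression={"id": "snappy", "payload": "zstd"},
)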
Let me know if this is already good enough for your use case, or if more is needed (or at least nice to have).
Cheers, Markus
Awesome, I'll give it a try! Appreciate the update.
If anyone runs into a similar issue or needs custom options and has Python available, here is a short script to re-compress an existing parquet file:
import pyarrow.parquet as pq
import sys

# Open the source file and copy it row group by row group,
# re-encoding everything with ZSTD compression.
pq_file = pq.ParquetFile(sys.argv[1])
with pq.ParquetWriter(sys.argv[2], pq_file.schema_arrow, compression='ZSTD') as writer:
    for ri in range(pq_file.num_row_groups):
        table = pq_file.read_row_group(ri)
        writer.write_table(table)
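Assuming the script is saved as recompress.py (name made up), it takes the input and output paths as arguments:

python recompress.py input.parquet output.parquet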
Thank you for your feedback and the script!