Make zstd the default compression

Question

jonashaag opened this issue 9 months ago · 4 comments

I think zstd (or xz) should be the default compression. Gzip is very slow. All (relatively recent) readers I know support zstd decompression.

Answer 1 · 2023-12-03T16:49:45.000Z

Just to clarify: you can use --column-compression-default to change the default column compression. I assume you are talking about the default of this default?

Personally, I must say I do not have a good understanding of which readers support which format. That said, if every reasonably popular reader from the past 3 years supports zstd, I would change it. If not, I would prefer the works-out-of-the-box experience and leave it to power users to change the compression.

I cannot make the judgement call myself, as I am not informed enough, but that would be the fuzzy boundary for me. If you can tell me this is the case, I will change the default of the default compression.

Answer 2 · 2023-12-03T16:56:14.000Z

Exactly!

PyArrow since 2018: apache/arrow@d4755e4
Pandas with PyArrow or fastparquet since at least 1.1 (2020); I didn't check older versions with fastparquet
Polars since forever
Spark since 2017

Answer 3 · 2023-12-03T17:00:28.000Z

Polars' default zstd compression level seems to be 3. I'd trust their judgement.

Answer 4 · 2023-12-04T08:10:09.000Z

odbc2parquet 4.0.0 has been released. It changes the default compression to zstd with compression level 3.