Make zstd the default compression
jonashaag opened this issue · 4 comments
I think zstd (or xz) should be the default compression. Gzip is very slow. All (relatively recent) readers I know support zstd decompression.
Just to clarify: you can use --column-compression-default to change the default column compression. You are talking (I assume) about the default of this default?
Personally, I must say I do not have a good understanding of which readers support which format. Still, if every reasonably popular reader released in the past 3 years supports this, I would change it. If not, I would prefer the works-out-of-the-box experience and leave it to power users to change the compression.
I cannot make that judgement call myself, as I am not informed enough; that 3-year mark is the fuzzy boundary I would use to decide. If you can tell me this is the case, I will change the default of the default compression.
Exactly!
- PyArrow since 2018: apache/arrow@d4755e4
- Pandas with PyArrow or fastparquet since at least 1.1 (2020); I didn't check any older fastparquet versions
- Polars since forever
- Spark since 2017
Polars' default zstd compression level seems to be 3. I'd trust their judgement.
odbc2parquet 4.0.0 is released. It changed the default compression to zstd with compression level 3.