pacman82/odbc2parquet

Make zstd the default compression

jonashaag opened this issue · 4 comments

I think zstd (or xz) should be the default compression. Gzip is very slow. All (relatively recent) readers I know support zstd decompression.

Just to clarify: you can use --column-compression-default to change the default column compression. You are talking (I assume) about the default of this default?

Personally, I must say I do not have a good understanding of which readers support which format. Yet if every reasonably popular reader of the past 3 years supports this, I would change it. If not, I would prefer the works-out-of-the-box experience and leave it to power users to change the compression.

I cannot make the judgement call myself, as I am not informed enough; that would be the fuzzy boundary for me. If you can tell me this is the case, I will change the default compression.

Exactly!

  • PyArrow since 2018: apache/arrow@d4755e4
  • Pandas with PyArrow or fastparquet at least since 1.1 (2020); I didn't check any older fastparquet versions
  • Polars since forever
  • Spark since 2017

Polars' default zstd compression level seems to be 3. I'd trust their judgement.

odbc2parquet 4.0.0 is released. It changed the default compression to zstd with compression level 3.