flashbots/mempool-dumpster

Archive Compression Considerations

metachris opened this issue · 1 comment

Input data

  • 14 hourly CSVs (raw transactions + timestamp + hash)
  • Transactions: 1,514,668
  • Disk usage: 1.7G
filename                            entries      size
txs-2023-08-07-10-00.csv             15,965       19M
txs-2023-08-07-11-00.csv            106,435      144M
txs-2023-08-07-12-00.csv            117,599      131M
txs-2023-08-07-13-00.csv            117,184      143M
txs-2023-08-07-14-00.csv            126,056      121M
txs-2023-08-07-15-00.csv            125,871      131M
txs-2023-08-07-16-00.csv            124,732      135M
txs-2023-08-07-17-00.csv            122,725      133M
txs-2023-08-07-18-00.csv            117,119      126M
txs-2023-08-07-19-00.csv            113,833      127M
txs-2023-08-07-20-00.csv            109,858      125M
txs-2023-08-07-21-00.csv            105,749      121M
txs-2023-08-07-22-00.csv            112,109      114M
txs-2023-08-07-23-00.csv             99,433      101M
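
A table like the one above can be regenerated from the hourly files with standard tools. This is a hypothetical sketch, assuming the files sit in the current directory and each CSV has a single header row:

```shell
# Print filename, entry count, and size for each hourly CSV.
for f in txs-*.csv; do
  # Subtract 1 for the CSV header row to get the entry count
  entries=$(( $(wc -l < "$f") - 1 ))
  size=$(du -h "$f" | cut -f1)
  printf '%-36s %10s %10s\n' "$f" "$entries" "$size"
done
```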

Compression

method    level     size    ratio    runtime
lz4           9     841M     0.49        38s
lz4          12     840M     0.49     1m 55s
zip           6     644M     0.38        45s
zip           9     640M     0.38     1m 23s
zstd          3     580M     0.34         9s
zstd         14     578M     0.34     2m 45s
zstd         15     577M     0.34     3m 47s
zstd         16     524M     0.31     4m 45s
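
For reference, a comparison like this can be reproduced with the stock CLI tools. A sketch, assuming lz4, zip, and zstd are installed and using one hourly file as input (the exact invocations used for the table above are not shown in this issue):

```shell
f=txs-2023-08-07-12-00.csv

time lz4 -9  "$f" "$f.lz4"       # lz4, level 9
time zip -6  "$f.zip" "$f"       # zip (deflate), level 6
time zstd -3 "$f" -o "$f.zst"    # zstd, level 3 (the default)
# zstd levels above 19 require the --ultra flag, e.g.: zstd --ultra -22 "$f"

ls -lh "$f".{lz4,zip,zst}        # compare compressed sizes
```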

Summarizer Script

  • Runtime: 1m 12s
  • Parquet output size: 74M (using gzip compression)

Going with zip for now: its compression ratio is in the same ballpark as default-level zstd on this type of data, but zip is universally available, so any user can download the archive and extract it without installing additional software.