Archive Compression Considerations
metachris opened this issue · 1 comment
metachris commented
Input data
- 14 hourly CSV files (raw transactions + timestamp + hash)
- Transactions: 1,514,668
- Disk usage: 1.7G
Filename | Entries | Size
---|---|---
txs-2023-08-07-10-00.csv | 15,965 | 19M
txs-2023-08-07-11-00.csv | 106,435 | 144M
txs-2023-08-07-12-00.csv | 117,599 | 131M
txs-2023-08-07-13-00.csv | 117,184 | 143M
txs-2023-08-07-14-00.csv | 126,056 | 121M
txs-2023-08-07-15-00.csv | 125,871 | 131M
txs-2023-08-07-16-00.csv | 124,732 | 135M
txs-2023-08-07-17-00.csv | 122,725 | 133M
txs-2023-08-07-18-00.csv | 117,119 | 126M
txs-2023-08-07-19-00.csv | 113,833 | 127M
txs-2023-08-07-20-00.csv | 109,858 | 125M
txs-2023-08-07-21-00.csv | 105,749 | 121M
txs-2023-08-07-22-00.csv | 112,109 | 114M
txs-2023-08-07-23-00.csv | 99,433 | 101M
Compression
Method | Level | Size | Ratio | Runtime
---|---|---|---|---
lz4 | 9 | 841M | 0.49 | 38s |
lz4 | 12 | 840M | 0.49 | 1m 55s |
zip | 6 | 644M | 0.38 | 45s |
zip | 9 | 640M | 0.38 | 1m 23s |
zstd | 3 | 580M | 0.34 | 9s |
zstd | 14 | 578M | 0.34 | 2m 45s |
zstd | 15 | 577M | 0.34 | 3m 47s |
zstd | 16 | 524M | 0.31 | 4m 45s |
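The exact benchmark commands aren't shown in the issue; a minimal sketch of how such ratios can be measured from Python's standard library (`zlib` uses the same DEFLATE algorithm as zip, with the same 1–9 levels; lz4 and zstd would need the third-party `lz4` and `zstandard` packages):

```python
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Ratio of compressed size to original size (lower is better)."""
    return len(zlib.compress(data, level)) / len(data)

# CSV-like sample with repetitive structure, loosely resembling the tx dumps.
sample = b"\n".join(
    b"txs-2023-08-07,0xdeadbeef%08d,1691402400" % i for i in range(10_000)
)

print(f"level 6: {compression_ratio(sample, 6):.2f}")
print(f"level 9: {compression_ratio(sample, 9):.2f}")
```

On highly structured data like these CSVs the level bump buys very little, which matches the tables above: level 9 vs. level 6 zip only saved 4M of 644M.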
Summarizer Script
- Runtime: 1m 12s
- Parquet output size: 74M (using gzip compression)
metachris commented
Going with zip for now: its compression ratio is in the same ballpark as default zstd on this type of data, and it is generally available (any user can download the archive and extract it without installing additional software).