flashbots/mempool-dumpster

Archive Compression Considerations

metachris opened this issue · 1 comment

Input data

  • 14 hourly CSVs (raw transactions + timestamp + hash)
  • Transactions: 1,514,668
  • Disk usage: 1.7G
filename                            entries      size
txs-2023-08-07-10-00.csv             15,965       19M
txs-2023-08-07-11-00.csv            106,435      144M
txs-2023-08-07-12-00.csv            117,599      131M
txs-2023-08-07-13-00.csv            117,184      143M
txs-2023-08-07-14-00.csv            126,056      121M
txs-2023-08-07-15-00.csv            125,871      131M
txs-2023-08-07-16-00.csv            124,732      135M
txs-2023-08-07-17-00.csv            122,725      133M
txs-2023-08-07-18-00.csv            117,119      126M
txs-2023-08-07-19-00.csv            113,833      127M
txs-2023-08-07-20-00.csv            109,858      125M
txs-2023-08-07-21-00.csv            105,749      121M
txs-2023-08-07-22-00.csv            112,109      114M
txs-2023-08-07-23-00.csv             99,433      101M
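
A table like the one above can be regenerated from the hourly files with standard tools. This is a hypothetical sketch, assuming the files sit in the current directory and each CSV has a single header row:

```shell
# Print filename, entry count, and size for each hourly CSV.
for f in txs-*.csv; do
  # Subtract 1 for the CSV header row to get the entry count
  entries=$(( $(wc -l < "$f") - 1 ))
  size=$(du -h "$f" | cut -f1)
  printf '%-36s %10s %10s\n' "$f" "$entries" "$size"
done
```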

Compression

method    level     size    ratio    runtime
lz4           9     841M     0.49        38s
lz4          12     840M     0.49     1m 55s
zip           6     644M     0.38        45s
zip           9     640M     0.38     1m 23s
zstd          3     580M     0.34         9s
zstd         14     578M     0.34     2m 45s
zstd         15     577M     0.34     3m 47s
zstd         16     524M     0.31     4m 45s
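
For reference, a comparison like this can be reproduced with the stock CLI tools. A sketch, assuming lz4, zip, and zstd are installed and using one hourly file as input (the exact invocations used for the table above are not shown in this issue):

```shell
f=txs-2023-08-07-12-00.csv

time lz4 -9  "$f" "$f.lz4"       # lz4, level 9
time zip -6  "$f.zip" "$f"       # zip (deflate), level 6
time zstd -3 "$f" -o "$f.zst"    # zstd, level 3 (the default)
# zstd levels above 19 require the --ultra flag, e.g.: zstd --ultra -22 "$f"

ls -lh "$f".{lz4,zip,zst}        # compare compressed sizes
```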

Summarizer Script

  • Runtime: 1m 12s
  • Parquet output size: 74M (using gzip compression)

Going with zip for now: its compression ratio is in the same ballpark as default-level zstd on this type of data, but zip is universally available, so any user can download the archive and extract it without installing additional software.