flashbots/mempool-dumpster

Mempool Collector Stats + Discussions

metachris opened this issue

Some early stats about transactions collected and stored with the mempool-collector:

Hourly stats:

  • 80k - 120k transactions
  • CSV file
    • <timestampMillis>,<hash>,<rawTx> (see the Go sketch after these lists)
    • 150MB uncompressed
    • 54MB gzipped

Extrapolated to a day:

  • Up to 3M transactions
  • Up to 1.5 GB compressed CSV raw data (per collector instance)
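
For illustration, a minimal Go sketch of writing one transaction per line in the <timestampMillis>,<hash>,<rawTx> format; the file name and function name are placeholders, not the collector's actual code:

package main

import (
	"encoding/csv"
	"encoding/hex"
	"os"
	"strconv"
	"time"
)

// writeTxLine appends one transaction as <timestampMillis>,<hash>,<rawTx>.
// hash is the 0x-prefixed transaction hash, rawTx the raw encoded transaction bytes.
func writeTxLine(w *csv.Writer, receivedAt time.Time, hash string, rawTx []byte) error {
	return w.Write([]string{
		strconv.FormatInt(receivedAt.UnixMilli(), 10),
		hash,
		"0x" + hex.EncodeToString(rawTx),
	})
}

func main() {
	f, err := os.Create("transactions.csv") // hypothetical output file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	w := csv.NewWriter(f)
	defer w.Flush()

	// Example row with placeholder values.
	_ = writeTxLine(w, time.Now(), "0xabc...", []byte{0x02, 0xf8})
}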

Note 2023-08-07: The stats below are outdated as they are based on the test storage method of one JSON file per transaction. Storage has now been updated to write into one CSV file per hour, which has very different compression characteristics.

Data collection

JSON file example: https://github.com/flashbots/mempool-archiver/blob/main/docs/example-tx-summary.json

Per hour:

  • 70k - 100k transactions
  • 150 - 500MB of JSON files written

Extrapolated to a day:

  • Up to 2.5M transactions
  • Up to 12 GB disk usage

Data size & compression

Looking at one particular hour specifically: 2023-08-04 UTC between [01:00, 02:00[:

  • Unique tx: 78,757 (find ./ -type f | wc -l)
  • Disk usage: 373 MB (du --si -s)
  • Apparent size: 134 MB (du --si -s --apparent-size)
  • Average file size: 1.584 KB (ls -l | gawk '{sum += $5; n++;} END {print n" "sum" "sum/n;}')

gzip individual JSON files:

  • Typical file size reduction: 50%
  • But since the files are very small, it doesn't actually decrease the disk usage:
$ du --si -s *
373M    h01
350M    h01_gz

$ du --si -s * --apparent-size
134M    h01
76M     h01_gz

more about "apparent-size": https://man7.org/linux/man-pages/man1/du.1.html

   --apparent-size
              print apparent sizes rather than device usage; although
              the apparent size is usually smaller, it may be larger due
              to holes in ('sparse') files, internal fragmentation,
              indirect blocks, and the like

zipping an hourly folder:

  • 80% reduction in disk space needed (373 MB -> 77 MB)
$ zip -r h01 h01

$ ls -alh h01.zip
-rw-r--r-- 1 ubuntu ubuntu 74M Aug  4 10:30 h01.zip

$ du --si h01.zip
77M     h01.zip

$ du --si h01.zip --apparent-size
77M     h01.zip

Indeed, disk usage reports the disk space actually used by the file. Since the filesystem stores data in blocks, a non-sparse file occupies diskBlockSize * roundUp(fileSize / diskBlockSize) >= fileSize bytes on disk. Therefore, compressing many files individually, when they are smaller than diskBlockSize, is not going to change the actual disk usage much.
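
A quick sanity check of that formula, assuming a 4 KiB filesystem block size and the ~1.6 KB average file size measured above (a standalone sketch, not part of the collector):

package main

import "fmt"

func main() {
	const blockSize = 4096 // assumed filesystem block size in bytes

	// On-disk footprint of a non-sparse file: block count rounded up, times block size.
	onDisk := func(fileSize int64) int64 {
		blocks := (fileSize + blockSize - 1) / blockSize
		return blocks * blockSize
	}

	// A ~1.6 KB JSON file and its ~50% smaller gzipped version both still
	// occupy one full 4 KiB block, so du barely changes.
	fmt.Println(onDisk(1600)) // 4096
	fmt.Println(onDisk(800))  // 4096
}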

Some stats after creating a summary file of 1,423,508 transactions in CSV and Parquet format:

Format    Size      Signature   Compression
CSV       314 MB    No          -
Parquet   118 MB    No          Snappy
CSV       529 MB    Yes         -
Parquet   248 MB    Yes         Snappy

Perhaps the individual tx JSON file should also not contain the signature, to save 20-40% of the storage space, and because the signature is part of the rawTx anyway 🤔
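
For illustration, a hedged Go sketch (using go-ethereum, not necessarily how this project does it) showing that v, r and s can be recovered from the rawTx bytes alone, which is why storing the signature separately is redundant:

package main

import (
	"fmt"
	"log"
	"math/big"

	"github.com/ethereum/go-ethereum/core/types"
)

// signatureFromRaw decodes a raw (binary-encoded) transaction and returns the
// v, r, s signature values that are already embedded in it.
func signatureFromRaw(rawTx []byte) (v, r, s *big.Int, err error) {
	var tx types.Transaction
	if err := tx.UnmarshalBinary(rawTx); err != nil {
		return nil, nil, nil, err
	}
	v, r, s = tx.RawSignatureValues()
	return v, r, s, nil
}

func main() {
	// rawTx would come from the collector output; empty placeholder here.
	var rawTx []byte
	if v, r, s, err := signatureFromRaw(rawTx); err != nil {
		log.Println("decode failed:", err)
	} else {
		fmt.Println(v, r, s)
	}
}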

ra-- commented

Everything is part of the rawTx, apart from the timestamp and chainId.

It's still convenient for the summarizer service to not have to parse every single rawTx and extract the fields, although that doesn't seem like too much to ask either.

I'm still undecided whether it's preferable to have the collector store some fields, or only store rawTx + timestamp (leaning towards only rawTx+timestamp, and batched+gzipped)
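
As a rough sketch of the "summarizer parses rawTx" option with go-ethereum (struct name and field selection are hypothetical, not the project's actual schema):

package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/core/types"
)

// TxSummary is a hypothetical summary row derived purely from rawTx + timestamp.
type TxSummary struct {
	TimestampMillis int64
	Hash            string
	From            string
	To              string
	Nonce           uint64
	Value           string
	Gas             uint64
	GasPrice        string
	ChainID         string
}

// summarize extracts summary fields from the raw transaction bytes.
func summarize(timestampMillis int64, rawTx []byte) (*TxSummary, error) {
	var tx types.Transaction
	if err := tx.UnmarshalBinary(rawTx); err != nil {
		return nil, err
	}

	// Sender recovery uses the signature and chain ID embedded in rawTx.
	from, err := types.Sender(types.LatestSignerForChainID(tx.ChainId()), &tx)
	if err != nil {
		return nil, err
	}

	to := "" // contract creation has no "to" address
	if tx.To() != nil {
		to = tx.To().Hex()
	}

	return &TxSummary{
		TimestampMillis: timestampMillis,
		Hash:            tx.Hash().Hex(),
		From:            from.Hex(),
		To:              to,
		Nonce:           tx.Nonce(),
		Value:           tx.Value().String(),
		Gas:             tx.Gas(),
		GasPrice:        tx.GasPrice().String(),
		ChainID:         tx.ChainId().String(),
	}, nil
}

func main() {
	// rawTx bytes would come from the hourly CSV; nil placeholder only.
	if s, err := summarize(0, nil); err != nil {
		fmt.Println("decode failed:", err)
	} else {
		fmt.Printf("%+v\n", s)
	}
}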