Mempool Collector Stats + Discussions
metachris opened this issue · 5 comments
Some early stats about transactions collected and stored with the mempool-collector:
Hourly stats:
- 80k - 120k transactions
- CSV file, one line per transaction: `<timestampMillis>,<hash>,<rawTx>` (see the writer sketch below)
- 150 MB uncompressed
- 54 MB gzipped
Extrapolated to a day:
- Up to 3M transactions
- Up to 1.5 GB compressed CSV raw data (per collector instance)
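A minimal sketch (in Go) of what writing that hourly gzipped CSV could look like; the file name, function name, and per-row timestamp handling are illustrative assumptions, not the collector's actual code:

```go
package main

import (
	"compress/gzip"
	"encoding/csv"
	"os"
	"strconv"
	"time"
)

// writeHourlyCSV writes <timestampMillis>,<hash>,<rawTx> rows into a gzipped CSV file.
// txs maps tx hash -> raw transaction hex; both are illustrative placeholders.
func writeHourlyCSV(filename string, txs map[string]string) error {
	f, err := os.Create(filename)
	if err != nil {
		return err
	}
	defer f.Close()

	gz := gzip.NewWriter(f)
	defer gz.Close()

	w := csv.NewWriter(gz)
	defer w.Flush()

	for hash, rawTx := range txs {
		// In the real collector this would be the time the tx was first seen.
		ts := strconv.FormatInt(time.Now().UnixMilli(), 10)
		if err := w.Write([]string{ts, hash, rawTx}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Placeholder hash -> rawTx hex entries.
	_ = writeHourlyCSV("transactions-h01.csv.gz", map[string]string{
		"0xabc": "0x02f8",
	})
}
```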
Note 2023-08-07: The stats below are outdated as they are based on the test storage method of one JSON file per transaction. Storage has now been updated to write into one CSV file per hour, which has very different compression characteristics.
Data collection
JSON file example: https://github.com/flashbots/mempool-archiver/blob/main/docs/example-tx-summary.json
Per hour:
- 70k - 100k transactions
- 150 - 500MB of JSON files written
Extrapolated to a day:
- Up to 2.5M transactions
- Up to 12 GB disk usage
Data size & compression
Looking at one specific hour, 2023-08-04 UTC, [01:00, 02:00):
- Unique tx: 78,757 (`find ./ -type f | wc -l`)
- Disk usage: 373 MB (`du --si -s`)
- Apparent size: 134 MB (`du --si -s --apparent-size`)
- Average file size: 1.584 KB (`ls -l | gawk '{sum += $5; n++;} END {print n" "sum" "sum/n;}'`)
gzipping individual JSON files:
- Typical file size reduction: 50%
- But since the files are very small, it doesn't actually decrease the disk usage:
$ du --si -s *
373M h01
350M h01_gz
$ du --si -s * --apparent-size
134M h01
76M h01_gz
More about `--apparent-size` (from https://man7.org/linux/man-pages/man1/du.1.html):
--apparent-size
print apparent sizes rather than device usage; although
the apparent size is usually smaller, it may be larger due
to holes in ('sparse') files, internal fragmentation,
indirect blocks, and the like
zipping an hourly folder:
- 80% reduction in disk space needed (373 MB -> 77 MB)
$ zip -r h01 h01
$ ls -alh h01.zip
-rw-r--r-- 1 ubuntu ubuntu 74M Aug 4 10:30 h01.zip
$ du --si h01.zip
77M h01.zip
$ du --si h01.zip --apparent-size
77M h01.zip
indeed, disk usage reports the disk space actually used by the file. since the filesystem stores data in blocks, a non-sparse file occupies `diskBlockSize * roundUp(fileSize / diskBlockSize)` bytes, which is `>= fileSize`. therefore zipping many small files (smaller than `diskBlockSize`) is not going to change much with respect to actual disk usage.
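A quick sketch of that rounding effect, assuming a typical 4 KiB filesystem block size (the file sizes are just illustrative):

```go
package main

import "fmt"

// onDiskSize rounds a file size up to the next multiple of the filesystem
// block size, which is what a non-sparse file actually occupies on disk.
func onDiskSize(fileSize, blockSize int64) int64 {
	return blockSize * ((fileSize + blockSize - 1) / blockSize)
}

func main() {
	const blockSize = 4096 // assumed typical ext4 block size

	// A ~1.6 KB JSON file occupies a full 4 KiB block either way,
	// so gzipping it down to ~0.8 KB does not reduce actual disk usage.
	fmt.Println(onDiskSize(1584, blockSize)) // 4096
	fmt.Println(onDiskSize(792, blockSize))  // 4096
}
```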
Some stats after creating a summary file of 1,423,508 transactions in CSV and Parquet format:
Format | Size | Signature | Compression |
---|---|---|---|
CSV | 314 MB | No | - |
Parquet | 118 MB | No | Snappy |
CSV | 529 MB | Yes | - |
Parquet | 248 MB | Yes | Snappy |
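For reference, a hedged sketch of writing such summary rows to Snappy-compressed Parquet using the third-party github.com/xitongsys/parquet-go library; the `TxSummary` schema and field names are assumptions, not the summarizer's actual schema:

```go
package main

import (
	"log"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/writer"
)

// TxSummary is an illustrative schema, not the real summarizer's.
type TxSummary struct {
	Timestamp int64  `parquet:"name=timestamp, type=INT64"`
	Hash      string `parquet:"name=hash, type=BYTE_ARRAY, convertedtype=UTF8"`
	From      string `parquet:"name=from, type=BYTE_ARRAY, convertedtype=UTF8"`
	To        string `parquet:"name=to, type=BYTE_ARRAY, convertedtype=UTF8"`
}

func main() {
	fw, err := local.NewLocalFileWriter("summary.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer fw.Close()

	pw, err := writer.NewParquetWriter(fw, new(TxSummary), 4)
	if err != nil {
		log.Fatal(err)
	}
	pw.CompressionType = parquet.CompressionCodec_SNAPPY // Snappy, as in the table above

	// Placeholder row.
	row := TxSummary{Timestamp: 1691110800000, Hash: "0xabc", From: "0x111", To: "0x222"}
	if err := pw.Write(row); err != nil {
		log.Fatal(err)
	}
	if err := pw.WriteStop(); err != nil {
		log.Fatal(err)
	}
}
```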
Perhaps the individual tx JSON file should also not contain the signature, to save 20-40% of the storage space, and because the signature is part of the rawTx anyway 🤔
everything is part of the rawtx, apart from timestamp and chainId
It's still convenient for the summarizer service to not have to parse every single rawTx and extract the fields, although that doesn't seem like too much to ask either.
I'm still undecided whether it's preferable to have the collector store some extracted fields, or only rawTx + timestamp (leaning towards only rawTx + timestamp, batched and gzipped).
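For context on that trade-off, recovering the fields from a stored rawTx is only a few lines with go-ethereum. A minimal sketch (not the actual summarizer code; the input hex is a placeholder):

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common/hexutil"
	"github.com/ethereum/go-ethereum/core/types"
)

// summarizeRawTx extracts summary fields from a hex-encoded raw transaction.
// UnmarshalBinary handles both legacy RLP and typed (EIP-2718) transactions.
func summarizeRawTx(rawTxHex string) error {
	raw, err := hexutil.Decode(rawTxHex)
	if err != nil {
		return err
	}
	tx := new(types.Transaction)
	if err := tx.UnmarshalBinary(raw); err != nil {
		return err
	}
	// Nonce, to, gas, value, etc. come straight from the decoded tx; the
	// sender is recovered from the signature. The receive timestamp is the
	// one thing that has to be stored by the collector.
	from, err := types.Sender(types.LatestSignerForChainID(tx.ChainId()), tx)
	if err != nil {
		return err
	}
	fmt.Println(tx.Hash(), tx.Nonce(), tx.To(), tx.Gas(), from)
	return nil
}

func main() {
	// "0x..." is a placeholder for a collected rawTx hex string.
	_ = summarizeRawTx("0x...")
}
```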