Mempool Collector Stats + Discussions
metachris opened this issue · 5 comments
Some early stats about transactions collected and stored with the mempool-collector:
Hourly stats:
- 80k - 120k transactions
- CSV file, one line per transaction: `<timestampMillis>,<hash>,<rawTx>` (see the writer sketch below)
- 150 MB uncompressed
- 54 MB gzipped
Extrapolated to a day:
- Up to 3M transactions
- Up to 1.5 GB compressed CSV raw data (per collector instance)
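A minimal sketch (in Go) of what writing that hourly gzipped CSV could look like; the file name, function name, and per-row timestamp handling are illustrative assumptions, not the collector's actual code:

```go
package main

import (
	"compress/gzip"
	"encoding/csv"
	"os"
	"strconv"
	"time"
)

// writeHourlyCSV writes <timestampMillis>,<hash>,<rawTx> rows into a gzipped CSV file.
// txs maps tx hash -> raw transaction hex; both are illustrative placeholders.
func writeHourlyCSV(filename string, txs map[string]string) error {
	f, err := os.Create(filename)
	if err != nil {
		return err
	}
	defer f.Close()

	gz := gzip.NewWriter(f)
	defer gz.Close()

	w := csv.NewWriter(gz)
	defer w.Flush()

	for hash, rawTx := range txs {
		// In the real collector this would be the time the tx was first seen.
		ts := strconv.FormatInt(time.Now().UnixMilli(), 10)
		if err := w.Write([]string{ts, hash, rawTx}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Placeholder hash -> rawTx hex entries.
	_ = writeHourlyCSV("transactions-h01.csv.gz", map[string]string{
		"0xabc": "0x02f8",
	})
}
```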
Note 2023-08-07: The stats below are outdated as they are based on the test storage method of one JSON file per transaction. Storage has now been updated to write into one CSV file per hour, which has very different compression characteristics.
Data collection
JSON file example: https://github.com/flashbots/mempool-archiver/blob/main/docs/example-tx-summary.json
Per hour:
- 70k - 100k transactions
- 150 - 500MB of JSON files written
Extrapolated to a day:
- Up to 2.5M transactions
- Up to 12 GB disk usage
Data size & compression
Looking at one specific hour, 2023-08-04 UTC, [01:00, 02:00):
- Unique tx: 78,757 (`find ./ -type f | wc -l`)
- Disk usage: 373 MB (`du --si -s`)
- Apparent size: 134 MB (`du --si -s --apparent-size`)
- Average file size: 1.584 KB (`ls -l | gawk '{sum += $5; n++;} END {print n" "sum" "sum/n;}'`)
gzipping individual JSON files:
- Typical file size reduction: 50%
- But since the files are very small, it doesn't actually decrease the disk usage:
$ du --si -s *
373M h01
350M h01_gz
$ du --si -s * --apparent-size
134M h01
76M h01_gz
More about `--apparent-size` (from https://man7.org/linux/man-pages/man1/du.1.html):
--apparent-size
print apparent sizes rather than device usage; although
the apparent size is usually smaller, it may be larger due
to holes in ('sparse') files, internal fragmentation,
indirect blocks, and the like
zipping an hourly folder:
- 80% reduction in disk space needed (373 MB -> 77 MB)
$ zip -r h01 h01
$ ls -alh h01.zip
-rw-r--r-- 1 ubuntu ubuntu 74M Aug 4 10:30 h01.zip
$ du --si h01.zip
77M h01.zip
$ du --si h01.zip --apparent-size
77M h01.zip
indeed, disk usage reports the disk space actually used by the file. since the filesystem stores data in blocks, a non-sparse file occupies `diskBlockSize * roundUp(fileSize / diskBlockSize)` bytes, which is `>= fileSize`. therefore zipping many small files (smaller than `diskBlockSize`) is not going to change much with respect to actual disk usage.
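A quick sketch of that rounding effect, assuming a typical 4 KiB filesystem block size (the file sizes are just illustrative):

```go
package main

import "fmt"

// onDiskSize rounds a file size up to the next multiple of the filesystem
// block size, which is what a non-sparse file actually occupies on disk.
func onDiskSize(fileSize, blockSize int64) int64 {
	return blockSize * ((fileSize + blockSize - 1) / blockSize)
}

func main() {
	const blockSize = 4096 // assumed typical ext4 block size

	// A ~1.6 KB JSON file occupies a full 4 KiB block either way,
	// so gzipping it down to ~0.8 KB does not reduce actual disk usage.
	fmt.Println(onDiskSize(1584, blockSize)) // 4096
	fmt.Println(onDiskSize(792, blockSize))  // 4096
}
```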
Some stats after creating a summary file of 1,423,508 transactions in CSV and Parquet format:
Format | Size | Signature | Compression |
---|---|---|---|
CSV | 314 MB | No | - |
Parquet | 118 MB | No | Snappy |
CSV | 529 MB | Yes | - |
Parquet | 248 MB | Yes | Snappy |
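For reference, a hedged sketch of writing such summary rows to Snappy-compressed Parquet using the third-party github.com/xitongsys/parquet-go library; the `TxSummary` schema and field names are assumptions, not the summarizer's actual schema:

```go
package main

import (
	"log"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/writer"
)

// TxSummary is an illustrative schema, not the real summarizer's.
type TxSummary struct {
	Timestamp int64  `parquet:"name=timestamp, type=INT64"`
	Hash      string `parquet:"name=hash, type=BYTE_ARRAY, convertedtype=UTF8"`
	From      string `parquet:"name=from, type=BYTE_ARRAY, convertedtype=UTF8"`
	To        string `parquet:"name=to, type=BYTE_ARRAY, convertedtype=UTF8"`
}

func main() {
	fw, err := local.NewLocalFileWriter("summary.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer fw.Close()

	pw, err := writer.NewParquetWriter(fw, new(TxSummary), 4)
	if err != nil {
		log.Fatal(err)
	}
	pw.CompressionType = parquet.CompressionCodec_SNAPPY // Snappy, as in the table above

	// Placeholder row.
	row := TxSummary{Timestamp: 1691110800000, Hash: "0xabc", From: "0x111", To: "0x222"}
	if err := pw.Write(row); err != nil {
		log.Fatal(err)
	}
	if err := pw.WriteStop(); err != nil {
		log.Fatal(err)
	}
}
```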
Perhaps the individual tx JSON file should also not contain the signature, to save 20-40% of the storage space, and because the signature is part of the rawTx anyway 🤔
everything is part of the rawtx, apart from timestamp and chainId
It's still convenient for the summarizer service to not have to parse every single rawTx and extract the fields, although that doesn't seem like too much to ask either.
I'm still undecided whether it's preferable to have the collector store some extracted fields, or only rawTx + timestamp (leaning towards only rawTx + timestamp, batched and gzipped).
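For context on that trade-off, recovering the fields from a stored rawTx is only a few lines with go-ethereum. A minimal sketch (not the actual summarizer code; the input hex is a placeholder):

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common/hexutil"
	"github.com/ethereum/go-ethereum/core/types"
)

// summarizeRawTx extracts summary fields from a hex-encoded raw transaction.
// UnmarshalBinary handles both legacy RLP and typed (EIP-2718) transactions.
func summarizeRawTx(rawTxHex string) error {
	raw, err := hexutil.Decode(rawTxHex)
	if err != nil {
		return err
	}
	tx := new(types.Transaction)
	if err := tx.UnmarshalBinary(raw); err != nil {
		return err
	}
	// Nonce, to, gas, value, etc. come straight from the decoded tx; the
	// sender is recovered from the signature. The receive timestamp is the
	// one thing that has to be stored by the collector.
	from, err := types.Sender(types.LatestSignerForChainID(tx.ChainId()), tx)
	if err != nil {
		return err
	}
	fmt.Println(tx.Hash(), tx.Nonce(), tx.To(), tx.Gas(), from)
	return nil
}

func main() {
	// "0x..." is a placeholder for a collected rawTx hex string.
	_ = summarizeRawTx("0x...")
}
```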