Mempool Dumpster 🗑️♻️
Dump mempool transactions from EL nodes, and archive them in Parquet and CSV format.
The data is freely available at https://mempool-dumpster.flashbots.net
Output files:
- Raw transactions CSV (`timestamp_ms`, `tx_hash`, `rlp_hex`; about 800MB/day zipped)
- Sourcelog CSV - a list of which transactions were received by which source (`timestamp_ms`, `hash`, `source`; about 100MB/day zipped)
- Transaction metadata in CSV and Parquet format (~100MB/day zipped)
- Summary file with information about transaction sources and latency (example)
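For downstream tooling, a sourcelog row can be parsed into a small struct. Here is a minimal Go sketch based on the column list above; the `SourcelogRow` type and `parseSourcelogRow` helper are illustrative, not part of this repo:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// SourcelogRow mirrors the sourcelog CSV columns: timestamp_ms, hash, source.
type SourcelogRow struct {
	TimestampMs int64
	Hash        string
	Source      string
}

// parseSourcelogRow parses one CSV line into a SourcelogRow.
func parseSourcelogRow(line string) (SourcelogRow, error) {
	parts := strings.Split(line, ",")
	if len(parts) != 3 {
		return SourcelogRow{}, fmt.Errorf("expected 3 fields, got %d", len(parts))
	}
	ts, err := strconv.ParseInt(parts[0], 10, 64)
	if err != nil {
		return SourcelogRow{}, err
	}
	return SourcelogRow{TimestampMs: ts, Hash: parts[1], Source: parts[2]}, nil
}

func main() {
	row, err := parseSourcelogRow("1691402000123,0xabc123,local")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", row)
}
```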
Available mempool sources:
- Generic EL nodes (`newPendingTransactions`) (e.g. go-ethereum, Infura, etc.)
- Alchemy (`alchemy_pendingTransactions`)
- bloXroute (at least "Professional" plan)
- Chainbound Fiber
Notes:
- This project is under active development, although relatively stable and ready to use in production
- Observing about 1M - 1.5M unique transactions per day
FAQ
- What is a-pool? ... A-Pool is a regular geth node with some optimized peering settings, which the collector subscribes to over the network.
- What are exclusive transactions? ... Transactions that were received from only a single source and seen by no other.
System architecture
- Collector: Connects to EL nodes and writes new mempool transactions to CSV files. Multiple collector instances can run without colliding.
- Merger: Takes collector CSV files as input, de-duplicates, sorts by timestamp and writes CSV + Parquet output files.
- Analyzer: Analyzes sourcelog CSV files and produces summary report.
- Website: runs the website in dev-mode, and builds + uploads it.
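The analyzer's per-source statistics boil down to counting which sources saw which transactions. A hypothetical sketch of counting "exclusive" transactions (seen by exactly one source); `countExclusive` and its input shape are illustrative, not the repo's actual code:

```go
package main

import "fmt"

// countExclusive returns, per source, the number of transactions that were
// reported by that source and no other ("exclusive" transactions).
// The input maps tx hash -> list of distinct sources that reported it.
func countExclusive(seen map[string][]string) map[string]int {
	out := make(map[string]int)
	for _, sources := range seen {
		if len(sources) == 1 {
			out[sources[0]]++
		}
	}
	return out
}

func main() {
	seen := map[string][]string{
		"0xaaa": {"local"},
		"0xbbb": {"local", "bloxroute"},
		"0xccc": {"bloxroute"},
	}
	fmt.Println(countExclusive(seen)) // map[bloxroute:1 local:1]
}
```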
Getting started
Mempool Collector
- Subscribes to new pending transactions at various data sources
- Writes `timestamp_ms` + `hash` + `raw_tx` to a CSV file (one file per hour by default)
- Note: the collector can store transactions repeatedly; only the merger will properly deduplicate them later
Default filenames:
Transactions
- Schema:
<out_dir>/<date>/transactions/txs_<date>_<uid>.csv
- Example:
out/2023-08-07/transactions/txs_2023-08-07-10-00_collector1.csv
Sourcelog
- Schema:
<out_dir>/<date>/sourcelog/src_<date>_<uid>.csv
- Example:
out/2023-08-07/sourcelog/src_2023-08-07-10-00_collector1.csv
Running the mempool collector:
# print help
go run cmd/collector/main.go -help
# Connect to ws://localhost:8546 and write CSVs into ./out
go run cmd/collector/main.go -out ./out
# Connect to multiple nodes
go run cmd/collector/main.go -out ./out -nodes ws://server1.com:8546,ws://server2.com:8546
Merger
- Iterates over collector output directory / CSV files
- Deduplicates transactions, sorts them by timestamp
go run cmd/merge/main.go -h
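The core merge step (deduplicate, then sort by timestamp) can be sketched as below. Keeping the earliest sighting per hash is an assumption for illustration, and `mergeRecords` is not the repo's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// TxRecord is one collected transaction row (timestamp_ms, hash).
type TxRecord struct {
	TimestampMs int64
	Hash        string
}

// mergeRecords deduplicates records by hash, keeping the earliest
// sighting, and returns them sorted by timestamp.
func mergeRecords(records []TxRecord) []TxRecord {
	earliest := make(map[string]TxRecord)
	for _, r := range records {
		if prev, ok := earliest[r.Hash]; !ok || r.TimestampMs < prev.TimestampMs {
			earliest[r.Hash] = r
		}
	}
	out := make([]TxRecord, 0, len(earliest))
	for _, r := range earliest {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].TimestampMs < out[j].TimestampMs })
	return out
}

func main() {
	in := []TxRecord{{300, "0xb"}, {100, "0xa"}, {200, "0xa"}}
	fmt.Println(mergeRecords(in)) // [{100 0xa} {300 0xb}]
}
```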
Architecture
General design goals
- Keep it simple and stupid
- Vendor-agnostic (main flow should work on any server, independent of a cloud provider)
- Downtime-resilience to minimize any gaps in the archive
- Multiple collector instances can run concurrently, without getting in each other's way
- Merger produces the final archive (based on the input of multiple collector outputs)
- The final archive:
- Includes (1) a Parquet file with transaction metadata, and (2) a compressed file of the raw transaction CSVs
- Compatible with Clickhouse and S3 Select (Parquet using gzip compression)
- Easily distributable as torrent
Collector
NodeConnection
- One for each EL connection
- New pending transactions are sent to `TxProcessor` via a channel
TxProcessor
- Checks whether it has already processed the tx
- Stores it in the output directory
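A minimal sketch of this fan-in pattern, with multiple node connections feeding one channel and a seen-set filter in the processor (names and structure are illustrative, not the repo's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// txProcessor receives tx hashes from many node connections over a single
// channel and forwards only first sightings (a simple seen-set filter).
func txProcessor(in <-chan string, out chan<- string) {
	seen := make(map[string]bool)
	for hash := range in {
		if seen[hash] {
			continue // already processed this tx
		}
		seen[hash] = true
		out <- hash
	}
	close(out)
}

func main() {
	in := make(chan string)
	out := make(chan string)
	go txProcessor(in, out)

	// two "node connections" feeding the same channel
	var wg sync.WaitGroup
	for _, txs := range [][]string{{"0xa", "0xb"}, {"0xb", "0xc"}} {
		wg.Add(1)
		go func(txs []string) {
			defer wg.Done()
			for _, h := range txs {
				in <- h
			}
		}(txs)
	}
	go func() { wg.Wait(); close(in) }()

	for h := range out {
		fmt.Println(h) // each unique hash exactly once
	}
}
```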
Merger
- Uses https://github.com/xitongsys/parquet-go to write Parquet format
Transaction RLP format
- Transactions are encoded as typed EIP-2718 envelopes: `TransactionType || TransactionPayload`
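In an EIP-2718 envelope the type byte is in the range 0x00-0x7f, while a legacy transaction is a plain RLP list whose first byte is >= 0xc0. A small sketch distinguishing the two from raw bytes (the helper name is hypothetical):

```go
package main

import "fmt"

// isTypedTx reports whether raw tx bytes are an EIP-2718 typed envelope
// (first byte in 0x00..0x7f) rather than a legacy RLP list (>= 0xc0).
func isTypedTx(raw []byte) bool {
	return len(raw) > 0 && raw[0] <= 0x7f
}

func main() {
	fmt.Println(isTypedTx([]byte{0x02, 0xf8})) // EIP-1559 tx (type 0x02): true
	fmt.Println(isTypedTx([]byte{0xf8, 0x6b})) // legacy RLP list: false
}
```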
Contributing
Install dependencies
go install mvdan.cc/gofumpt@latest
go install honnef.co/go/tools/cmd/staticcheck@latest
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
go install github.com/daixiang0/gci@latest
Lint, test, format
make lint
make test
make fmt
Further notes
- See also: discussion about compression and storage
License
MIT