# Mempool Dumpster 🗑️♻️

Dump mempool transactions from EL nodes, and archive them in Parquet and CSV format.
Notes:

- The data is freely available at https://mempool-dumpster.flashbots.net
- This project is under active development; it is relatively stable and ready for production use
- We observe about 1M to 1.5M unique transactions per day
## Available mempool sources

- Generic EL nodes (`newPendingTransactions`), e.g. go-ethereum, Infura, etc.
- Alchemy (`alchemy_pendingTransactions`)
- bloXroute (requires at least the "Professional" plan)
- Chainbound Fiber
- Eden
## Output files

Daily files uploaded by mempool-dumpster (e.g. for September 2023):

- Parquet file with transaction metadata and raw transactions (~800MB/day, e.g. `2023-09-08.parquet`)
- CSV file with only the transaction metadata (~100MB/day zipped, e.g. `2023-09-08.csv.zip`)
- CSV file with details about when each transaction was received by each source (~100MB/day zipped, e.g. `2023-09-08_sourcelog.csv.zip`)
- Summary in text format (~2kB, e.g. `2023-09-08_summary.txt`)
## FAQ

- What is a-pool? ... A-Pool is a regular geth node with optimized peering settings, which we subscribe to over the network.
- What are exclusive transactions? ... An exclusive transaction is one seen from only a single source (provided by no other source).
## Working with Parquet

Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

We recommend using `clickhouse local` to work with Parquet files; it makes it easy to run queries like:
```bash
# count rows
$ clickhouse local -q 'select count(*) from "transactions.parquet" limit 1;'

# get the first hash+rawTx
$ clickhouse local -q 'select hash, hex(rawTx) from "transactions.parquet" limit 1;'

# show the schema
$ clickhouse local -q 'describe table "transactions.parquet";'
timestamp    Nullable(DateTime64(3))
hash         Nullable(String)
chainId      Nullable(String)
from         Nullable(String)
to           Nullable(String)
value        Nullable(String)
nonce        Nullable(String)
gas          Nullable(String)
gasPrice     Nullable(String)
gasTipCap    Nullable(String)
gasFeeCap    Nullable(String)
dataSize     Nullable(Int64)
data4Bytes   Nullable(String)
rawTx        Nullable(String)
```
## System architecture

- Collector: Connects to EL nodes and writes new mempool transactions and sourcelog to hourly CSV files. Multiple collector instances can run without colliding.
- Merger: Takes collector CSV files as input, deduplicates, sorts by timestamp, and writes the CSV + Parquet output files.
- Analyzer: Analyzes sourcelog CSV files and produces a summary report.
- Website: Website dev-mode as well as build + upload.
## Getting started

### Mempool Collector
- Subscribes to new pending transactions at various data sources
- Writes `timestamp_ms` + `hash` + `raw_tx` to CSV files (one file per hour by default)
- Note: the collector may store a given transaction multiple times; only the merger properly deduplicates them later
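The per-transaction row format described above can be sketched with the standard library alone. The helper name `csvRow` and the placeholder values are illustrative, not taken from the actual collector code:

```go
package main

import (
	"encoding/csv"
	"os"
	"strconv"
	"time"
)

// csvRow renders one collector record as the three CSV columns
// described above: timestamp_ms, hash, raw_tx.
func csvRow(timestampMs int64, hash, rawTx string) []string {
	return []string{strconv.FormatInt(timestampMs, 10), hash, rawTx}
}

func main() {
	w := csv.NewWriter(os.Stdout)
	// Placeholder values -- a real collector fills these from a
	// newPendingTransactions subscription.
	_ = w.Write(csvRow(time.Now().UnixMilli(), "0xabc123", "0x02f87001"))
	w.Flush()
}
```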
Default filenames:

Transactions:

- Schema: `<out_dir>/<date>/transactions/txs_<date>_<uid>.csv`
- Example: `out/2023-08-07/transactions/txs_2023-08-07-10-00_collector1.csv`

Sourcelog:

- Schema: `<out_dir>/<date>/sourcelog/src_<date>_<uid>.csv`
- Example: `out/2023-08-07/sourcelog/src_2023-08-07-10-00_collector1.csv`
Running the mempool collector:

```bash
# print help
go run cmd/collect/main.go -help

# Connect to ws://localhost:8546 and write CSVs into ./out
go run cmd/collect/main.go -out ./out

# Connect to multiple nodes
go run cmd/collect/main.go -out ./out -nodes ws://server1.com:8546,ws://server2.com:8546
```
### Merger

- Iterates over the collector output directory and its CSV files
- Deduplicates transactions and sorts them by timestamp

```bash
go run cmd/merge/main.go -h
```
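The dedup-and-sort step can be sketched as follows. The record shape and function name are illustrative (the real merger reads collector CSVs); the policy shown here, keeping the earliest sighting per hash, is an assumption consistent with the timestamped, multi-collector input described above:

```go
package main

import (
	"fmt"
	"sort"
)

// mergeRecord is an illustrative stand-in for one collector CSV row.
type mergeRecord struct {
	TimestampMs int64
	Hash        string
}

// dedupeAndSort keeps the earliest sighting of each transaction hash
// and returns the result ordered by timestamp.
func dedupeAndSort(in []mergeRecord) []mergeRecord {
	earliest := make(map[string]mergeRecord)
	for _, r := range in {
		if prev, ok := earliest[r.Hash]; !ok || r.TimestampMs < prev.TimestampMs {
			earliest[r.Hash] = r
		}
	}
	out := make([]mergeRecord, 0, len(earliest))
	for _, r := range earliest {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].TimestampMs < out[j].TimestampMs })
	return out
}

func main() {
	merged := dedupeAndSort([]mergeRecord{
		{1500, "0xbb"}, {1000, "0xaa"}, {1200, "0xaa"}, // 0xaa seen by two collectors
	})
	fmt.Println(merged) // earliest sighting per hash, sorted by time
}
```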
## Architecture

### General design goals

- Keep it simple and stupid
- Vendor-agnostic (the main flow should work on any server, independent of a cloud provider)
- Downtime-resilient, to minimize any gaps in the archive
- Multiple collector instances can run concurrently without getting in each other's way
- The merger produces the final archive (based on the input of multiple collector outputs)
- The final archive:
  - Includes (1) a Parquet file with transaction metadata, and (2) a compressed file of raw-transaction CSV files
  - Is compatible with ClickHouse and S3 Select (Parquet using gzip compression)
  - Is easily distributable as a torrent
### Collector

NodeConnection:

- One for each EL connection
- New pending transactions are sent to `TxProcessor` via a channel

TxProcessor:

- Checks whether it has already processed a given transaction
- Stores it in the output directory
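The fan-in described above — several NodeConnections feeding one TxProcessor over a channel, with duplicate hashes dropped — can be sketched like this. The types and function names are made up for the sketch and are not the project's actual identifiers:

```go
package main

import "fmt"

// pendingTx is an illustrative stand-in for a transaction received
// from one NodeConnection.
type pendingTx struct {
	Hash  string
	RawTx []byte
}

// runTxProcessor drains the channel the NodeConnections write to,
// skips hashes it has already seen, and returns how many new
// transactions it "stored" (here just counted; the real processor
// writes them to the hourly output CSV).
func runTxProcessor(txC <-chan pendingTx) int {
	seen := make(map[string]struct{})
	stored := 0
	for tx := range txC {
		if _, ok := seen[tx.Hash]; ok {
			continue // already processed this hash
		}
		seen[tx.Hash] = struct{}{}
		stored++
	}
	return stored
}

func main() {
	txC := make(chan pendingTx, 3)
	txC <- pendingTx{Hash: "0xaa"}
	txC <- pendingTx{Hash: "0xaa"} // duplicate from a second node
	txC <- pendingTx{Hash: "0xbb"}
	close(txC)
	fmt.Println(runTxProcessor(txC)) // 2
}
```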
### Merger

- Uses https://github.com/xitongsys/parquet-go to write the Parquet format
### Transaction RLP format

- Transactions are encoded in typed EIP-2718 envelopes
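Per EIP-2718, a typed envelope is `TransactionType || TransactionPayload`, where the type byte is in `[0x00, 0x7f]`; a legacy RLP-encoded transaction instead starts with an RLP list prefix (`>= 0xc0`). A minimal sketch that classifies a raw tx by peeking at its first byte (full decoding would use go-ethereum's `types.Transaction.UnmarshalBinary`; `txTypeName` is a hypothetical helper):

```go
package main

import "fmt"

// txTypeName classifies a raw transaction by its EIP-2718 envelope:
// typed txs carry a leading type byte in [0x00, 0x7f], while legacy
// txs are bare RLP lists whose first byte is >= 0xc0.
func txTypeName(rawTx []byte) string {
	if len(rawTx) == 0 {
		return "empty"
	}
	switch b := rawTx[0]; {
	case b >= 0xc0:
		return "legacy (untyped RLP)"
	case b == 0x01:
		return "EIP-2930 access list"
	case b == 0x02:
		return "EIP-1559 dynamic fee"
	case b == 0x03:
		return "EIP-4844 blob"
	default:
		return fmt.Sprintf("typed (0x%02x)", b)
	}
}

func main() {
	fmt.Println(txTypeName([]byte{0x02, 0xf8})) // EIP-1559 dynamic fee
	fmt.Println(txTypeName([]byte{0xf8, 0x6c})) // legacy (untyped RLP)
}
```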
## Contributing

Install dependencies:

```bash
go install mvdan.cc/gofumpt@latest
go install honnef.co/go/tools/cmd/staticcheck@latest
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
go install github.com/daixiang0/gci@latest
```

Lint, test, format:

```bash
make lint
make test
make fmt
```
## Further notes

- See also: discussion about compression and storage

## License

MIT