
Mempool Dumpster 🗑️♻️


Dump mempool transactions from EL nodes, and archive them in Parquet and CSV format.

The data is freely available at https://mempool-dumpster.flashbots.net

Output files (CSV row schemas sketched below):

  1. Raw transactions CSV (timestamp_ms, tx_hash, rlp_hex; about 800MB/day zipped)
  2. Sourcelog CSV - list of received transactions by any source (timestamp_ms, hash, source; about 100MB/day zipped)
  3. Transaction metadata in CSV and Parquet format (~100MB/day zipped)
  4. Summary file with information about transaction sources and latency (example)
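
For reference, a minimal sketch of the first two row schemas as Go structs (field names mirror the columns listed above; these are illustrative types, not the project's actual ones):

// Raw transactions CSV row: timestamp_ms, tx_hash, rlp_hex
type RawTxRow struct {
	TimestampMs int64  // unix milliseconds when the tx was first seen
	TxHash      string // transaction hash
	RlpHex      string // raw transaction as RLP-encoded hex
}

// Sourcelog CSV row: timestamp_ms, hash, source
type SourcelogRow struct {
	TimestampMs int64
	Hash        string
	Source      string // which mempool source delivered the tx
}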

Available mempool sources:

  1. Generic EL nodes (newPendingTransactions), e.g. go-ethereum, Infura, etc. (see the subscription sketch after this list)
  2. Alchemy (alchemy_pendingTransactions)
  3. bloXroute (at least "Professional" plan)
  4. Chainbound Fiber
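
For the generic EL-node source, here is a hedged subscription sketch using go-ethereum's gethclient (the websocket URL is an example; this is not the collector's actual code):

package main

import (
	"context"
	"log"

	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethclient/gethclient"
	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	// Connect to an EL node over websocket (example endpoint).
	rpcClient, err := rpc.Dial("ws://localhost:8546")
	if err != nil {
		log.Fatal(err)
	}
	client := gethclient.New(rpcClient)

	// Subscribe to full pending transactions; the node must support the
	// newPendingTransactions subscription with full tx bodies.
	txCh := make(chan *types.Transaction)
	sub, err := client.SubscribeFullPendingTransactions(context.Background(), txCh)
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Unsubscribe()

	for tx := range txCh {
		log.Printf("new pending tx: %s", tx.Hash().Hex())
	}
}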

Notes:

  • This project is under active development, but it is relatively stable and ready for production use
  • We currently observe about 1M-1.5M unique transactions per day

FAQ

  • What is a-pool? ... A-Pool is a regular geth node with some optimized peering settings, which the collector subscribes to over the network.
  • What are exclusive transactions? ... An "exclusive transaction" is one that was received from only a single source (i.e. seen from no other source).

System architecture

  1. Collector: Connects to EL nodes and writes new mempool transactions to CSV files. Multiple collector instances can run without colliding.
  2. Merger: Takes collector CSV files as input, de-duplicates, sorts by timestamp and writes CSV + Parquet output files.
  3. Analyzer: Analyzes sourcelog CSV files and produces a summary report.
  4. Website: Runs the website in dev mode, and builds + uploads it.

Getting started

Mempool Collector

  1. Subscribes to new pending transactions at various data sources
  2. Writes timestamp_ms + hash + raw_tx to CSV file (one file per hour by default)
  3. Note: the collector may store the same transaction multiple times; deduplication happens later, in the merger

Default filenames (see the path-construction sketch below):

Transactions

  • Schema: <out_dir>/<date>/transactions/txs_<date>_<uid>.csv
  • Example: out/2023-08-07/transactions/txs_2023-08-07-10-00_collector1.csv

Sourcelog

  • Schema: <out_dir>/<date>/sourcelog/src_<date>_<uid>.csv
  • Example: out/2023-08-07/sourcelog/src_2023-08-07-10-00_collector1.csv
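
A hedged sketch of how such a path can be constructed (helper name and layout are illustrative, not the collector's actual code):

package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// txCSVPath builds an hourly transactions CSV path following the schema
// <out_dir>/<date>/transactions/txs_<date>_<uid>.csv shown above.
func txCSVPath(outDir, uid string, t time.Time) string {
	day := t.Format("2006-01-02")
	hour := t.Format("2006-01-02-15") + "-00" // e.g. 2023-08-07-10-00
	return filepath.Join(outDir, day, "transactions", fmt.Sprintf("txs_%s_%s.csv", hour, uid))
}

func main() {
	fmt.Println(txCSVPath("out", "collector1", time.Now()))
}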

Running the mempool collector:

# print help
go run cmd/collector/main.go -help

# Connect to ws://localhost:8546 and write CSVs into ./out
go run cmd/collector/main.go -out ./out

# Connect to multiple nodes
go run cmd/collector/main.go -out ./out -nodes ws://server1.com:8546,ws://server2.com:8546

Merger

  • Iterates over collector output directory / CSV files
  • Deduplicates transactions and sorts them by timestamp (a conceptual sketch follows below)

go run cmd/merge/main.go -h
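
A conceptual sketch of the dedup + sort step (type and function names are illustrative, not the merger's actual code):

package merger

import "sort"

type txRow struct {
	TimestampMs int64
	Hash        string
	RawTx       string
}

// mergeRows keeps the earliest sighting of each tx hash and returns the
// remaining rows sorted by timestamp.
func mergeRows(rows []txRow) []txRow {
	earliest := make(map[string]txRow)
	for _, r := range rows {
		if prev, ok := earliest[r.Hash]; !ok || r.TimestampMs < prev.TimestampMs {
			earliest[r.Hash] = r
		}
	}
	merged := make([]txRow, 0, len(earliest))
	for _, r := range earliest {
		merged = append(merged, r)
	}
	sort.Slice(merged, func(i, j int) bool {
		return merged[i].TimestampMs < merged[j].TimestampMs
	})
	return merged
}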

Architecture

General design goals

  • Keep it simple and stupid
  • Vendor-agnostic (main flow should work on any server, independent of a cloud provider)
  • Downtime-resilience to minimize any gaps in the archive
  • Multiple collector instances can run concurrently without getting in each other's way
  • Merger produces the final archive (based on the input of multiple collector outputs)
  • The final archive:
    • Includes (1) a Parquet file with transaction metadata, and (2) a compressed file of the raw transaction CSVs
    • Compatible with Clickhouse and S3 Select (Parquet using gzip compression)
    • Easily distributable as torrent

Collector

  • NodeConnection
    • One for each EL connection
    • New pending transactions are sent to TxProcessor via a channel
  • TxProcessor
    • Checks whether it has already processed a given tx
    • Stores new transactions in the output directory (a rough sketch follows below)
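
A rough sketch of that flow (channel fan-in; names are illustrative, not the actual types):

package collector

import "github.com/ethereum/go-ethereum/core/types"

// txProcessor receives transactions from all node connections over a single
// channel, skips hashes it has already seen, and hands new ones to storage.
func txProcessor(txCh <-chan *types.Transaction, store func(*types.Transaction)) {
	seen := make(map[string]struct{})
	for tx := range txCh {
		h := tx.Hash().Hex()
		if _, ok := seen[h]; ok {
			continue // already processed, skip
		}
		seen[h] = struct{}{}
		store(tx) // e.g. append to the current hourly CSV
	}
}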

Merger

Transaction RLP format

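The raw transactions CSV stores each transaction RLP-encoded as hex (the rlp_hex column). A hedged decoding sketch using go-ethereum, reading the hex string from the command line:

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/ethereum/go-ethereum/common/hexutil"
	"github.com/ethereum/go-ethereum/core/types"
)

func main() {
	// Pass an rlp_hex value from the raw transactions CSV as the first argument.
	raw, err := hexutil.Decode(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	var tx types.Transaction
	// UnmarshalBinary handles both legacy RLP txs and typed (EIP-2718) envelopes.
	if err := tx.UnmarshalBinary(raw); err != nil {
		log.Fatal(err)
	}
	fmt.Println("hash:", tx.Hash().Hex(), "nonce:", tx.Nonce(), "gas:", tx.Gas())
}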

Contributing

Install dependencies

go install mvdan.cc/gofumpt@latest
go install honnef.co/go/tools/cmd/staticcheck@latest
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
go install github.com/daixiang0/gci@latest

Lint, test, format

make lint
make test
make fmt

Further notes


License

MIT


Maintainers