### blocks.csv
Column | Type |
---|---|
block_number | bigint |
block_hash | hex_string |
block_parent_hash | hex_string |
block_nonce | hex_string |
block_sha3_uncles | hex_string |
block_logs_bloom | hex_string |
block_transactions_root | hex_string |
block_state_root | hex_string |
block_miner | hex_string |
block_difficulty | bigint |
block_total_difficulty | bigint |
block_size | bigint |
block_extra_data | hex_string |
block_gas_limit | bigint |
block_gas_used | bigint |
block_timestamp | bigint |
block_transaction_count | bigint |
### transactions.csv
Column | Type |
---|---|
tx_hash | hex_string |
tx_nonce | bigint |
tx_block_hash | hex_string |
tx_block_number | bigint |
tx_index | bigint |
tx_from | hex_string |
tx_to | hex_string |
tx_value | bigint |
tx_gas | bigint |
tx_gas_price | bigint |
tx_input | hex_string |
### erc20_transfers.csv
Column | Type |
---|---|
erc20_token | hex_string |
erc20_from | hex_string |
erc20_to | hex_string |
erc20_value | bigint |
erc20_tx_hash | hex_string |
erc20_block_number | bigint |
Run in the terminal:

```bash
> pip install typing Scrapy
> scrapy runspider ethscraper/spiders/eth_json_rpc_spider.py \
    -s ETH_JSON_RPC_URL=https://mainnet.infura.io/<your_api_key> \
    -s START_BLOCK=0 \
    -s END_BLOCK=1000000 \
    -s FEED_FORMAT=csv
```
The output will be in `blocks.csv`, `transactions.csv`, and `erc20_transfers.csv` in the current directory.
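For downstream analysis, the exported files can be loaded with pandas. This is a minimal sketch, assuming pandas is installed separately (it is not a dependency of the spider); the `hex_string` columns are read as plain strings so hashes and addresses are not mangled, and `tx_value` is read as a string because wei amounts can exceed int64:

```python
import pandas as pd  # assumed helper dependency, not required by the spider

# Read hex_string columns (hashes, addresses, raw input) as plain strings.
blocks = pd.read_csv(
    "blocks.csv",
    dtype={"block_hash": str, "block_parent_hash": str, "block_miner": str},
)
transactions = pd.read_csv(
    "transactions.csv",
    # tx_value is denominated in wei and can overflow int64, so keep it a string.
    dtype={"tx_hash": str, "tx_from": str, "tx_to": str, "tx_input": str, "tx_value": str},
)
transfers = pd.read_csv(
    "erc20_transfers.csv",
    dtype={"erc20_token": str, "erc20_from": str, "erc20_to": str, "erc20_tx_hash": str},
)

# Example: attach block timestamps to transactions via tx_block_number.
txs = transactions.merge(
    blocks[["block_number", "block_timestamp"]],
    left_on="tx_block_number",
    right_on="block_number",
    how="left",
)
print(txs[["tx_hash", "tx_value", "block_timestamp"]].head())
```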
The spider accepts the following settings (passed with `-s`, as in the command above):

- `ETH_JSON_RPC_URL`: The Ethereum node JSON RPC URL. If running a local geth node, start it with the `--rpc` option (`geth --rpc --rpcapi eth`), then use `ETH_JSON_RPC_URL=http://localhost:8545`. A quick endpoint check is sketched after this list.
- `START_BLOCK`, `END_BLOCK`: Integers representing the start and end blocks for scraping, inclusive.
- `FEED_FORMAT`: Output format. The output files will have the corresponding extension. Supported formats are `csv`, `xml`, `json`, `jsonlines`, `pickle`, and `marshal`.
- Whether to export the `transactions.csv` file. Possible values: `True`, `False`.
- Whether to export the `erc20_transfers.csv` file. Possible values: `True`, `False`.
- `CONCURRENT_REQUESTS`: The number of concurrent requests. Default is `20`.
- `RETRY_TIMES`: How many times to retry a request when an error is encountered. Default is `10`.
Retrieving internal transactions requires transaction tracing. Since that is potentially a very long-running operation (hours) and can also produce huge amounts of data, an IPC subscription should be used instead of RPC. An example is given in PR ethereum/go-ethereum#15516:
```
$ nc -U /work/temp/rinkeby/geth.ipc
{"id": 1, "method": "debug_subscribe", "params": ["traceChain", "0x0", "0xffff", {"tracer": "callTracer"}]}
```
The API will stream back one RPC notification per non-empty block. An exception is the very last block, which will be reported even if empty so the user knows the stream is done.
{"jsonrpc":"2.0","id":1,"result":"0xe1deecc4b399e5fd2b2a8abbbc4624e2"}
{"jsonrpc":"2.0","method":"debug_subscription","params":{"subscription":"0xe1deecc4b399e5fd2b2a8abbbc4624e2","result":{"block":"0x37","hash":"0xdb16f0d4465f2fd79f10ba539b169404a3e026db1be082e7fd6071b4c5f37db7","traces":[{"from":"0x31b98d14007bdee637298086988a0bbd31184523","gas":"0x0","gasUsed":"0x0","input":"0x","output":"0x","time":"1.077µs","to":"0x2ed530faddb7349c1efdbf4410db2de835a004e4","type":"CALL","value":"0xde0b6b3a7640000"}]}}}
{"jsonrpc":"2.0","method":"debug_subscription","params":{"subscription":"0xe1deecc4b399e5fd2b2a8abbbc4624e2","result":{"block":"0xf43","hash":"0xacb74aa08838896ad60319bce6e07c92edb2f5253080eb3883549ed8f57ea679","traces":[{"from":"0x31b98d14007bdee637298086988a0bbd31184523","gas":"0x0","gasUsed":"0x0","input":"0x","output":"0x","time":"1.568µs","to":"0xbedcf417ff2752d996d2ade98b97a6f0bef4beb9","type":"CALL","value":"0xde0b6b3a7640000"}]}}}
{"jsonrpc":"2.0","method":"debug_subscription","params":{"subscription":"0xe1deecc4b399e5fd2b2a8abbbc4624e2","result":{"block":"0xf47","hash":"0xea841221179e37ca9cc23424b64201d8805df327c3296a513e9f1fe6faa5ffb3","traces":[{"from":"0xbedcf417ff2752d996d2ade98b97a6f0bef4beb9","gas":"0x4687a0","gasUsed":"0x12e0d","input":"0x6060604052341561000c57fe5b5b6101828061001c6000396000f30060606040526000357c0100000000000000000000000000000000000000000000000000000000900463ffffffff168063230925601461003b575bfe5b341561004357fe5b61008360048080356000191690602001909190803560ff1690602001909190803560001916906020019091908035600019169060200190919050506100c5565b604051808273ffffffffffffffffffffffffffffffffffffffff1673ffffffffffffffffffffffffffffffffffffffff16815260200191505060405180910390f35b6000600185858585604051806000526020016040526000604051602001526040518085600019166000191681526020018460ff1660ff1681526020018360001916600019168152602001826000191660001916815260200194505050505060206040516020810390808403906000866161da5a03f1151561014257fe5b50506020604051035190505b9493505050505600a165627a7a7230582054abc8e7b2d8ea0972823aa9f0df23ecb80ca0b58be9f31b7348d411aaf585be0029","output":"0x60606040526000357c0100000000000000000000000000000000000000000000000000000000900463ffffffff168063230925601461003b575bfe5b341561004357fe5b61008360048080356000191690602001909190803560ff1690602001909190803560001916906020019091908035600019169060200190919050506100c5565b604051808273ffffffffffffffffffffffffffffffffffffffff1673ffffffffffffffffffffffffffffffffffffffff16815260200191505060405180910390f35b6000600185858585604051806000526020016040526000604051602001526040518085600019166000191681526020018460ff1660ff1681526020018360001916600019168152602001826000191660001916815260200194505050505060206040516020810390808403906000866161da5a03f1151561014257fe5b50506020604051035190505b9493505050505600a165627a7a7230582054abc8e7b2d8ea0972823aa9f0df23ecb80ca0b58be9f31b7348d411aaf585be0029","time":"658.529µs","to":"0x5481c0fe170641bd2e0ff7f04161871829c1902d","type":"CREATE","value":"0x0"}]}}}
{"jsonrpc":"2.0","method":"debug_subscription","params":{"subscription":"0xe1deecc4b399e5fd2b2a8abbbc4624e2","result":{"block":"0xfff","hash":"0x254ccbc40eeeb183d8da11cf4908529f45d813ef8eefd0fbf8a024317561ac6b"}}}
Tracing an individual block processes its transactions concurrently (limited to the number of cores), which also makes chain tracing concurrent across blocks (likewise limited to the number of cores).
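The same subscription can be driven from Python over the IPC socket. This is a minimal sketch under the same assumptions as the `nc` example above: the socket path is taken from that example (adjust for your node), and the framing assumes one JSON object per line, as in the output shown:

```python
import json
import socket

# Connect to the geth IPC socket (path from the nc example above).
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect("/work/temp/rinkeby/geth.ipc")

# Subscribe to traceChain for blocks 0x0..0xffff with the callTracer.
request = {
    "id": 1,
    "method": "debug_subscribe",
    "params": ["traceChain", "0x0", "0xffff", {"tracer": "callTracer"}],
}
sock.sendall((json.dumps(request) + "\n").encode())

# Read newline-delimited JSON notifications, one per non-empty block.
buffer = b""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break  # node closed the connection
    buffer += chunk
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        if not line.strip():
            continue
        message = json.loads(line)
        if "params" not in message:
            continue  # the initial ack carrying the subscription id
        result = message["params"]["result"]
        # Empty blocks (like the final one) carry no "traces" key.
        print(result["block"], len(result.get("traces", [])))
```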
To scrape contract bytecode and Solidity code from Etherscan:

```bash
> pip install Scrapy
> scrapy runspider ethscraper/spiders/etherscan_contract_spider.py -o data.csv
```
Note that Cloudflare will block your machine after a few thousand requests. Be aware that web scraping is generally considered bad practice, and this can break without notice, since it relies on how the frontend is rendered.
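If you still want to experiment, Scrapy's built-in throttling settings can space out requests. These are standard Scrapy settings (`AUTOTHROTTLE_ENABLED`, `DOWNLOAD_DELAY`), not anything specific to this spider, and the values below are only illustrative:

```bash
> scrapy runspider ethscraper/spiders/etherscan_contract_spider.py -o data.csv \
    -s AUTOTHROTTLE_ENABLED=True \
    -s DOWNLOAD_DELAY=1
```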