Ethereum ETL
Export blocks and transactions (Schema, Reference):
> python export_blocks_and_transactions.py --start-block 0 --end-block 500000 \
--provider-uri https://mainnet.infura.io --blocks-output blocks.csv --transactions-output transactions.csv
Export ERC20 and ERC721 transfers (Schema, Reference):
> python export_token_transfers.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv
Export receipts and logs (Schema, Reference):
> python export_receipts_and_logs.py --transaction-hashes transaction_hashes.txt \
--provider-uri https://mainnet.infura.io --receipts-output receipts.csv --logs-output logs.csv
Export ERC20 and ERC721 token details (Schema, Reference):
> python export_tokens.py --token-addresses token_addresses.csv \
--provider-uri https://mainnet.infura.io --output tokens.csv
Export traces (Schema, Reference):
> python export_traces.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/parity.ipc --output traces.csv
Export geth traces (Reference):
> python export_geth_traces.py --start-block 1 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output geth_traces.json
Extract geth traces (Schema, Reference):
> python extract_geth_traces.py --input geth_traces.json --output traces.csv
Schema
blocks.csv
Column | Type |
---|---|
number | bigint |
hash | hex_string |
parent_hash | hex_string |
nonce | hex_string |
sha3_uncles | hex_string |
logs_bloom | hex_string |
transactions_root | hex_string |
state_root | hex_string |
receipts_root | hex_string |
miner | address |
difficulty | numeric |
total_difficulty | numeric |
size | bigint |
extra_data | hex_string |
gas_limit | bigint |
gas_used | bigint |
timestamp | bigint |
transaction_count | bigint |
transactions.csv
Column | Type |
---|---|
hash | hex_string |
nonce | bigint |
block_hash | hex_string |
block_number | bigint |
transaction_index | bigint |
from_address | address |
to_address | address |
value | numeric |
gas | bigint |
gas_price | bigint |
input | hex_string |
token_transfers.csv
Column | Type |
---|---|
token_address | address |
from_address | address |
to_address | address |
value | numeric |
transaction_hash | hex_string |
log_index | bigint |
block_number | bigint |
receipts.csv
Column | Type |
---|---|
transaction_hash | hex_string |
transaction_index | bigint |
block_hash | hex_string |
block_number | bigint |
cumulative_gas_used | bigint |
gas_used | bigint |
contract_address | address |
root | hex_string |
status | bigint |
logs.csv
Column | Type |
---|---|
log_index | bigint |
transaction_hash | hex_string |
transaction_index | bigint |
block_hash | hex_string |
block_number | bigint |
address | address |
data | hex_string |
topics | string |
contracts.csv
Column | Type |
---|---|
address | address |
bytecode | hex_string |
function_sighashes | string |
is_erc20 | boolean |
is_erc721 | boolean |
tokens.csv
Column | Type |
---|---|
address | address |
symbol | string |
name | string |
decimals | bigint |
total_supply | numeric |
traces.csv
Column | Type |
---|---|
block_number | bigint |
transaction_hash | hex_string |
transaction_index | bigint |
from_address | address |
to_address | address |
value | numeric |
contract_address | address |
input | hex_string |
output | hex_string |
trace_type | string |
call_type | string |
reward_type | string |
gas | bigint |
gas_used | bigint |
subtraces | bigint |
trace_address | string |
error | string |
You can find column descriptions in https://github.com/medvedev1088/ethereum-etl-airflow

Note: for the `address` type all hex characters are lower-cased. The `boolean` type can have 2 values: `True` or `False`.
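Because addresses in the files are lower-cased, normalize any externally sourced (e.g. mixed-case checksummed) address before joining or filtering on it. A minimal sketch, assuming you are filtering `token_transfers.csv` with Python's standard library:

```python
import csv

# Addresses in the exported CSVs are all lower-cased, so lower-case
# any externally sourced (e.g. checksummed) address before comparing.
token = "0x06012c8cf97BEaD5deAe237070F9587f8E7A266d".lower()  # hypothetical mixed-case input

with open("token_transfers.csv") as f:
    rows = [r for r in csv.DictReader(f) if r["token_address"] == token]
```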
LIMITATIONS

- `contracts.csv` and `tokens.csv` files don't include contracts created by message calls (a.k.a. internal transactions). We are working on adding support for those.
- In case the contract is a proxy, which forwards all calls to a delegate, interface detection doesn't work, which means `is_erc20` and `is_erc721` will always be false for proxy contracts.
- The metadata methods (`symbol`, `name`, `decimals`, `total_supply`) for ERC20 are optional, so around 10% of the contracts are missing this data. Also, some contracts (e.g. EOS) implement these methods but with the wrong return type, so the metadata columns are missing in this case as well.
- `token_transfers.value`, `tokens.decimals` and `tokens.total_supply` have type `STRING` in BigQuery tables, because the numeric types there can't handle 32-byte integers. You should use `cast(value as FLOAT64)` (possible loss of precision) or `safe_cast(value as NUMERIC)` (possible overflow) to convert them to numbers.
- Contracts that don't implement the `decimals()` function but have a fallback function that returns a `boolean` will have `0` or `1` in the `decimals` column in the CSVs.
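The same 32-byte values also overflow fixed-width numeric types in most client-side tools, so when loading the CSVs yourself it is safest to parse these columns as arbitrary-precision integers. A minimal sketch in Python (file and column names as in the schema above):

```python
import csv

# token_transfers.value can be as large as 2**256 - 1 (78 decimal digits),
# so parse it with Python's arbitrary-precision int, never with float.
with open("token_transfers.csv") as f:
    for row in csv.DictReader(f):
        value = int(row["value"])
```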
Differences between geth and parity traces.csv

- the `to_address` field differs for `callcode` traces (geth seems to return the correct value, as parity's `to_address` is the same as the `to_address` of the parent call);
- geth output doesn't have `reward` traces;
- geth output doesn't have `to_address`, `from_address`, `value` for `suicide` traces;
- the `error` field contains a human-readable error message, which might differ between geth and parity output;
- geth output doesn't have `transaction_hash`;
- `gas_used` is 0 on traces with an error in geth, empty in parity;
- zero output of subcalls is `0x000...` in geth, `0x` in parity.
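If you need to compare or merge traces produced by the two clients, normalize the fields above first. A rough sketch; these rules are assumptions derived from the list, not part of the tool:

```python
def normalize_trace(row):
    # Make a traces.csv row from geth comparable to one from parity.
    # These rules are assumptions derived from the differences above.
    row = dict(row)
    # geth reports gas_used = 0 on errored traces, parity leaves it empty
    if row.get("error") and row.get("gas_used") == "0":
        row["gas_used"] = ""
    # zero output of subcalls: geth writes 0x000..., parity writes 0x
    out = row.get("output", "")
    if out.startswith("0x") and set(out[2:]) <= {"0"}:
        row["output"] = "0x"
    # geth output has no transaction_hash, so drop it before comparing
    row.pop("transaction_hash", None)
    return row
```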
Exporting the Blockchain

1. Install Python 3.6 https://www.python.org/downloads/ (3.5 and 3.7 are not supported by this tool for now)

2. You can use Infura if you don't need ERC20 transfers (Infura doesn't support the eth_getFilterLogs JSON RPC method). For that, use the `-p https://mainnet.infura.io` option for the commands below. If you need ERC20 transfers or want to export the data ~40 times faster, you will need to set up a local Ethereum node:

3. Install geth https://github.com/ethereum/go-ethereum/wiki/Installing-Geth

4. Start geth. Make sure it has downloaded the blocks that you need by executing `eth.syncing` in the JS console. You can export blocks below `currentBlock`; there is no need to wait for the full sync, as the state is not needed (unless you also need contract bytecode and token details; for those you need to wait for the full sync).

5. Clone Ethereum ETL and install the dependencies:

> git clone https://github.com/medvedev1088/ethereum-etl.git
> cd ethereum-etl
> pip install -r requirements.txt

6. Export all:

> ./export_all.sh -h
Usage: ./export_all.sh -s <start_block> -e <end_block> -b <batch_size> -p <provider_uri> [-o <output_dir>]

> ./export_all.sh -s 0 -e 5499999 -b 100000 -p file://$HOME/Library/Ethereum/geth.ipc -o output

The result will be in the `output` subdirectory, partitioned in Hive style:

output/blocks/start_block=00000000/end_block=00099999/blocks_00000000_00099999.csv
output/blocks/start_block=00100000/end_block=00199999/blocks_00100000_00199999.csv
...
output/transactions/start_block=00000000/end_block=00099999/transactions_00000000_00099999.csv
...
output/token_transfers/start_block=00000000/end_block=00099999/token_transfers_00000000_00099999.csv
...
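The `start_block=...`/`end_block=...` directory names are plain Hive-style partition keys, so most query engines can consume the tree directly; reading it from Python takes only a glob. A minimal sketch (paths follow the layout above):

```python
import csv
import glob

# Collect all partitioned block files in order and iterate over their rows.
for path in sorted(glob.glob("output/blocks/*/*/blocks_*.csv")):
    with open(path) as f:
        for row in csv.DictReader(f):
            pass  # process the row
```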
Should work with geth and parity, on Linux, Mac, and Windows.

If you use Parity, you should disable warp mode with the `--no-warp` option, because warp mode does not place all of the block or receipt data into the database https://wiki.parity.io/Getting-Synced

Tested with Python 3.6, geth 1.8.7, Ubuntu 16.04.4.

If you see weird behavior, e.g. the wrong number of rows in the CSV files or corrupted files, check this issue: blockchain-etl#28
Export in 2 Hours
You can use AWS Auto Scaling and Data Pipeline to reduce the exporting time to a few hours. Read this article for details https://medium.com/@medvedev1088/how-to-export-the-entire-ethereum-blockchain-to-csv-in-2-hours-for-10-69fef511e9a2
Running in Windows
Additional steps:

1. Install Visual C++ Build Tools https://landinghub.visualstudio.com/visual-cpp-build-tools

2. Install Git Bash with Git for Windows https://git-scm.com/download/win

3. Run in Git Bash:

> ./export_all.sh -s 0 -e 999999 -b 100000 -p 'file:\\\\.\pipe\geth.ipc' -o output
Running in Docker
1. Install Docker https://docs.docker.com/install/

2. Build a docker image:

> docker build -t ethereum-etl:latest .
> docker image ls

3. Run a container out of the image:

> docker run -v $HOME/output:/ethereum-etl/output ethereum-etl:latest -s 0 -e 5499999 -b 100000 -p https://mainnet.infura.io
> docker run -v $HOME/output:/ethereum-etl/output ethereum-etl:latest -s 2018-01-01 -e 2018-01-01 -b 100000 -p https://mainnet.infura.io
Command Reference
- export_blocks_and_transactions.py
- export_token_transfers.py
- extract_token_transfers.py
- export_receipts_and_logs.py
- export_contracts.py
- export_tokens.py
- export_traces.py
- export_geth_traces.py
- extract_geth_traces.py
- get_block_range_for_date.py
- get_keccak_hash.py
All the commands accept a `-h` parameter for help, e.g.:
> python export_blocks_and_transactions.py -h
Usage: export_blocks_and_transactions.py [OPTIONS]
Export blocks and transactions.
Options:
-s, --start-block INTEGER Start block
-e, --end-block INTEGER End block [required]
-b, --batch-size INTEGER The number of blocks to export at a time.
-p, --provider-uri TEXT The URI of the web3 provider e.g.
file://$HOME/Library/Ethereum/geth.ipc or
https://mainnet.infura.io
-w, --max-workers INTEGER The maximum number of workers.
--blocks-output TEXT The output file for blocks. If not provided
blocks will not be exported. Use "-" for stdout
--transactions-output TEXT The output file for transactions. If not
provided transactions will not be exported. Use
"-" for stdout
-h, --help Show this message and exit.
For the `--output` parameters the supported types are csv and json. The format type is inferred from the output file name.
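A minimal sketch of how such extension-based inference works in general (an illustration, not the tool's actual code):

```python
def infer_output_format(filename):
    # Pick the export format from the file extension; illustrative only.
    if filename.endswith(".json"):
        return "json"
    if filename.endswith(".csv"):
        return "csv"
    raise ValueError("Could not infer format from file name: " + filename)
```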
export_blocks_and_transactions.py
> python export_blocks_and_transactions.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc \
--blocks-output blocks.csv --transactions-output transactions.csv
Omit `--blocks-output` or `--transactions-output` options if you want to export only transactions/blocks.

You can tune `--batch-size`, `--max-workers` for performance.
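The two knobs are independent: `--batch-size` sets how many blocks are requested at a time, while `--max-workers` sets how many of those batches are in flight concurrently. A rough sketch of the general pattern (an assumption about the design, not the tool's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def export_range(start_block, end_block, batch_size=100, max_workers=5):
    # Split the block range into batches and process them concurrently;
    # sketches the batch-size/max-workers pattern, not the tool's code.
    batches = [
        (b, min(b + batch_size - 1, end_block))
        for b in range(start_block, end_block + 1, batch_size)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda r: export_batch(*r), batches))

def export_batch(batch_start, batch_end):
    ...  # placeholder: fetch and write blocks batch_start..batch_end
```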
export_token_transfers.py
The API used in this command is not supported by Infura, so you will need a local node. If you want to use Infura for exporting ERC20 transfers, refer to extract_token_transfers.py.
> python export_token_transfers.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output token_transfers.csv
Include `--tokens <token1> --tokens <token2>` to filter only certain tokens, e.g.
> python export_token_transfers.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv \
--tokens 0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0 --tokens 0x06012c8cf97bead5deae237070f9587f8e7a266d
You can tune `--batch-size`, `--max-workers` for performance.
export_receipts_and_logs.py
First extract transaction hashes from `transactions.csv` (exported with export_blocks_and_transactions.py):
> python extract_csv_column.py --input transactions.csv --column hash --output transaction_hashes.txt
Then export receipts and logs:
> python export_receipts_and_logs.py --transaction-hashes transaction_hashes.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --receipts-output receipts.csv --logs-output logs.csv
Omit `--receipts-output` or `--logs-output` options if you want to export only logs/receipts.

You can tune `--batch-size`, `--max-workers` for performance.
Upvote this feature request openethereum/parity-ethereum#9075; it will make receipts and logs export much faster.
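The extract_csv_column.py step above is a simple transform; the equivalent in plain Python is roughly (a sketch, not the helper's actual source):

```python
import csv

# Write the "hash" column of transactions.csv one value per line,
# mirroring the extract_csv_column.py invocation above.
with open("transactions.csv") as inp, open("transaction_hashes.txt", "w") as out:
    for row in csv.DictReader(inp):
        out.write(row["hash"] + "\n")
```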
extract_token_transfers.py
First export receipt logs with export_receipts_and_logs.py.
Then extract transfers from the `logs.csv` file:
> python extract_token_transfers.py --logs logs.csv --output token_transfers.csv
You can tune `--batch-size`, `--max-workers` for performance.
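Under the hood, this step filters the logs for the ERC20 `Transfer` event, whose topic is the keccak hash of `Transfer(address,address,uint256)`, and decodes its fields. A condensed sketch, assuming the `topics` column is a comma-separated string (its schema type above is `string`):

```python
# keccak("Transfer(address,address,uint256)"), the ERC20 Transfer event topic
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def decode_transfer(log):
    # Simplified sketch of turning a logs.csv row into a token_transfers.csv
    # row; not the tool's actual implementation.
    topics = log["topics"].split(",")
    if topics[0] != TRANSFER_TOPIC or len(topics) < 3:
        return None  # not an ERC20-style Transfer with indexed from/to
    return {
        "token_address": log["address"],
        "from_address": "0x" + topics[1][-40:],  # low 20 bytes of the topic
        "to_address": "0x" + topics[2][-40:],
        "value": int(log["data"], 16),           # uint256 amount
    }
```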
export_contracts.py
First extract contract addresses from `receipts.csv` (exported with export_receipts_and_logs.py):
> python extract_csv_column.py --input receipts.csv --column contract_address --output contract_addresses.txt
Then export contracts:
> python export_contracts.py --contract-addresses contract_addresses.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output contracts.csv
You can tune `--batch-size`, `--max-workers` for performance.
export_tokens.py
First extract token addresses from `contracts.json` (exported with export_contracts.py):
> python filter_items.py -i contracts.json -p "item['is_erc20'] or item['is_erc721']" | \
python extract_field.py -f address -o token_addresses.txt
Then export ERC20 / ERC721 tokens:
> python export_tokens.py --token-addresses token_addresses.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output tokens.csv
You can tune `--max-workers` for performance.
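The filter_items.py | extract_field.py pipeline above operates on newline-delimited JSON; in plain Python it is roughly (a sketch, not the helpers' actual source):

```python
import json

# Keep ERC20/ERC721 contracts and write their addresses one per line,
# mirroring the filter_items.py | extract_field.py pipeline above.
with open("contracts.json") as inp, open("token_addresses.txt", "w") as out:
    for line in inp:
        item = json.loads(line)
        if item["is_erc20"] or item["is_erc721"]:
            out.write(item["address"] + "\n")
```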
export_traces.py
The API used in this command is not supported by Infura, so you will need a local Parity archive node (`parity --tracing on`).
> python export_traces.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/parity.ipc --batch-size 100 --output traces.csv
You can tune `--batch-size`, `--max-workers` for performance.
export_geth_traces.py
The API used in this command is not supported by Infura, so you will need a local Geth archive node (`geth --gcmode archive --syncmode full`).
> python export_geth_traces.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output geth_traces.json
You can tune `--batch-size`, `--max-workers` for performance.
extract_geth_traces.py
> python extract_geth_traces.py --input geth_traces.json --output traces.csv
You can tune `--batch-size`, `--max-workers` for performance.
get_block_range_for_date.py
> python get_block_range_for_date.py --provider-uri=https://mainnet.infura.io --date 2018-01-01
4832686,4838611
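Block timestamps increase monotonically, so mapping a date to a block range can be done with a binary search over block numbers. A hedged sketch with web3.py (v4-style API; an illustration of the idea, not the tool's exact algorithm):

```python
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://mainnet.infura.io"))

def first_block_at_or_after(target_ts):
    # Binary-search block numbers for the first block whose timestamp
    # is >= target_ts; valid because timestamps only ever increase.
    lo, hi = 0, w3.eth.blockNumber
    while lo < hi:
        mid = (lo + hi) // 2
        if w3.eth.getBlock(mid).timestamp < target_ts:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

A date's range is then `first_block_at_or_after(midnight UTC)` through `first_block_at_or_after(next midnight UTC) - 1`.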
get_keccak_hash.py
> python get_keccak_hash.py -i "transfer(address,uint256)"
0xa9059cbb2ab09eb219583f4a59a5d0623ade346d962bcd4e46b11da047c9049b
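The same hash can be computed programmatically, e.g. with web3.py (one of several ways; assumes a v4-style API):

```python
from web3 import Web3

# Keccak-256 of the canonical function signature. The first 4 bytes
# (0xa9059cbb) are the method ID that appears in transaction input data.
print(Web3.sha3(text="transfer(address,uint256)").hex())
# 0xa9059cbb2ab09eb219583f4a59a5d0623ade346d962bcd4e46b11da047c9049b
```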
Running Tests
> export ETHEREUM_ETL_RUN_SLOW_TESTS=True
> pytest -vv
Querying in Amazon Athena
1. Upload the files to S3:

> cd output
> aws s3 sync . s3://<your_bucket>/ethereumetl/export --region ap-southeast-1

2. Sign in to Athena https://console.aws.amazon.com/athena/home

3. Create a database:

CREATE DATABASE ethereumetl;

4. Create the tables:
- blocks: schemas/aws/blocks.sql
- transactions: schemas/aws/transactions.sql
- token_transfers: schemas/aws/token_transfers.sql
- contracts: schemas/aws/contracts.sql
- receipts: schemas/aws/receipts.sql
- logs: schemas/aws/logs.sql
- tokens: schemas/aws/tokens.sql
Tables for Parquet Files
Read this article on how to convert CSVs to Parquet https://medium.com/@medvedev1088/converting-ethereum-etl-files-to-parquet-399e048ddd30
- Create the tables:
- parquet_blocks: schemas/aws/parquet/parquet_blocks.sql
- parquet_transactions: schemas/aws/parquet/parquet_transactions.sql
- parquet_token_transfers: schemas/aws/parquet/parquet_token_transfers.sql
Note that the DECIMAL type is limited to 38 digits in Hive https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-decimal so values with more than 38 digits will be null.
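A quick check shows why this matters for Ethereum data: a 32-byte unsigned integer can need more than twice as many digits as Hive's DECIMAL allows:

```python
# The largest uint256 value has 78 decimal digits, well beyond Hive's
# 38-digit DECIMAL limit, so such values end up as null.
max_uint256 = 2**256 - 1
print(len(str(max_uint256)))  # 78
```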
Querying in Google BigQuery
Refer to https://github.com/medvedev1088/ethereum-etl-airflow for the instructions.
Public Dataset
You can query the data that's updated daily in the public BigQuery dataset https://medium.com/@medvedev1088/ethereum-blockchain-on-google-bigquery-283fb300f579