Export blocks and transactions (Schema, Reference):
> python export_blocks_and_transactions.py --start-block 0 --end-block 500000 \
--provider-uri https://mainnet.infura.io --blocks-output blocks.csv --transactions-output transactions.csv
Export ERC20 and ERC721 transfers (Schema, Reference):
> python export_token_transfers.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv
Export receipts and logs (Schema, Reference):
> python export_receipts_and_logs.py --transaction-hashes transaction_hashes.txt \
--provider-uri https://mainnet.infura.io --receipts-output receipts.csv --logs-output logs.csv
Export ERC20 and ERC721 token details (Schema, Reference):
> python export_tokens.py --token-addresses token_addresses.csv \
--provider-uri https://mainnet.infura.io --output tokens.csv
Read this article https://medium.com/@medvedev1088/exporting-and-analyzing-ethereum-blockchain-f5353414a94e
blocks.csv:

Column | Type |
---|---|
number | bigint |
hash | hex_string |
parent_hash | hex_string |
nonce | hex_string |
sha3_uncles | hex_string |
logs_bloom | hex_string |
transactions_root | hex_string |
state_root | hex_string |
receipts_root | hex_string |
miner | address |
difficulty | numeric |
total_difficulty | numeric |
size | bigint |
extra_data | hex_string |
gas_limit | bigint |
gas_used | bigint |
timestamp | bigint |
transaction_count | bigint |

transactions.csv:

Column | Type |
---|---|
hash | hex_string |
nonce | bigint |
block_hash | hex_string |
block_number | bigint |
transaction_index | bigint |
from_address | address |
to_address | address |
value | numeric |
gas | bigint |
gas_price | bigint |
input | hex_string |

token_transfers.csv:

Column | Type |
---|---|
token_address | address |
from_address | address |
to_address | address |
value | numeric |
transaction_hash | hex_string |
log_index | bigint |
block_number | bigint |

receipts.csv:

Column | Type |
---|---|
transaction_hash | hex_string |
transaction_index | bigint |
block_hash | hex_string |
block_number | bigint |
cumulative_gas_used | bigint |
gas_used | bigint |
contract_address | address |
root | hex_string |
status | bigint |

logs.csv:

Column | Type |
---|---|
log_index | bigint |
transaction_hash | hex_string |
transaction_index | bigint |
block_hash | hex_string |
block_number | bigint |
address | address |
data | hex_string |
topics | string |

contracts.csv:

Column | Type |
---|---|
address | address |
bytecode | hex_string |
function_sighashes | string |
is_erc20 | boolean |
is_erc721 | boolean |

tokens.csv:

Column | Type |
---|---|
address | address |
symbol | string |
name | string |
decimals | bigint |
total_supply | numeric |
You can find column descriptions in schemas/gcp
Note: the symbol, name, decimals, and total_supply columns in tokens.csv can have empty values when the contract doesn't implement the corresponding methods or implements them incorrectly (e.g. wrong return type).
Note: for the address type all hex characters are lower-cased. The boolean type can have 2 values: True or False.
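As an illustration of these notes, here is a minimal sketch (assuming pandas, which is not a dependency of this repo) that loads tokens.csv and coerces the optionally-empty columns:

```python
import pandas as pd

# Load tokens.csv exported by export_tokens.py; keep the string columns as strings.
tokens = pd.read_csv('tokens.csv', dtype={'address': str, 'symbol': str, 'name': str})

# decimals and total_supply may be empty when the contract doesn't implement
# the corresponding methods, so coerce them to numbers (NaN when empty).
tokens['decimals'] = pd.to_numeric(tokens['decimals'], errors='coerce')
tokens['total_supply'] = pd.to_numeric(tokens['total_supply'], errors='coerce')

# Addresses are lower-cased in the export, so use lower-cased addresses in lookups.
ck = tokens[tokens['address'] == '0x06012c8cf97bead5deae237070f9587f8e7a266d']
print(ck[['symbol', 'name', 'decimals']])
```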
- Install Python 3.5 or 3.6 https://www.python.org/downloads/
- You can use Infura if you don't need ERC20 transfers (Infura doesn't support the eth_getFilterLogs JSON RPC method). For that, use the -p https://mainnet.infura.io option for the commands below. If you need ERC20 transfers or want to export the data ~40 times faster, you will need to set up a local Ethereum node:
  - Install geth https://github.com/ethereum/go-ethereum/wiki/Installing-Geth
  - Start geth. Make sure it has downloaded the blocks that you need by executing eth.syncing in the JS console. You can export blocks below currentBlock; there is no need to wait for the full sync, as the state is not needed (unless you also need contract bytecode and token details).
- Clone Ethereum ETL and install the dependencies:
  > git clone https://github.com/medvedev1088/ethereum-etl.git
  > cd ethereum-etl
  > pip install -r requirements.txt
- Export all:
  > ./export_all.sh -h
  Usage: ./export_all.sh -s <start_block> -e <end_block> -b <batch_size> -p <provider_uri> [-o <output_dir>]
  > ./export_all.sh -s 0 -e 5499999 -b 100000 -p file://$HOME/Library/Ethereum/geth.ipc -o output
  The result will be in the output subdirectory, partitioned in Hive style (a sketch for reading it back follows this list):
  output/blocks/start_block=00000000/end_block=00099999/blocks_00000000_00099999.csv
  output/blocks/start_block=00100000/end_block=00199999/blocks_00100000_00199999.csv
  ...
  output/transactions/start_block=00000000/end_block=00099999/transactions_00000000_00099999.csv
  ...
  output/token_transfers/start_block=00000000/end_block=00099999/token_transfers_00000000_00099999.csv
  ...
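The partitioned output can be read back as one dataset; a minimal sketch assuming pandas (adjust the glob pattern to your output directory):

```python
import glob
import pandas as pd

# Collect every blocks file across the Hive-style partitions written by export_all.sh.
files = sorted(glob.glob('output/blocks/start_block=*/end_block=*/blocks_*.csv'))

# Concatenate into a single DataFrame; the same pattern works for transactions
# and token_transfers.
blocks = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# timestamp is a unix timestamp in seconds (see the blocks.csv schema above).
blocks['timestamp'] = pd.to_datetime(blocks['timestamp'], unit='s')
print(len(blocks), 'blocks, from', blocks['number'].min(), 'to', blocks['number'].max())
```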
Should work with geth and parity, on Linux, Mac, Windows.
If you use Parity you should disable warp mode with the --no-warp option, because warp mode does not place all of the block or receipt data into the database: https://wiki.parity.io/Getting-Synced
Tested with Python 3.6, geth 1.8.7, Ubuntu 16.04.4
If you see weird behavior, e.g. wrong number of rows in the CSV files or corrupted files, check this issue: blockchain-etl#28
You can use AWS Auto Scaling and Data Pipeline to reduce the exporting time to a few hours. Read this article for details https://medium.com/@medvedev1088/how-to-export-the-entire-ethereum-blockchain-to-csv-in-2-hours-for-10-69fef511e9a2
Additional steps for Windows:
- Install Visual C++ Build Tools https://landinghub.visualstudio.com/visual-cpp-build-tools
- Install Git Bash with Git for Windows https://git-scm.com/download/win
- Run in Git Bash:
  > ./export_all.sh -s 0 -e 999999 -b 100000 -p 'file:\\\\.\pipe\geth.ipc' -o output
- export_blocks_and_transactions.py
- export_token_transfers.py
- extract_token_transfers.py
- export_receipts_and_logs.py
- export_contracts.py
- export_tokens.py
- get_block_range_for_date.py
All the commands accept the -h parameter for help, e.g.:
> python export_blocks_and_transactions.py -h
usage: export_blocks_and_transactions.py [-h] [-s START_BLOCK] -e END_BLOCK
[-b BATCH_SIZE] --provider-uri PROVIDER_URI
[-w MAX_WORKERS]
[--blocks-output BLOCKS_OUTPUT]
[--transactions-output TRANSACTIONS_OUTPUT]
Export blocks and transactions.
For the --output parameters the supported types are csv and json. The format type is inferred from the output file name.
> python export_blocks_and_transactions.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --blocks-output blocks.csv --transactions-output transactions.csv
Omit --blocks-output or --transactions-output options if you want to export only transactions/blocks.
You can tune --batch-size, --max-workers for performance.
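JSON output is newline-delimited (one object per line), which is what the BigQuery loading commands further below rely on; a minimal sketch for reading it back, assuming pandas:

```python
import pandas as pd

# If the command above was run with --blocks-output blocks.json, each line of the
# file is a single JSON object, so read it with lines=True.
blocks = pd.read_json('blocks.json', lines=True)
print(blocks[['number', 'gas_used', 'transaction_count']].head())
```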
The API used in this command is not supported by Infura, so you will need a local node. If you want to use Infura for exporting ERC20 transfers, refer to extract_token_transfers.py.
> python export_token_transfers.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output token_transfers.csv
Include --tokens <token1> <token2> to filter only certain tokens, e.g.:
> python export_token_transfers.py --start-block 0 --end-block 500000 --provider-uri file://$HOME/Library/Ethereum/geth.ipc \
--output token_transfers.csv --tokens 0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0 0x06012c8cf97bead5deae237070f9587f8e7a266d
You can tune --batch-size, --max-workers for performance.
First extract transaction hashes from transactions.csv (exported with export_blocks_and_transactions.py):
> python extract_csv_column.py --input transactions.csv --column transaction_hash --output transaction_hashes.txt
Then export receipts and logs:
> python export_receipts_and_logs.py --transaction-hashes transaction_hashes.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --receipts-output receipts.csv --logs-output logs.csv
Omit --receipts-output or --logs-output options if you want to export only logs/receipts.
You can tune --batch-size, --max-workers for performance.
Upvote this feature request openethereum/parity-ethereum#9075; it will make receipts and logs export much faster.
First export receipt logs with export_receipts_and_logs.py.
Then extract transfers from the logs.csv file:
> python extract_token_transfers.py --logs logs.csv --output token_transfers.csv
You can tune --batch-size, --max-workers for performance.
First extract contract addresses from receipts.csv (exported with export_receipts_and_logs.py):
> python extract_csv_column.py --input receipts.csv --column contract_address --output contract_addresses.txt
Then export contracts:
> python export_contracts.py --contract-addresses contract_addresses.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output contracts.csv
You can tune --batch-size, --max-workers for performance.
First extract token addresses from contracts.json (exported with export_contracts.py):
> python filter_items.py -i contracts.json -p "item['is_erc20'] or item['is_erc721']" | \
python extract_field.py -f address -o token_addresses.txt
Then export ERC20 / ERC721 tokens:
> python export_tokens.py --token-addresses token_addresses.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output tokens.csv
You can tune --max-workers for performance.
Upvote this pull request to make tokens export faster: ethereum/web3.py#944 (comment)
> python get_block_range_for_date.py --provider-uri=https://mainnet.infura.io --date 2018-01-01
4832686,4838611
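The start,end output can be fed straight into the other commands; a minimal sketch using subprocess (the dated output file names are just illustrative):

```python
import subprocess

# Resolve the block range for a date, then export that day's blocks and transactions.
out = subprocess.check_output([
    'python', 'get_block_range_for_date.py',
    '--provider-uri', 'https://mainnet.infura.io',
    '--date', '2018-01-01',
]).decode().strip()
start_block, end_block = out.split(',')

subprocess.check_call([
    'python', 'export_blocks_and_transactions.py',
    '--start-block', start_block, '--end-block', end_block,
    '--provider-uri', 'https://mainnet.infura.io',
    '--blocks-output', 'blocks_2018-01-01.csv',
    '--transactions-output', 'transactions_2018-01-01.csv',
])
```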
> export ETHEREUM_ETL_RUN_SLOW_TESTS=True
> pytest -vv
- Upload the files to S3:
> cd output
> aws s3 sync . s3://<your_bucket>/ethereumetl/export --region ap-southeast-1
- Sign in to Athena https://console.aws.amazon.com/athena/home
- Create a database:
  CREATE DATABASE ethereumetl;
- Create the tables:
- blocks: schemas/aws/blocks.sql
- transactions: schemas/aws/transactions.sql
- token_transfers: schemas/aws/token_transfers.sql
- contracts: schemas/aws/contracts.sql
- receipts: schemas/aws/receipts.sql
- logs: schemas/aws/logs.sql
- tokens: schemas/aws/tokens.sql
Read this article on how to convert CSVs to Parquet https://medium.com/@medvedev1088/converting-ethereum-etl-files-to-parquet-399e048ddd30
- Create the tables:
- parquet_blocks: schemas/aws/parquet/parquet_blocks.sql
- parquet_transactions: schemas/aws/parquet/parquet_transactions.sql
- parquet_token_transfers: schemas/aws/parquet/parquet_token_transfers.sql
Note that the DECIMAL type is limited to 38 digits in Hive (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-decimal), so values with more than 38 digits will be null.
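A minimal sketch (plain Python, purely illustrative) for checking how many transaction values would exceed that limit before converting to Parquet:

```python
import csv

# Count transaction values that need more than 38 digits; these will come out
# as null in the DECIMAL columns of the Parquet tables described above.
too_wide = 0
with open('transactions.csv', newline='') as f:
    for row in csv.DictReader(f):
        if len(row['value'].lstrip('0')) > 38:
            too_wide += 1
print(too_wide, 'transaction values wider than 38 digits')
```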
To upload CSVs to BigQuery:
- Install Google Cloud SDK https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu
- Create a new Google Storage bucket https://console.cloud.google.com/storage/browser
- Upload the files:
  > cd output
  > gsutil -m rsync -r . gs://<your_bucket>/ethereumetl/export
- Sign in to BigQuery https://bigquery.cloud.google.com/
- Create a new dataset called ethereum
- Load the files from the bucket to BigQuery:
> cd ethereum-etl
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 ethereum.blocks gs://<your_bucket>/ethereumetl/export/blocks/*.csv ./schemas/gcp/blocks.json
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 ethereum.transactions gs://<your_bucket>/ethereumetl/export/transactions/*.csv ./schemas/gcp/transactions.json
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 ethereum.token_transfers gs://<your_bucket>/ethereumetl/export/token_transfers/*.csv ./schemas/gcp/token_transfers.json
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 ethereum.receipts gs://<your_bucket>/ethereumetl/export/receipts/*.csv ./schemas/gcp/receipts.json
> bq --location=US load --replace --source_format=NEWLINE_DELIMITED_JSON ethereum.logs gs://<your_bucket>/ethereumetl/export/logs/*.json ./schemas/gcp/logs.json
> bq --location=US load --replace --source_format=NEWLINE_DELIMITED_JSON ethereum.contracts gs://<your_bucket>/ethereumetl/export/contracts/*.json ./schemas/gcp/contracts.json
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines ethereum.tokens_duplicates gs://<your_bucket>/ethereumetl/export/tokens/*.csv ./schemas/gcp/tokens.json
Note that NEWLINE_DELIMITED_JSON is used to support REPEATED mode for the columns with lists.
Join transactions and receipts:
> bq mk --table --description "Exported using https://github.com/medvedev1088/ethereum-etl" --time_partitioning_field timestamp_partition ethereum.transactions_join_receipts ./schemas/gcp/transactions_join_receipts.json
> bq --location=US query --replace --destination_table ethereum.transactions_join_receipts --use_legacy_sql=false "$(cat ./schemas/gcp/transactions_join_receipts.sql | tr '\n' ' ')"
Deduplicate tokens:
> bq mk --table --description "Exported using https://github.com/medvedev1088/ethereum-etl" ethereum.tokens ./schemas/gcp/tokens.json
> bq --location=US query --replace --destination_table ethereum.tokens --use_legacy_sql=false "$(cat ./schemas/gcp/tokens_deduplicate.sql | tr '\n' ' ')"
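Once the tables are in place you can also query them programmatically; a minimal sketch using the google-cloud-bigquery client (not a dependency of this repo; it assumes application default credentials and the ethereum dataset created above in your default project):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query the deduplicated tokens table created by the commands above.
query = """
    SELECT address, symbol, name, decimals
    FROM ethereum.tokens
    ORDER BY symbol
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.address, row.symbol, row.name, row.decimals)
```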
You can query the data that I exported in the public BigQuery dataset https://medium.com/@medvedev1088/ethereum-blockchain-on-google-bigquery-283fb300f579
I'm currently working on a SaaS solution for analysts and developers. The MVP will have the following:
- Built on top of AWS, cost efficient
- Can provide access to raw CSV data if needed
- Support for internal transactions in the future
- Support for Bitcoin and other blockchains in the future
- ERC20 token metrics in the future
Contact me if you would like to contribute: evge.medvedev@gmail.com