zksync-era-ETL

Best zkSync-era ETL ever 😜


zksync-era-ETL: on-chain data tool


Introduction

As Ethereum continues to evolve, the role of Layer 2 (L2) solutions like rollups becomes increasingly pivotal. These innovations are crucial in reducing transaction costs on Ethereum, but they also present new challenges, such as fragmented liquidity. In this rapidly changing landscape, leading L2 platforms are gaining prominence, and I anticipate that in the near future, a select few will handle the majority of significant transactions.

In this regard, zkSync stands out as a potential leader. Its continuous optimization positions it alongside other major L2 solutions like Optimism and Arbitrum. Recognizing zkSync's potential to become a 'Super Rollup', I developed zkSync-ETL. This tool is designed for efficient and real-time access to on-chain data, a crucial need for developers and analysts in the Ethereum ecosystem.

zkSync-ETL is an ongoing project, and we warmly welcome ideas, feedback, and contributions to help keep it a valuable resource for anyone looking to leverage the power of zkSync in their Ethereum-based applications.

Architecture

High-Level

The zkSync-ETL is structured into two primary components: the /data module for data storage, and the /era module for data processing, which itself comprises the /rpc, /json, and /db submodules described below.

Data Acquisition (/rpc Module): This module interfaces with the zkSync RPC, where running a local node is advisable (see external node documentation for guidance). It retrieves raw block and transaction data in JSON format.
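
At its core, fetching raw blocks comes down to issuing JSON-RPC batch requests. The helper below is a hypothetical sketch of that step (the actual internals of the /rpc module may differ):

```python
import json

def build_batch_request(start_block: int, end_block: int) -> str:
    """Build a JSON-RPC batch payload requesting full blocks
    (with transaction objects) for the range [start_block, end_block)."""
    batch = [
        {
            "jsonrpc": "2.0",
            "id": n,
            "method": "eth_getBlockByNumber",
            "params": [hex(n), True],  # True -> include full tx objects
        }
        for n in range(start_block, end_block)
    ]
    return json.dumps(batch)

# The payload can then be POSTed to RPC_URL (e.g. with urllib.request).
payload = build_batch_request(100, 103)
```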

Data Processing (/json Module): Within the /json module, raw data is cleaned and processed into comprehensively clean data, currently comprising seven core tables:

  • accounts
  • balances
  • blocks
  • contracts
  • SyncSwap swaps
  • token transfers
  • transactions

Future updates aim to include data from mainstream DEXs, NFTs, and derivative protocols.
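
As a rough illustration of the cleaning step, a raw JSON-RPC transaction might be flattened into a row for the transactions table like this (field selection and names here are assumptions, not the module's actual schema):

```python
def clean_transaction(raw_tx: dict) -> dict:
    """Flatten a raw JSON-RPC transaction object into a flat row
    (illustrative field selection only)."""
    return {
        "hash": raw_tx["hash"],
        "block_number": int(raw_tx["blockNumber"], 16),
        "from_address": raw_tx["from"],
        "to_address": raw_tx.get("to"),     # None for contract creation
        "value": int(raw_tx["value"], 16),  # wei, as integer
        "gas_price": int(raw_tx["gasPrice"], 16),
    }

raw = {
    "hash": "0xabc",
    "blockNumber": "0x1e240",
    "from": "0x1111",
    "to": "0x2222",
    "value": "0xde0b6b3a7640000",
    "gasPrice": "0x5f5e100",
}
row = clean_transaction(raw)
```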

Database Management (/db Module): The db module is responsible for creating PostgreSQL tables and data schemas. It imports all data in CSV format into these tables. This setup enables the development of custom data programs akin to Dune, Nansen, and The Graph, utilizing zkSync data. Additionally, these datasets can be instrumental in researching the Ethereum and zkSync ecosystems.
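
A minimal sketch of what the CSV import step amounts to, as a PostgreSQL COPY command (the table and file names are illustrative; the real schemas live in /db/schemas):

```python
def copy_statement(table: str, csv_path: str) -> str:
    """Build a PostgreSQL COPY command for bulk-loading a CSV file
    with a header row into the given table."""
    return (
        f"COPY {table} FROM '{csv_path}' "
        "WITH (FORMAT csv, HEADER true)"
    )

# With a driver such as psycopg2, this would be executed roughly as:
#   cur.execute(copy_statement("blocks", "/data/json_to_csv/blocks.csv"))
stmt = copy_statement("blocks", "/data/json_to_csv/blocks.csv")
```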

Low-Level

  • /data

    • /json_raw_data: Raw JSON data of blocks & transactions.
    • /json_clean_data: Clean JSON data of all tables.
    • /json_to_csv: Clean CSV data of all tables, ready for import into the PostgreSQL DB.
  • /era

    • /rpc: Get raw JSON data from zkSync RPC.

      • /fetch: Call to get raw blocks and transactions data.
      • /trace: Call to get raw trace data.
    • /json: Convert raw JSON data to clean JSON/CSV data, plus application-specific cleaners.

      • /structures: Define the data structures of the base tables.
      • /resolver: A tool that assists in converting the base tables from raw data to clean data.
      • /cleaner: Core module that converts all raw JSON data to clean JSON and CSV data. Parsing for more applications will also be encapsulated here.
    • /db: Module for importing data into a database.

      • /schemas: Define the data structure of all tables in the PostgreSQL database.
      • /exporter: Import clean CSV data from all tables into the database.
    • /setup: Some basic setup.

      • /config: Block ranges, file size, folder size, RPC URL, etc.
      • /tokens: Token addresses for balance data.
    • /utils: All the utils crates used as dependencies of the module crates above.

How to use it

Create a VENV:

It is recommended to run the ETL inside a virtual environment.

# Create venv
brew install pyenv
pyenv virtualenv 3.11.4 myenv

# Activate (either method works)
pyenv activate myenv
source ~/.pyenv/versions/myenv/bin/activate

Setup

In the /setup module, configure the block range, folder size, and RPC URL for data retrieval.

Block Range: Select the specific range of blocks to source your on-chain data.

File Size and Folder Size: By default, data is stored in units of 10,000 blocks per file and 100,000 blocks per folder. Adjust these settings based on your storage preferences.

RPC URL: The default setting is the zkSync public RPC, but given its rate limits and performance constraints, it's advisable to use a local node. For setup details, please refer to the zkSync official team's guidance.

# Example
FILE_SIZE = 10000  # 10k
FOLDER_SIZE = 100000  # 100k

START_BLOCK = 0
END_BLOCK = 1000000  # block 0 to 999,999
BATCH_SIZE = 100
MULTI_BATCH_SIZE = 100

BALANCE_BATCH_SIZE = 10

RPC_URL = 'https://mainnet.era.zksync.io'
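
With these defaults, a block's folder and file can be located by integer division. The helper below is a hypothetical illustration of the layout, not part of the codebase:

```python
FILE_SIZE = 10_000
FOLDER_SIZE = 100_000

def block_path(block_number: int) -> str:
    """Map a block number to its folder/file bucket, e.g.
    block 123456 -> folder 100000-199999, file 120000-129999."""
    folder_start = (block_number // FOLDER_SIZE) * FOLDER_SIZE
    file_start = (block_number // FILE_SIZE) * FILE_SIZE
    folder = f"{folder_start}-{folder_start + FOLDER_SIZE - 1}"
    file_name = f"{file_start}-{file_start + FILE_SIZE - 1}.json"
    return f"{folder}/{file_name}"
```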

Data Processing Procedure

# Get raw data from RPC
python -m era.rpc.fetch.call

# Get clean data from raw data
python -m era.json.cleaner.all

# Create schemas for DB
python -m era.db.schemas.create

# Import all data into DB
python -m era.db.exporter.all

Contribution

Contributions of any kind are welcome! 🎉