/ethecycle

Primary LanguagePythonOtherNOASSERTION

Transaction Graph DB

  • Wallet addresses are vertices AKA nodes. All vertices currently use the label wallet so the .hasLabel() predicates seen in many tutorials are irrelevant and do not need to be used.
  • Transactions are edges and contain properties like value, block_number, etc. All edges use the label transaction so the same rule applies. Theoretically we could distinguish ETH txions from ERC20 txions from gas fees from swaps etc. down the road but for now they're just transaction.
  • Look in Graph for examples of queries that actually work.

Prerequisites

This is known to work with these versions of docker and docker-compose.

$ docker-compose --version
Docker Compose version v2.10.2

$ docker --version
Docker version 20.10.17, build 100c701

Usage

To launch docker containers and load graph data:

# Clone repo and cd into the repo directory
git clone https://github.com/michelcrypt4d4mus/ethecycle.git
cd ethecycle

# Generate ssh key pair your containers can use to talk to each other and create some .env files:
scripts/docker/container_file_setup.sh

# Edit local copy of .env to set TXION_DATA_DIR as the location of some txion CSVs
# NOTE: After the first build you may want to set REBUILD_CHAIN_ADDRESS_DB to avoid rebuilding
#       the chain addresses DB every time.
vi .env

# When you run this command docker-compose should build everything and leave you in a
# bash shell, at which point you can run 'bpython' to get a python REPL etc.
scripts/docker/python_etl/shell.sh

# Run script to get updated JVM settings for neo4j once you have allocated docker memory:
scripts/docker/neo4j/generate_.neo4j.env_file.sh  # Add -h for help

Loading Data

Once you are in the container shell the ./load_transactions.py script will prep CSVs for Neo4j's bulk loader and then load them (unless you specify --extract-only). This approach uses the neo4j-admin database import tooling (documentation) which is theoretically significantly faster than LOAD CSV at getting data from the disk and into Neo4j.

load_transactions.py takes a directory of CSVS or a single CSV, processes them to add some columns (e.g. token_symbol and blockchain), does decimal conversion where it can, etc., and writes 2 output CSVs for each input CSV (wallets, txns) output/ directory along with two one line CSVs (one each for the wallet and transaction headers).

How to run it:

# Show help:
./load_transactions.py --help

# First time you must run with --drop to overwrite the database called 'neo4j' (community edition limitation):
./load_transactions.py /path/to/transactions.csv --drop

# You can also run it against an entire directory of CSVs:
./load_transactions.py /path/to/transactions/ --drop

# Load only USDT txions:
./load_transactions.py /path/to/transactions.csv --token USDT --drop

# Perform the extraction and transformation but display load command on screen rather than actually execute it:
./load_transactions.py /path/to/transactions.csv --drop --extract-only

Example output:

Running From Outside Of Docker Container

Cannot guarantee these steps work but they probably will work.

  1. Create a virtual env in the project dir: python -m venv .venv
  2. Activate the venv: . .venv/bin/activate

Queries

Index Creation Queries

Some reasonable guesses as to useful ways to index transactions can be found here.

Other Useful Commands

Outside the Container

  • Get a shell on the Python ETL container: scripts/docker/python_etl/shell.sh
  • Get bpython REPL on the Python ETL container: scripts/docker/python_etl/bpython.sh
  • Get a shell on the test env Python ETL container: scripts/docker/python_etl/test_shell.sh
  • Get a shell on the Neo4j container: scripts/docker/neo4j/shell.sh
  • Generate .env.neo4j file (example) with Neo4j official recommendations: scripts/docker/neo4j/generate_.neo4j.env_file.sh -h
  • Build images with fresh build of chain address DB: REBUILD_CHAIN_ADDRESS_DB=freshly_built_address_db docker-compose build (you can also change the value in .env but this is quicker)

On the Container

  • Display the wallet tags: show_chain_addresses
  • Display known tokens: show_tokens
  • Connect to the chain address sqlite DB: chain_address_db
  • Print a query that can be run on Dune to find new wallet tags: dune_query
  • Reimport chain addresses database: ./import_chain_addresses.py -h (note that these won't persist on the image!)
  • Set the environment variable DEBUG=true when running commands to see various debug ouutput

Neo4j

IMPORTANT: The community edition only allows you to have one database per server and it must be called neo4j.

After starting you can browse to http://localhost:7474/browser/ to run queries. Alternatively (and more 'performantly') Neo4j makes a desktop application.

Running Queries

  • Addresses start with 0x (same as etherscan)
  • All addresses in the DB are lowercased, so make sure to use toLower() on an address of mixed/upper case.
  • Occasionally Neo4j from docker messes up the permissions. If that happens it may help to get on the container and run
    cd /var/lib/neo4j/data
    sudo chown -R neo4j:neo4j databases/
    sudo chown -R neo4j:neo4j transactions/

Other Neo4J Resources

ETL Related Resources

Questions

  1. IIRC you said the txion amounts were already correctly adjusted for decimals? (AKA divided by 10^18 for most tokens)
  2. Current unique ID for edge is transaction_id = f"{self.transaction_hash}-{self.log_index}". Does that make sense?
  3. Do you have a rough estimate as far as blocks per hour and/or blocks per day?

Potential Queries / TODO

  1. Identify the largest short term pass through wallets (AKA wallets with large xfers in and out in a short time frame that end up w/0 balances and are not used again)
  2. More address sources:
    1. https://github.com/CryptoScamDB/blacklist/blob/master/data/urls.yaml
    2. Stolen OpenSea NFTs
    3. Ethereum Name Service

Other Technologies

  • ArangoDB - Second most commonly recommended after Neo4j.
  • Apache AGE - Postgres extension. No Tinkerpop support, only OpenCypher.
  • ArcadeDB - New fork of OrientDB. Gremlin and OpenCypher support.
  • MemGraph - In memory graph DB.
  • TigerGraph comes up sometimes