BERT4ETH

This is the repository for the code and datasets used in the paper BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection, accepted by the ACM Web Conference (WWW) 2023.

Here you can find our slides.

Getting Started

Requirements

  • Python >= 3.6.1
  • NumPy >= 1.12.1
  • TensorFlow >= 1.4.0
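To check that your environment meets these versions, a quick sanity script like the following can help (a minimal sketch, not part of the repo):

import sys
import numpy as np
import tensorflow as tf

# Print library versions and assert the minimum Python version.
assert sys.version_info >= (3, 6, 1), "Python >= 3.6.1 is required"
print("NumPy:", np.__version__)        # expect >= 1.12.1
print("TensorFlow:", tf.__version__)   # expect >= 1.4.0 (TF 1.x API)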

Preprocess the Dataset

Step 1: Download the dataset from Google Drive.

Step 2: Unzip the dataset under the "BERT4ETH/Data/" directory.

cd BERT4ETH/Data; # Labels are already included
unzip ...;

The total volume of the unzipped dataset is quite large (more than 15 GB).

If you only want to run the basic BERT4ETH model, there is no need to download the ERC-20 log dataset.

The advanced features (in/out separation and the ERC-20 log) make the model noticeably less efficient.

Step 3: Transaction Sequence Generation

cd Model/bert4eth;
python gen_seq.py --phisher=True \
                  --deanon=True \ 
                  --mev=True \
                  --dup=True \
                  --dataset=1M \
                  --bizdate=bert4eth_1M_min3_dup
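Conceptually, this step turns the raw transaction records into one time-ordered sequence per account, dropping accounts with too few transactions (the "min3" in the signature above suggests a minimum of three). The sketch below illustrates the idea only; the field names and threshold are assumptions, not the repo's actual schema:

from collections import defaultdict

def build_sequences(transactions, min_len=3):
    """Group transactions per account and order them by time.

    transactions: iterable of dicts with illustrative keys
    'from_address', 'to_address', 'timestamp'. Each transaction is
    appended to both the sender's and the receiver's sequence, and
    accounts with fewer than min_len transactions are dropped.
    """
    seqs = defaultdict(list)
    for tx in transactions:
        seqs[tx["from_address"]].append(tx)
        seqs[tx["to_address"]].append(tx)
    return {addr: sorted(txs, key=lambda t: t["timestamp"])
            for addr, txs in seqs.items() if len(txs) >= min_len}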

Pre-training

Step 0: Model Configuration

The configuration file is "Model/BERT4ETH/bert_config.json":

{
  "attention_probs_dropout_prob": 0.2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.2,
  "hidden_size": 64,
  "initializer_range": 0.02,
  "intermediate_size": 64,
  "max_position_embeddings": 50,
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "type_vocab_size": 2,
  "vocab_size": 3000000
}
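Since the configuration is plain JSON, you can inspect or tweak it programmatically before pre-training (a minimal sketch; the path assumes you run from the repository root):

import json

# Inspect the model configuration (path assumes the repository root).
with open("Model/BERT4ETH/bert_config.json") as f:
    config = json.load(f)

print(config["hidden_size"], config["num_hidden_layers"])  # 64, 2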

Step 1: Pre-training Sequence Generation

python gen_pretrain_data.py --source_bizdate=bert4eth_1M_min3_dup  \
                            --bizdate=bert4eth_1M_min3_dup_seq100_mask80  \ 
                            --max_seq_length=100  \
                            --dupe_factor=10 \
                            --masked_lm_prob=0.8 \
                            --do_eval=False
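Note that masked_lm_prob=0.8 masks a far larger fraction of each sequence than vanilla BERT's 15%, and dupe_factor=10 emits ten copies of every sequence, each with a fresh random mask. A minimal sketch of this masking logic (the [MASK] placeholder and per-token selection rule are illustrative, not the repo's exact implementation):

import random

MASK = "[MASK]"  # illustrative placeholder for the mask token

def make_masked_copies(seq, masked_lm_prob=0.8, dupe_factor=10):
    """Yield dupe_factor copies of seq, each with a fresh random mask."""
    for _ in range(dupe_factor):
        masked, labels = [], []
        for token in seq:
            if random.random() < masked_lm_prob:
                masked.append(MASK)
                labels.append(token)   # the model must recover this address
            else:
                masked.append(token)
                labels.append(None)    # excluded from the MLM loss
        yield masked, labels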

Step 2: Pre-train BERT4ETH Model

python run_pretrain.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                       --max_seq_length=100 \
                       --checkpointDir=bert4eth_1M_min3_dup_seq100_mask80_shared_zipfan5000 \
                       --epoch=5 \
                       --batch_size=256 \
                       --learning_rate=1e-4 \
                       --num_train_steps=1000000 \
                       --num_warmup_steps=100 \
                       --save_checkpoints_steps=8000 \
                       --neg_strategy=zip \
                       --neg_sample_num=5000 \
                       --neg_share=True \
                       --init_seed=1234
                       
Parameter Description

  • bizdate: The signature for this experiment run.
  • max_seq_length: The maximum input sequence length of BERT4ETH.
  • masked_lm_prob: The probability of masking an address.
  • epoch: Number of training epochs, default = 5.
  • batch_size: Batch size, default = 256.
  • learning_rate: Learning rate for the Adam optimizer, default = 1e-4.
  • num_train_steps: The maximum number of training steps, default = 1000000.
  • num_warmup_steps: The number of warm-up steps, default = 100.
  • save_checkpoints_steps: How often (in steps) checkpoints are saved, default = 8000.
  • neg_strategy: Strategy for negative sampling, default = zip; options: uniform, zip, freq.
  • neg_share: Whether to enable the in-batch sharing strategy, default = True.
  • neg_sample_num: The number of negative samples per batch, default = 5000.
  • do_eval: Whether to run evaluation during training, default = False.
  • checkpointDir: The directory in which to save checkpoints.
  • init_seed: The initial random seed, default = 1234.
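For intuition, neg_strategy=zip draws negative samples from a Zipf-like, frequency-skewed distribution over the address vocabulary rather than uniformly, and neg_share=True reuses one pool of negatives for the whole batch. A minimal sketch of such a sampler (the exponent alpha and the rank-by-frequency assumption are illustrative):

import numpy as np

def sample_zipf_negatives(vocab_size, num_samples=5000, alpha=1.0):
    """Sample negative address ids from a Zipf-like distribution.

    Assumes ids are ranked by frequency (id 0 = most frequent), so
    popular addresses are drawn far more often than rare ones.
    """
    ranks = np.arange(1, vocab_size + 1, dtype=np.float64)
    probs = ranks ** (-alpha)
    probs /= probs.sum()
    return np.random.choice(vocab_size, size=num_samples, replace=True, p=probs)

negatives = sample_zipf_negatives(vocab_size=3000000)  # one shared pool per batch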

Step 3: Output Representation

python run_embed.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \ 
                    --init_checkpoint=bert4eth_1M_min3_dup_seq100_mask80_shared_zipfan5000/model_104000 \  
                    --max_seq_length=100 \
                    --neg_sample_num=5000 \
                    --neg_strategy=zip \
                    --neg_share=True 

Testing on the Account Representations

Phishing Account Detection

cd BERT4ETH/Model;
python run_phishing_detection.py --algo=bert4eth \
                                 --model_index=XXX
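If you prefer to probe the representations outside the provided script, the idea is to train a simple classifier on the account embeddings against the phishing labels. A minimal scikit-learn sketch (the .npy file names are hypothetical placeholders; check run_embed.py for the actual output format):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Placeholder inputs: one embedding row per account; y[i] = 1 for phishers.
X = np.load("embeddings.npy")  # hypothetical file name
y = np.load("labels.npy")      # hypothetical file name

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))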

De-anonymization (ENS dataset)

cd BERT4ETH/Model;
python run_dean_ENS.py --metric=euclidean \
                       --algo=bert4eth \
                       --model_index=XXX

De-anonymization (Tornado Cash)

cd BERT4ETH/Model;
python run_dean_Tornado.py --metric=euclidean \
                           --algo=bert4eth \
                           --model_index=XXX
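Both de-anonymization tests reduce to nearest-neighbor retrieval in the embedding space under the chosen metric (--metric=euclidean above): two addresses are predicted to belong to the same entity when their embeddings are closest. A minimal sketch of that retrieval step (array names are placeholders):

import numpy as np

def nearest_neighbor(queries, candidates):
    """Index of the Euclidean nearest candidate for each query row."""
    # Pairwise squared distances via broadcasting: shape [n_q, n_c].
    d2 = ((queries[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)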

Fine-tuning on Phishing Account Detection

cd BERT4ETH/Model;
python gen_finetune_phisher_data.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \ 
                                    --source_bizdate=bert4eth_1M_min3_dup \
                                    --max_seq_length=100 
cd BERT4ETH/Model/BERT4ETH
python run_finetune_phisher.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \ 
                               --max_seq_length=100 --checkpointDir=tmp
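Conceptually, fine-tuning attaches a small binary classification head to the pre-trained encoder and trains the whole model on the phishing labels. A minimal TF 1.x-style sketch of such a head (the pooling choice and layer shapes are illustrative; the repo's actual head may differ):

import tensorflow as tf

def phishing_logits(sequence_output):
    """Binary classification head over the encoder output (illustrative).

    sequence_output: float tensor of shape [batch, seq_len, hidden]
    produced by the pre-trained encoder.
    """
    pooled = sequence_output[:, 0, :]      # pool the first position
    logits = tf.layers.dense(pooled, 1)    # single phishing logit (TF 1.x)
    return tf.squeeze(logits, axis=-1)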

Citation

If you find this repository useful, please give us a star and cite our paper : ) Thank you!

@article{hu2023bert4eth,
  title={BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection},
  author={Hu, Sihao and Zhang, Zhen and Luo, Bingqiao and Lu, Shengliang and He, Bingsheng and Liu, Ling},
  journal={arXiv preprint arXiv:2303.18138},
  year={2023}
}

Q&A

If you have any questions, you can either open an issue or contact me (sihaohu@gatech.edu), and I will reply as soon as I see it.