BERT4ETH (PyTorch Version)

This is the PyTorch implementation for the paper BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection, accepted by the ACM Web conference (WWW) 2023.

I finished the first draft and will test the performance soon. 2023/11/17

Getting Start

Requirements

Preprocess dataset

Step 1: Download dataset from Google Drive.

The master branch hosts the basic BERT4ETH model. If you wish to run the basic BERT4ETH model, there is no need to download the ERC-20 log dataset. Advanced features such as In/out separation and ERC20 log can be found in the old branch.

Step 2: Unzip dataset under the directory of "BERT4ETH/Data/"

cd BERT4ETH/Data; # Labels are already included
unzip ...;

Step 3: Transaction Sequence Generation

cd Model;
python gen_seq.py --bizdate=bert4eth_exp

Pre-training

Step 1: Pre-train BERT4ETH

python run_pretrain.py --bizdate=bert4eth_exp \
                       --max_seq_length=100 \
                       --epoch=5 \
                       --batch_size=256 \
                       --learning_rate=1e-4 \
                       --num_train_steps=1000000 \
                       --save_checkpoints_steps=8000 \
                       --neg_strategy=zip \
                       --neg_sample_num=5000 \ 
                       --neg_share=True \ 
                       --checkpointDir=bert4eth_exp

Parameter	Description
`bizdate`	The signature for this experiment run.
`max_seq_length`	The maximum length of BERT4ETH.
`masked_lm_prob`	The probability of masking an address.
`epochs`	Number of training epochs, default = `5`.
`batch_size`	Batch size, default = `256`.
`learning_rate`	Learning rate for the optimizer (Adam), default = `1e-4`.
`num_train_steps`	The maximum number of training steps, default = `1000000`,
`save_checkpoints_steps`	The parameter controlling the step of saving checkpoints, default = `8000`.
`neg_strategy`	Strategy for negative sampling, default `zip`, options (`uniform`, `zip`, `freq`).
`neg_share`	Whether enable in-batch sharing strategy, default = `True`.
`neg_sample_num`	The negative sampling number for one batch, default = `5000`.
`checkpointDir`	Specify the directory to save the checkpoints.

Step 2: Output Representation

python output_embed.py --bizdate=bert4eth_exp \
                       --init_checkpoint=bert4eth_exp/model_104000 \
                       --max_seq_length=100 \
                       --neg_sample_num=5000 \
                       --neg_strategy=zip \
                       --neg_share=True

I have generated a version of embedding file, you can unzip it under the directory of "Model/inter_data/" and test the results.

Testing on output account representation

Phishing Account Detection

python run_phishing_detection.py --init_checkpoint=bert4eth_exp/model_104000 # Random Forest (RF)

python run_phishing_detection_dnn.py --init_checkpoint=bert4eth_exp/model_104000 # DNN, better than RF

De-anonymization (ENS dataset)

python run_dean_ENS.py --metric=euclidean \
                       --init_checkpoint=bert4eth_exp/model_104000

Fine-tuning for phishing account detection

python gen_finetune_phisher_data.py --bizdate=bert4eth_exp \ 
                                    --max_seq_length=100

python run_finetune_phisher.py --init_checkpoint=bert4eth_exp/model_104000 \
                               --bizdate=bert4eth_exp \ 
                               --max_seq_length=100 \ 
                               --checkpointDir=tmp

Citation

If you find this repository useful, please give us a star and cite our paper : ) Thank you!

@inproceedings{hu2023bert4eth,
  title={BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection},
  author={Hu, Sihao and Zhang, Zhen and Luo, Bingqiao and Lu, Shengliang and He, Bingsheng and Liu, Ling},
  booktitle={Proceedings of the ACM Web Conference 2023},
  pages={2189--2197},
  year={2023}
}

Q&A

If you have any questions, you can either open an issue or contact me (sihaohu@gatech.edu), and I will reply as soon as I see the issue or email.

fushun9675/BERT4ETH_PyTorch