BinMoCo

Source code of BinMoCo: Improving Binary Code Similarity Detection with Hard Sample-Aware Momentum Contrastive Learning.

Environment

radare2 5.7.8 + r2pipe 1.7.4
torch-geometric 2.1.0
transformers 4.28.1
pytorch 1.12.1
python 3.10.6
lightning 2.0.2
connectorx 0.3.1
faiss-gpu 1.7.2
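
The pins above can be checked against the active Python environment with a short script. This is a minimal sketch, not part of the repository: the distribution names in `PINNED` are assumptions about how each package is published on PyPI (for example `torch` for pytorch), and radare2 and the Python interpreter are left out because they are not pip distributions.

```python
from importlib import metadata

# Versions copied from the Environment list above. The keys are assumed
# PyPI distribution names, not names taken from the repository.
PINNED = {
    "r2pipe": "1.7.4",
    "torch-geometric": "2.1.0",
    "transformers": "4.28.1",
    "torch": "1.12.1",
    "lightning": "2.0.2",
    "connectorx": "0.3.1",
    "faiss-gpu": "1.7.2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # A local build tag such as "1.12.1+cu113" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty output means every listed package matches its pin.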

Train and Evaluate (BFS dataset)

The same procedure applies to the BINKIT dataset, which can be downloaded here.

  1. Download the BFS dataset here. It was introduced in the paper "How Machine Learning Is Solving the Binary Function Similarity Problem".

  2. Build BFS Database

python code/build_db.py --db_name sec22
  • This will generate db.sqlite in database/sec22
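
To confirm that the database was built, the generated file can be inspected with Python's standard `sqlite3` module. The sketch below assumes nothing about the schema that `build_db.py` produces: it only lists whatever tables exist and counts their rows. The helper name `list_tables` is hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names are read from the database itself and quoted as identifiers.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling `list_tables("database/sec22/db.sqlite")` after the build step should return a non-empty mapping. An empty result suggests the build did not populate the file.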
  3. Build function groups
python code/build_group.py database/sec22
  4. Build vocabs
python code/build_vocab.py database/sec22
python code/build_imp_vocab.py database/sec22
  5. Train BinMoCo
python code/train.py \
    --data_dir database/sec22 \
    --data_repr cfg_cg \
    --num_edge_type 3 \
    --embedding_dims 128 \
    --seq_model transformer \
    --seq_hidden_dims 128 \
    --seq_layers 4 \
    --gnn_name gatedgcn \
    --gnn_hidden_dims 128 \
    --gnn_layers 5 \
    --gnn_out_dims 128 \
    --train_batch_size 30 \
    --batch_k 5 \
    --val_batch_size 30 \
    --train_num_each_epoch 300000 \
    --num_epochs 30 \
    --num_workers 12 \
    --learning_rate 0.001 \
    --miner_type ms \
    --loss_type ms \
    --use_moco \
    --memory_size 16384 \
    --early_stopping 15 \
    --precision 16 \
    --save_name sec22_cfg_cg_trans_gatedgcn_gin_ms_ms_moco
  6. Build test data (four tasks: XO, XC, XA, XM, with different pool sizes)
python code/build_testdata.py database/sec22
  7. Test the trained model
python code/test.py results/dml/sec22_cfg_cg_trans_gatedgcn_gin_ms_ms_moco/version_0
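
The data-preparation commands that precede training (building the database, the function groups, and the vocabs) can be chained so that a failure stops the run early. The following is a rough sketch, not a script shipped with the repository. The helper names `preprocessing_commands` and `run_all` are hypothetical, and it assumes that `--db_name NAME` always writes to `database/NAME`, which the steps above only show for `sec22`. It also uses `sys.executable` in place of the bare `python` command.

```python
import subprocess
import sys


def preprocessing_commands(db_name="sec22"):
    """Return the pre-training commands from the steps above as argv lists."""
    # Assumption: the output directory is database/<db_name>, as shown for sec22.
    db_dir = f"database/{db_name}"
    return [
        [sys.executable, "code/build_db.py", "--db_name", db_name],
        [sys.executable, "code/build_group.py", db_dir],
        [sys.executable, "code/build_vocab.py", db_dir],
        [sys.executable, "code/build_imp_vocab.py", db_dir],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        print("running:", " ".join(argv[1:]))
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit.
        runner(argv, check=True)
```

Running `run_all(preprocessing_commands("sec22"))` from the repository root executes the four commands in sequence. Training, test-data generation, and testing are left as separate manual steps because they take longer and their arguments vary per experiment.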

Reference

We refer to the following repositories during implementation: