BinShot

This is the official repository for BinShot (ACSAC '22), a practical binary code similarity detection tool built on BERT-based transferable similarity learning.

Requirements

pip install -r requirements.txt

In addition, you need to install the following:

  • python3 (tested on 3.8)
  • IDA Pro (tested on 7.6)
  • pytorch (tested on 1.11)

Run the code with published data

Download published data

The preprocessed data used in our paper is available at the following Google Drive link:
https://bit.ly/3xov03n
Download the data, then move the files into the "corpus" directory.
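
Before starting a long training run, it can help to confirm that the download landed in the right place. The following is a minimal sketch, not part of the repository: the helper name missing_corpus_files is hypothetical, and the file list is taken only from the commands in this README, so the actual download may contain more (or differently named) files.

```python
from pathlib import Path

# File names taken from the "-cd", "-vp", "-tn", "-vd" and "-tt" arguments used
# in this README. Assumption: the published archive uses exactly these names.
EXPECTED = [
    "pretrain.all.corpus.txt",
    "pretrain.all.corpus.voca",
    "binsim.all.train.corpus.txt",
    "binsim.all.valid.corpus.txt",
    "binsim.all.test.corpus.txt",
]


def missing_corpus_files(corpus_dir="corpus", expected=EXPECTED):
    """Return the expected file names that are not present in corpus_dir."""
    root = Path(corpus_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_corpus_files()
    if missing:
        print("Missing from corpus/:", ", ".join(missing))
    else:
        print("All expected corpus files found.")
```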

Pretraining

python3 bert_mlm.py \
            -cd corpus/pretrain.all.corpus.txt \
            -vp corpus/pretrain.all.corpus.voca \
            -op models/pretrain

Finetuning & Evaluation

python3 binshot.py \
            -bm models/pretrain/model_bert/bert_ep19.model \
            -vp corpus/pretrain.all.corpus.voca \
            -op models/downstream \
            -r all \
            -tn corpus/binsim.all.train.corpus.txt \
            -vd corpus/binsim.all.valid.corpus.txt \
            -tt corpus/binsim.all.test.corpus.txt

To get metrics across different compilers and optimization levels (e.g., clangO0 & gccO2), run the following command:

python result.py -s models/downstream/pred.test.all_all -v models/downstream

The result is written to a JSON file in models/downstream (set by the -v option).

Transferability Evaluation

python3 binshot.py \
            -bm models/pretrain/model_bert/bert_ep19.model \
            -vp corpus/pretrain.all.corpus.voca \
            -op models/spec06 \
            -r spec06 \
            -tn corpus/binsim.spec06.train.corpus.txt \
            -vd corpus/binsim.spec06.valid.corpus.txt \
            -tt corpus/binsim.spec06.test.corpus.txt

python3 binshot.py \
            -bm models/pretrain/model_bert/bert_ep19.model \
            -vp corpus/pretrain.all.corpus.voca \
            -op models/spec17 \
            -r spec17 \
            -tn corpus/binsim.spec17.train.corpus.txt \
            -vd corpus/binsim.spec17.valid.corpus.txt \
            -tt corpus/binsim.spec17.test.corpus.txt

python3 binshot.py \
            -bm models/[spec06,spec17]/model_sim/bert_ep19.model \
            -fm models/[spec06,spec17]/model_sim/sim_ep19.model \
            -vp corpus/pretrain.all.corpus.voca \
            -op models/[spec06,spec17] \
            -r [gnu,spec06,spec17,rwp] \
            -tt corpus/binsim.[gnu,spec06,spec17,rwp].test.corpus.txt
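
The bracketed values in the last command read as placeholders (pick one model and one test set per run) rather than literal shell syntax; that reading is an assumption. Under it, the sketch below builds one argument list per (model, test set) pair. The helper name transfer_commands is hypothetical; the flags and paths are copied from the command above. It includes pairs where the model and the test set share a name, since the placeholder lists both.

```python
from itertools import product

# Values copied from the bracketed placeholders in the command above.
MODELS = ["spec06", "spec17"]
TEST_SETS = ["gnu", "spec06", "spec17", "rwp"]


def transfer_commands(models=MODELS, test_sets=TEST_SETS):
    """Build one binshot.py argument list per (model, test set) pair."""
    commands = []
    for model, test in product(models, test_sets):
        commands.append([
            "python3", "binshot.py",
            "-bm", f"models/{model}/model_sim/bert_ep19.model",
            "-fm", f"models/{model}/model_sim/sim_ep19.model",
            "-vp", "corpus/pretrain.all.corpus.voca",
            "-op", f"models/{model}",
            "-r", test,
            "-tt", f"corpus/binsim.{test}.test.corpus.txt",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review; pass each list to subprocess.run to execute.
    for cmd in transfer_commands():
        print(" ".join(cmd))
```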

Practicality Evaluation

python3 binshot.py \
            -bm models/downstream/model_sim/bert_ep19.model \
            -fm models/downstream/model_sim/sim_ep19.model \
            -vp corpus/pretrain.all.corpus.voca \
            -op models/downstream \
            -r cve \
            -tt corpus/cve.corpus.txt

In this evaluation, a target should be predicted as positive if any of the functions of interest is similar to it.
To get metrics corresponding to our realistic scenario (see the paper), run the following command:

python result_cve.py -s models/downstream/pred.test.cve_all -v models/downstream

The result is written to a txt file in models/downstream (set by the -v option).
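
The "any match counts as positive" rule used in this evaluation can be illustrated with a short sketch. It is not the logic of result_cve.py: the record layout (target, function of interest, similarity score), the helper name predict_targets, and the 0.5 threshold are all hypothetical, chosen only to show the aggregation step.

```python
from collections import defaultdict


def predict_targets(pair_scores, threshold=0.5):
    """Mark a target positive if ANY function of interest scores >= threshold.

    pair_scores: iterable of (target_id, function_id, similarity_score).
    The tuple layout and the threshold are assumptions for illustration.
    """
    hits = defaultdict(bool)  # every target starts as negative
    for target, _function, score in pair_scores:
        hits[target] = hits[target] or score >= threshold
    return dict(hits)


if __name__ == "__main__":
    pairs = [
        ("target_a", "func_1", 0.2),
        ("target_a", "func_2", 0.9),  # one match is enough for target_a
        ("target_b", "func_1", 0.1),
    ]
    print(predict_targets(pairs))
```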

Generate input files with your own binaries

The following commands run on the sample binaries included in this repository.

Advance preparation

  • Each binary must have execute permission.
  • Binary names must follow the format "binname-IA-compiler-optlv" (e.g., find-amd64-gcc-O2).
  • Run the following command:
mkdir -p norm/findutils norm/cve norm/cve_strip
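
To check file names against the "binname-IA-compiler-optlv" format before running the pipeline, a small validator can help. This is a minimal sketch and not part of the repository: the names BinaryName and parse_binary_name are hypothetical, and splitting from the right is an assumption made so that a binary name containing hyphens still parses.

```python
from typing import NamedTuple


class BinaryName(NamedTuple):
    binname: str
    arch: str      # the "IA" field, e.g. amd64
    compiler: str  # e.g. gcc
    optlv: str     # e.g. O2


def parse_binary_name(filename):
    """Split "binname-IA-compiler-optlv" into its four fields.

    Splits from the right so that hyphens inside binname are preserved.
    Raises ValueError if the name does not have four non-empty fields.
    """
    parts = filename.rsplit("-", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected binname-IA-compiler-optlv, got {filename!r}")
    return BinaryName(*parts)


if __name__ == "__main__":
    print(parse_binary_name("find-amd64-gcc-O2"))
```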

Running IDA Pro

bash gen_ida.sh binary/findutils/
bash gen_ida.sh binary/cve/
bash gen_ida.sh binary/cve_strip/

Normalizing assembly code

bash gen_norm.sh binary/findutils/ norm/findutils/
bash gen_norm.sh binary/cve/ norm/cve/
bash gen_norm.sh binary/cve_strip/ norm/cve_strip/

Corpus generation for pretraining

python3 corpusgen.py -d binary/findutils/ -pkl norm/findutils/ -o corpus/ -t
python3 voca.py corpus/pretrain.findutils.corpus.voca.txt

Corpus generation for finetuning

python3 corpusgen.py -d binary/findutils/ -pkl norm/findutils/ -o corpus/ -b
python3 corpusgen.py -f corpus/binsim.findutils.corpus.txt -y binsimtask -p

Corpus generation for a realistic scenario

python3 corpusgen.py -d binary/cve/ -pkl norm/cve/ -o corpus/ -c

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If your research employs BinShot, please cite the following paper:

@INPROCEEDINGS{binshot,
  author = {Sunwoo Ahn and Seonggwan Ahn and Hyungjoon Koo and Yunheung Paek},
  title = {Practical Binary Code Similarity Detection with BERT-based
           Transferable Similarity Learning},
  booktitle = {Proceedings of the 38th Annual Computer Security
               Applications Conference (ACSAC)},
  month = {Dec.},
  year = {2022},
  location = {Austin, Texas}
}