cross-arch-instr-model.github.io

Thank you for looking at our work! The programs included here were created for the following paper:

"A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis"

Kimberly Redmond, Lannan Luo, and Qiang Zeng

The NDSS Workshop on Binary Analysis Research (BAR), 2019.

############################

The trained cross-architecture instruction embedding model used in our paper is included in the output/ directory. Please remember to unzip the four output files.

Our embeddings were trained using the Bivec model, which is based on Word2Vec. You may find it here:

https://github.com/lmthang/bivec

############################

ABOUT THESE PROGRAMS

All file paths and instruction selections are hard-coded into these programs. For your convenience, they are listed in variables near the top; feel free to modify them for your use.

./senvec.py

Returns ROC plots and AUC scores for cross-architecture basic block similarity tests. Basic block embeddings are calculated by summing the embeddings of the instructions within a block.

Similarity is computed using cosine similarity.
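
As a rough illustration of the summing and cosine steps described above, here is a minimal sketch. It is not taken from senvec.py; the function names and the tiny 3-dimensional vectors are hypothetical and chosen only to keep the example self-contained.

```python
import math


def block_embedding(instruction_vectors):
    """Sum per-instruction vectors component-wise into one block vector."""
    return [sum(components) for components in zip(*instruction_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings: each inner list stands for one instruction.
x86_block = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
arm_block = [[0.9, 0.1, 1.8], [0.6, 0.9, 0.1]]

similarity = cosine_similarity(block_embedding(x86_block),
                               block_embedding(arm_block))
print(f"block similarity: {similarity:.4f}")
```

Because the block vector is a plain sum, instruction order within a block does not affect the result.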

./tsne2.py

Returns two t-SNE figures with different displays: 1) an unlabeled figure displaying all instructions in one vector space, and 2) a labeled figure displaying selected instructions in one vector space.

./instr_sim.py

Returns two ROC plots and AUC scores for instruction-level similarity tests. Instructions are evaluated in pairs, in two ways: 1) mono-architecture, and 2) cross-architecture.

The similarity metric used is cosine similarity.
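
For readers unfamiliar with how an AUC score follows from pairwise similarity scores, the sketch below computes it directly from its rank-based definition. It is only an illustration under assumptions: the function name, labels, and scores are hypothetical, and instr_sim.py may well compute AUC differently (for example, with a library routine).

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen similar pair (label 1)
    scores higher than a randomly chosen dissimilar pair (label 0).
    Ties count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = instruction pair labeled similar, 0 = dissimilar.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical cosine similarity scores for the same six pairs.
scores = [0.95, 0.80, 0.40, 0.60, 0.30, 0.10]

auc = roc_auc(labels, scores)
print(f"AUC: {auc:.3f}")
```

The double loop is quadratic in the number of pairs, which is fine for a toy example but would be slow on a large evaluation set.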

./query.py

Given an instruction, returns its most similar instructions according to cosine similarity: the top 6 instructions from its own architecture (#1 is the instruction itself, leaving 5 others) and the top 5 instructions from the other architecture.
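
The lookup described above can be sketched as a nearest-neighbor search over two embedding tables. This is a minimal illustration, not the code in query.py: the function names, the instruction tokens, and the 2-dimensional vectors are all hypothetical, and k is set to 2 instead of 5 to fit the toy data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(query_vector, embeddings, k):
    """Return the k (instruction, score) pairs most similar to query_vector."""
    scored = [(name, cosine_similarity(query_vector, vec))
              for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embedding tables, one per architecture.
x86 = {"x86_add": [1.0, 0.0], "x86_sub": [0.0, 1.0], "x86_mov": [0.7, 0.7]}
arm = {"arm_add": [0.9, 0.1], "arm_sub": [0.1, 0.9], "arm_mov": [0.6, 0.8]}

k = 2
query = x86["x86_add"]
# Ask for k + 1 within the query's own architecture, since the query
# instruction itself ranks first with similarity 1.0.
same_arch = top_k_similar(query, x86, k + 1)
cross_arch = top_k_similar(query, arm, k)

print("same architecture:", same_arch)
print("other architecture:", cross_arch)
```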