BinSum - Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models

Important

We are in the process of releasing the dataset, adding more data and implementation details, and improving the documentation. Please stay tuned.

What's New?

[Dec. 17, 2023] Our paper is now publicly available on arXiv. We are in the process of releasing the dataset.

Introduction

BinSum is a comprehensive benchmark and dataset of over 557K binary functions, accompanied by a novel method for prompt synthesis and optimization. To gauge LLM performance more accurately, we also propose a new semantic similarity metric that surpasses traditional exact-match approaches. Our extensive evaluation of prominent LLMs, including ChatGPT, GPT-4, Llama 2, and Code Llama, reveals 10 pivotal insights. The evaluation generated 4 billion inference tokens and incurred a total cost of 11,418 US dollars and 873 NVIDIA A100 GPU hours. Our findings highlight both the transformative potential of LLMs in this field and the challenges yet to be overcome.
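To illustrate why an exact-match score can understate the quality of a generated summary, here is a minimal sketch. It is not the metric proposed in the paper: the helper names (`tokens`, `exact_match`, `cosine_similarity`) and the two example summaries are hypothetical, and a bag-of-words cosine is only a simple stand-in for a real semantic metric, which would typically compare learned text embeddings instead.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(reference, candidate):
    """Return 1.0 only if both summaries have identical token sequences."""
    return 1.0 if tokens(reference) == tokens(candidate) else 0.0


def cosine_similarity(reference, candidate):
    """Cosine similarity between token-count vectors (illustrative stand-in)."""
    a, b = Counter(tokens(reference)), Counter(tokens(candidate))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical ground-truth and model-generated summaries of one function.
reference = "Copy a string into a buffer"
candidate = "Copies the string into the buffer"

print(exact_match(reference, candidate))        # no credit for a near-paraphrase
print(cosine_similarity(reference, candidate))  # partial credit for shared words
```

The exact-match score gives the paraphrase no credit at all, while the graded score rewards the overlap. An embedding-based metric goes further by also recognizing that "copy" and "copies" are related, which this word-count version cannot do.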

Dataset

Coming soon.

Citation

If you find BinSum useful, please consider citing our paper:

BibTeX:

@article{jin2023binary,
  title={Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models},
  author={Xin Jin and Jonathan Larson and Weiwei Yang and Zhiqiang Lin},
  journal={arXiv preprint arXiv:2312.09601},
  year={2023},
}
