OSUSecLab/SymLM

Some questions about the dataset

Closed this issue · 1 comment

Hello. Thanks for making your great work publicly available.
Here are some questions about the dataset used in the experiment.

  • In Table 3, you report the number of binaries for each project. For instance, libtomcrypt is reported to have 6528 binaries and binutils 3750 binaries. However, to my knowledge, compiling libtomcrypt produces a libtomcrypt.a and a libtomcrypt.so, and compiling binutils generates 16 binaries such as ar and objdump. Even when we compile them for 4 platforms with 4 optimization levels, we cannot get that many binaries. May I ask how you calculated the number of binaries?
  • Another question is about the overlap between your training and test datasets. Table 3 reports that you selected projects such as binutils and coreutils. Based on my observation, the coreutils binaries share a large amount of common code: all source code in the lib directory is compiled into a libcoreutils.a, which is linked into every binary. The binutils binaries have the same problem. Additionally, the selected projects may reuse each other's code; for example, busybox reuses some code from coreutils. In the paper, you mention that you split the dataset 8:1:1 into training, validation, and test sets. I guess you skipped elaborating on how you reduce the overlap between them due to limited space. Could you provide some words on it?
  • I came up with the last question after reading the README of dataset_generation. It says you skip complex and trivial functions by this line. If my understanding is correct (a small sketch of my reading follows after this list), len(inst_pos) returns the number of tokens, and an instruction like push rsp counts as two tokens. This is not mentioned in the published paper. May I ask what proportion of functions is filtered out? And is it possible to provide some statistics about the distribution of the selected functions by number of instructions, basic blocks, and so on?
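To make the question concrete, here is a minimal sketch of my reading of that filter. The `tokenize` helper, the thresholds, and the example values are my own assumptions for illustration, not the repository's actual code:

```python
def tokenize(disassembly):
    """Split each instruction into its opcode/operand tokens."""
    return [tok for inst in disassembly for tok in inst.split()]

def keep_function(disassembly, min_tokens=5, max_tokens=510):
    """Skip trivial (too short) and complex (too long) functions by token count."""
    inst_pos = tokenize(disassembly)  # "push rsp" -> ["push", "rsp"], i.e. 2 tokens
    return min_tokens <= len(inst_pos) <= max_tokens

# 3 instructions but 7 tokens: token counts grow faster than instruction counts.
print(keep_function(["push rsp", "mov rbp , rsp", "ret"]))  # True with these toy thresholds
```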

Again, thanks for your excellent work. I am looking forward to your reply.

Hi there,

Thank you for your interest!

To collect the binary dataset, we had compilation scripts (which I can't locate at the moment) to compile all of the open-source projects. After compilation, we collected all the executables produced under the different optimization levels and architectures. We then ran the obfuscator to produce the obfuscated binaries. Finally, we counted all the binaries generated.
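A minimal sketch of that counting step, assuming all compiled and obfuscated outputs sit under one dataset directory (the layout and helper below are illustrative, not our exact scripts):

```python
from pathlib import Path

def count_binaries(root):
    """Count every ELF file (executables and shared objects) under the dataset root."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file():
            with open(path, "rb") as f:
                if f.read(4) == b"\x7fELF":  # ELF magic; skips .a archives and non-binaries
                    total += 1
    return total

# Each (architecture, optimization, obfuscation) variant of the same program is a
# separate file, so per-project totals grow multiplicatively with the build matrix.
print(count_binaries("dataset/binaries"))  # hypothetical dataset root
```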

To curate the training, validation, and test sets, we split the binary dataset 8:1:1 at the binary level, which guarantees there is no overlap at the binary level. This dataset generation method is the same as that of our baseline, NERO. We didn't perform additional deduplication steps.
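A minimal sketch of such a binary-level split (the helper below is illustrative, not our actual preprocessing code):

```python
import random

def split_binaries(binary_paths, seed=0):
    """Shuffle the list of binaries and split it 8:1:1 into train/valid/test."""
    rng = random.Random(seed)
    paths = sorted(binary_paths)
    rng.shuffle(paths)
    n_train = int(0.8 * len(paths))
    n_valid = int(0.1 * len(paths))
    return (paths[:n_train],                   # training binaries
            paths[n_train:n_train + n_valid],  # validation binaries
            paths[n_train + n_valid:])         # test binaries

train, valid, test = split_binaries(["bin_%03d" % i for i in range(100)])
print(len(train), len(valid), len(test))  # 80 10 10
```

Because the split happens on whole binaries, a single binary never contributes functions to more than one set; functions duplicated across different binaries (e.g., statically linked library code) can still appear in multiple sets, consistent with not performing additional deduplication.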

Filtering input sequences by token length is common practice for language models, and we observed that long token sequences (e.g., length = 1,024) hurt training and validation efficiency. We didn't count the proportion of filtered functions, and different projects/developers can have different programming styles (e.g., a tendency toward long function bodies), so these statistics can be collected by tuning the threshold while parsing the binaries. Please let us know if we can help, and let us know if you have any other questions.
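For reference, a small sketch of how such statistics could be collected once per-function token counts are available while parsing; the helper names and toy values below are assumptions for illustration:

```python
from collections import Counter

def dropped_fraction(token_counts, max_tokens=510):
    """Fraction of functions that a max-length filter would remove."""
    dropped = sum(1 for n in token_counts if n > max_tokens)
    return dropped / len(token_counts)

def length_histogram(token_counts, bucket=64):
    """Bucket functions by token count to inspect the length distribution."""
    return Counter((n // bucket) * bucket for n in token_counts)

token_counts = [12, 85, 240, 1300, 97, 2048, 33]  # toy per-function token counts
print(dropped_fraction(token_counts))              # ~0.29 with these toy values
print(sorted(length_histogram(token_counts).items()))
```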