Some question about the dataset
Closed this issue · 1 comment
Hello. Thanks for making your great work publicly available.
I have some questions about the dataset used in the experiments.
- In Table 3, you report the number of binaries for each project. For instance, `libtomcrypt` has 6528 binaries and `binutils` has 3750. However, to my knowledge, compiling `libtomcrypt` produces a `libtomcrypt.a` and a `libtomcrypt.so`, and compiling `binutils` generates 16 binaries such as `ar` and `objdump`. Even when we compile them for 4 platforms with 4 optimization levels, we cannot get that many binaries. May I ask how you calculated the number of binaries?
- Another question is about overlap between your training and test datasets. Table 3 reports that you selected projects like `binutils` and `coreutils`. Based on my observation, the binaries of `coreutils` share a large amount of common code: all source code in the `lib` directory is compiled into a `libcoreutils.a`, which is linked into every binary. The binaries of `binutils` have the same problem. Additionally, the selected projects may have reuse relations; for example, `busybox` reuses some code from `coreutils`. In the paper, you mention that you split the dataset 8:1:1 into training, validation, and test sets. I guess you skipped elaborating on how you reduced the overlap between them due to limited space. Could you say a few words about it?
- I came up with the last question after reading the `README` of `dataset_generation`. It says you skip complex and trivial functions by this line. If my understanding is correct, `len(inst_pos)` returns the number of tokens, and an instruction like `push rsp` has two tokens. This is not mentioned in the published paper. May I ask what proportion of functions is filtered out? And is it possible to provide some statistics on the distribution of the selected functions by number of instructions, basic blocks, and so on?
Again, thanks for your excellent work. I am looking forward to your reply.
Hi there,
Thank you for your interest!
To collect the binary dataset, we had compilation scripts (which I can't locate at the moment) to compile all the open-source projects. After compilation, we collected all the executables produced under the different optimization levels and architectures. We then ran the obfuscator to produce the obfuscated binaries. Finally, we counted the total number of binaries generated.
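The counting step can be sketched roughly like this. Note this is a sketch, not our original script: the directory layout, architecture/optimization/obfuscation lists, and the `count_binaries` helper are all hypothetical stand-ins.

```python
from itertools import product
from pathlib import Path

# Hypothetical build matrix; the actual architectures, optimization levels,
# and obfuscation passes used in the paper may differ.
ARCHES = ["x86", "x64", "arm", "mips"]
OPT_LEVELS = ["O0", "O1", "O2", "O3"]
OBFUSCATIONS = ["none", "bcf", "fla", "sub"]

def count_binaries(build_root: str) -> int:
    """Count every file emitted across all build configurations,
    assuming each configuration writes to its own output directory."""
    total = 0
    for arch, opt, obf in product(ARCHES, OPT_LEVELS, OBFUSCATIONS):
        out_dir = Path(build_root) / arch / opt / obf
        total += sum(1 for p in out_dir.glob("*") if p.is_file())
    return total
```

Because every (architecture, optimization, obfuscation) combination contributes its own copies of a project's executables, the per-project totals grow multiplicatively, which is why they are much larger than the number of distinct programs.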
To curate the training, validation, and test sets, we split the binary dataset 8:1:1 at the binary level, which guarantees that there is no overlap at the binary level. This dataset generation method is the same as our baseline, NERO. We didn't perform additional deduplication steps.
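A binary-level 8:1:1 split can be sketched as follows. This is only an illustration of the idea, not the script we actually used; the function name and fixed seed are assumptions.

```python
import random

def split_binaries(binary_paths, seed=0):
    """Shuffle binaries deterministically and split 8:1:1, so that no
    binary appears in more than one of train/validation/test."""
    paths = sorted(binary_paths)          # stable order before shuffling
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * 0.8)
    n_valid = int(len(paths) * 0.1)
    train = paths[:n_train]
    valid = paths[n_train:n_train + n_valid]
    test = paths[n_train + n_valid:]
    return train, valid, test
```

As you note, this prevents the same *binary* from crossing split boundaries, but functions duplicated across binaries (e.g., code statically linked from `libcoreutils.a`) can still appear in multiple splits.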
Filtering input sequences by token length is a common practice for language models, and we observed that long token sequences (e.g., length = 1,024) hurt training and validation efficiency. We didn't count the proportion of filtered functions, and different developers and projects can have different programming styles (e.g., some tend to write long function bodies). These statistics can be collected by adjusting the threshold while parsing the binaries. Please let us know if we can help, and if you have any other questions.
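If you want to collect those statistics yourself, a minimal sketch could look like the following. The thresholds, the `filter_stats` helper, and the bucket width are all hypothetical; plug in whatever values match the filtering line in `dataset_generation`.

```python
from collections import Counter

MIN_TOKENS, MAX_TOKENS = 4, 512  # hypothetical thresholds, not the paper's

def filter_stats(functions):
    """Given {function_name: token_list}, report the fraction of functions
    a token-length filter drops and a 64-token-bucket length histogram of
    the functions that survive."""
    kept = {name: toks for name, toks in functions.items()
            if MIN_TOKENS <= len(toks) <= MAX_TOKENS}
    histogram = Counter(len(toks) // 64 * 64 for toks in kept.values())
    dropped = 1 - len(kept) / len(functions) if functions else 0.0
    return dropped, histogram
```

Running this over the parsed functions, with the threshold varied, would give both the filtered-out proportion and the length distribution you asked about.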