DeepBitsTechnology/DeepDi

About Dataset


Hi, I am a little confused about the LLVM 11 dataset used in your paper. Which binary files are included in this dataset, and how do you get the ground truth for it? If it is convenient, please share some details with me.

Thanks for your interest in our paper!

The LLVM dataset is from this repository: https://github.com/llvm/llvm-project. The binaries are compiled by MSVC with different optimization options, and we get the ground truth from the generated PDB files.
The PDB files can be parsed by DIA2Dump that comes with MSVC.

Sorry, I still don't have a clear understanding of the LLVM dataset. The repository https://github.com/llvm/llvm-project explains how to build LLVM, and a successful build generates many PE files. Do these PE files constitute the LLVM dataset? I am not sure whether this understanding is correct, and I hope to get your reply as soon as possible.

Yes, your understanding is correct.

Hi, does the BAP corpora dataset mentioned in the paper refer to BAP repositories such as https://github.com/BinaryAnalysisPlatform/x86-binaries?

OK, thanks!

Hi, while trying to get the ground truth for the binaries compiled by MSVC, I got confused about data ranges in the code section. As you mentioned, "treat the data address to the end of that label as data". How do you get the end address of the label, and is this estimation method accurate enough?

@peicwang You can get line number information from the PDB files, which contains the number of bytes associated with each line.
If a data label falls within the range of a certain line, the end of that line is treated as the end of the data label.
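
To make the rule concrete, here is a minimal sketch in Python. It is not the code used for the paper. It assumes the line records (start address and byte length) and the data label addresses have already been extracted from the PDB; the names `LineRecord` and `data_ranges` are hypothetical and exist only for this illustration.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class LineRecord(NamedTuple):
    """One line-number entry (hypothetical layout, assumed pre-extracted from the PDB)."""
    start: int   # address of the first byte associated with the source line
    length: int  # number of bytes associated with the source line


def data_ranges(lines: List[LineRecord], labels: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) data ranges, one per data label.

    A label that falls inside a line's byte range is treated as data
    from the label address up to the end of that line. Labels that are
    not covered by any line record are skipped.
    """
    ordered = sorted(lines)
    starts = [rec.start for rec in ordered]
    ranges = []
    for addr in sorted(labels):
        # Find the last line record that starts at or before the label.
        idx = bisect_right(starts, addr) - 1
        if idx < 0:
            continue
        rec = ordered[idx]
        line_end = rec.start + rec.length
        if addr < line_end:
            ranges.append((addr, line_end))
    return ranges


if __name__ == "__main__":
    lines = [LineRecord(0x1000, 0x10), LineRecord(0x1010, 0x20)]
    labels = [0x1008, 0x1018, 0x2000]  # the last label is outside every line
    for start, end in data_ranges(lines, labels):
        print(f"data: {start:#x}-{end:#x}")
```

The sketch assumes line records do not overlap, so the last record starting at or before the label is the only candidate that can contain it.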

We cannot guarantee it is 100% accurate, but we haven't observed any error so far, and Datalog is happy with our ground truth (100% accuracy).

We can be more confident about our ground truth if this issue is resolved.

Thank you, it works now.