This repository explains how to use the tools we developed for our FEAST 2020 publication "On the Generation of Disassembly Ground Truth and the Evaluation of Disassemblers". It also keeps track of the data sets we generated for the paper and any future data releases.
- Dataset(s)
- Related Repositories
- Using Our Disassembly Ground Truth
- Generating Disassembly Ground Truth
- Reproducing Our Evaluation Results
Reading tips:
- If you want to obtain and use the ground truth data generated by us to test a disassembler, please read Dataset(s) and Using Our Disassembly Ground Truth.
- If you want to generate ground truth on your own (and perhaps also add other projects into the benchmark suite), please read Related Repositories, Using Our Disassembly Ground Truth, and Generating Disassembly Ground Truth.
- If you want to verify the evaluation in our publication and reproduce our results, please read the entire file.
The following table records the release(s) of our ground truth data. The data contain both the binaries used and the corresponding ground truth generated by our system at the time indicated.
Version | File | SHA256 Checksum |
---|---|---|
2020-07-01 (FEAST 2020) | pangine-gt-data-20200701.tar.xz | 4e97cc4525b19e125eeedc3ab147d747f8a1d7856e4a695f7a357d3d649e0133 |
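After downloading a release, it may be worth checking the archive against the checksum in the table before using it. The following is a minimal Python sketch of that check. The helper names `sha256_of_file` and `verify_release` are hypothetical and are not part of our tools; only the file name and the checksum come from the table above.

```python
import hashlib

# Checksum of pangine-gt-data-20200701.tar.xz, copied from the release table.
EXPECTED_SHA256 = "4e97cc4525b19e125eeedc3ab147d747f8a1d7856e4a695f7a357d3d649e0133"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` has the expected SHA256 checksum."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify_release("pangine-gt-data-20200701.tar.xz")` should return `True` for an intact download of the 2020-07-01 release.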
We organize our project using multiple related repositories owned by github.com/pangine. The following table lists these repositories and shows their relevance to the tasks of (i) ground truth generation and (ii) experiment reproduction.
Repository | Generation | Reproduction |
---|---|---|
msvc-wine | ● | ● |
llvmmc-resolver | | ● |
disasm-eval-sources | ● | ● |
disasm-gt-generator | ● | ● |
disasm-eval-disasms | | ● |
disasm-eval-cmp | | ● |
We have built and tested our system using:
- Docker 19.03
- Go 1.14, with our code organized using the $GOPATH method (i.e., we are not using Go Modules yet)
- Ubuntu 20.04, running Linux kernel 5.4
We have also successfully reproduced our results with x64 macOS 10.15 and Docker 19.03.
The msvc-wine repository is our fork of https://github.com/mstorsjo/msvc-wine
and tracks how we generate a Docker image containing x64 Ubuntu 20.04, Wine, and
Microsoft Visual C++ (MSVC). Please make sure you build the image using the
2004-ninja branch and tag the generated image with the name pangine/msvc-wine.
In our system, this image is responsible for compiling the MSVC binaries and
generating the intermediate files needed for disassembly ground truth generation.
The llvmmc-resolver repository contains our machine-code-to-assembly decoding
server. In our system, we use LLVM MC for machine code decoding. This server
exposes the relevant LLVM MC APIs over a UNIX domain socket. This server is
required by both our ground truth generation toolkit and evaluation components.
The Docker images built for our evaluation, including the disassembler result
extractors and the result comparison program, are all based on the Docker image
defined in this repository. Therefore, you will need to build the image in this
repository if you want to re-execute our evaluation process. Please remember to
tag the generated image using the name pangine/llvmmc-resolver.
The disasm-eval-sources repository contains the source code of the projects that
we use in our ground truth dataset and the scripts to compile them. You need to
use Go to install the repository and call the disasm-eval-generate executable to
compile the projects in the way needed by our other tools. Please see the README
in the repository for explanations.
The disasm-gt-generator repository defines our ground truth generation toolkit, comprising the following three programs:
- The extractor that collects assembly files, object files, and mapping information from a compilation output of disasm-eval-sources.
- The ground truth generator
- The ground truth correctness checker
These tools can be installed using the Dockerfile in the repository. Please see the README of the repository for explanations.
The disasm-eval-disasms repository contains a set of disassembler result extractors. The goal of these extractors is to collect results from supported disassemblers and emit them in a format we specified using Cap'n Proto, so that they can be compared with our ground truth in our evaluation. At present, we have extractors for:
- BAP
- ddisasm
- Ghidra
- Radare2
- ROSE
These extractors can be installed using the Dockerfiles in the repository. Please see the README of the repository for explanations.
The disasm-eval-cmp repository contains three tools that we used for collecting data when evaluating disassembler results against our generated ground truth:
- The tool that compares disassembler results with our generated ground truth and outputs the comparison in CSV
- The tool that prints disassembler results or ground truth in a human-readable text format
- The tool that collects the statistical characteristics of the generated ground truth
These tools can be installed using the Dockerfile in the repository. Please see the README of the repository for explanations.
The following repositories will be automatically installed by Go due to imports. We list them here and state their purposes. Detailed explanations can be found in the README in these repositories.
- pangineDSM-utils: common APIs that define the basic data structures and I/O formats used by our tools
- pangineDSM-obj-x86-elf: APIs to classify the x86/x64 Linux ELF assembly instructions emitted by LLVM MC
- pangineDSM-obj-x86-coff: APIs to classify the x86/x64 Windows COFF assembly instructions emitted by LLVM MC
- pangineDSM-import: the Cap'n Proto specification used by disasm-eval-disasms when outputting disassembler results, and the Ghidra scripts needed to use Ghidra as a disassembler in its headless mode
In our system, the ground truth generated for a given binary BIN is stored in
a SQLite3 database called BIN.sqlite. The SQL statements below show and explain
the schema of the database:
CREATE TABLE insn
-- insn stores the ground truth instructions in the binary.
(
/* offset records the start virtual address of every instruction in the
binary. */
offset INTEGER PRIMARY KEY,
/* supplementary records additional information of an instruction in a JSON
object. At present, this object can contain one field called "optional" to
indicate whether an instruction fits a specific concept of optional as
explained in Sections 3.2.2 & 3.2.3 of our arXiv paper. As an optimization,
this field can be the empty string if the JSON object is the empty object. */
supplementary TEXT
);
CREATE TABLE func
-- func stores the ground truth functions in the binary.
(
/* id is an autoincrement counter to identify different functions. */
id INTEGER PRIMARY KEY AUTOINCREMENT,
/* name records the function name of a function according to the symbol table.
A function name may appear multiple times in the symbol table and so this
column should not have a UNIQUE constraint. */
name TEXT,
/* start and end specify the semi-open virtual address range spanned by a
function, i.e., [start, end). Functions may have overlaps.
*/
start INTEGER,
end INTEGER
);
CREATE TABLE func2insns
-- func2insns is a many-to-many relation between functions and instructions.
(
/* id is an autoincrement counter used only as a primary key. */
id INTEGER PRIMARY KEY AUTOINCREMENT,
/* fid is meant to be a foreign key to the func table id column. */
fid INTEGER,
/* insn is meant to be a foreign key to the insn table offset column. */
insn INTEGER
);
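The supplementary column needs a little care when read programmatically, because an empty string stands for the empty JSON object. The following Python sketch shows one way to decode it. It assumes that the "optional" field, when present, holds a JSON boolean, which the schema comments do not state explicitly. The function names are hypothetical and not part of our tools.

```python
import json
import sqlite3


def parse_supplementary(text):
    """Decode the supplementary column; an empty string or NULL means {}."""
    return json.loads(text) if text else {}


def load_instructions(conn):
    """Return a list of (offset, is_optional) pairs ordered by offset.

    Assumption: the "optional" field, when present, is a JSON boolean.
    """
    # "offset" is double-quoted defensively because OFFSET is an SQL keyword.
    rows = conn.execute(
        'SELECT insn."offset", insn.supplementary FROM insn ORDER BY insn."offset"'
    )
    return [
        (offset, bool(parse_supplementary(supp).get("optional", False)))
        for offset, supp in rows
    ]
```

With a real database, the connection would be opened with `sqlite3.connect("BIN.sqlite")` before calling `load_instructions`.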
To read BIN.sqlite, open a SQLite3 shell using the sqlite3 executable with the
following shell command:
sqlite3 -init /dev/null BIN.sqlite
The following queries show some examples of how to select data from the database using the SQLite3 shell.
- If you want to get all instruction offsets of the binary:
sqlite> SELECT * FROM insn;
- If you want to get all instruction offsets and supplementary data in the function main in the binary:
sqlite> SELECT insn.offset, insn.supplementary
   ...> FROM insn
   ...> JOIN func2insns ON insn.offset = func2insns.insn
   ...> JOIN func ON func2insns.fid = func.id
   ...> WHERE func.name = 'main';
- If you want to dump the entire database into a CSV file with all tables joined together and numbers formatted in hex:
sqlite> .output BIN.csv
sqlite> .headers on
sqlite> .mode csv
sqlite> SELECT
   ...> func.name AS "Function Name",
   ...> printf('0x%X', func.start) AS "Function Start",
   ...> printf('0x%X', func.end) AS "Function End",
   ...> printf('0x%X', insn.offset) AS "Instruction Offset",
   ...> insn.supplementary AS "Instruction Supplementary"
   ...> FROM insn
   ...> JOIN func2insns ON insn.offset = func2insns.insn
   ...> JOIN func ON func2insns.fid = func.id;
Note that the sqlite3 executable accepts SQL statements and SQLite3 meta-commands as arguments. Therefore, the above example can also be scripted into one shell command:
sqlite3 -init /dev/null BIN.sqlite \
".output BIN.csv" \
".headers on" \
".mode csv" \
"SELECT func.name AS 'Function Name', \
printf('0x%X', func.start) AS 'Function Start', \
printf('0x%X', func.end) AS 'Function End', \
printf('0x%X', insn.offset) AS 'Instruction Offset', \
insn.supplementary AS 'Instruction Supplementary' \
FROM insn JOIN func2insns ON insn.offset = func2insns.insn \
JOIN func ON func2insns.fid = func.id;"
IMPORTANT: At present, our ground truth includes ONLY:
- Functions and instructions emitted due to the source code of the binary
and does NOT include:
- Functions and instructions from statically-linked libraries
- Nop instructions inserted for alignment between two neighboring functions
- Other functions and instructions inserted by the compiler toolchain
When comparing a disassembly result with our ground truth, we recommend comparing only instructions that are in range of the functions recorded in our ground truth.
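The recommendation above can be sketched in Python as follows. This is an illustration only, not the comparison performed by disasm-eval-cmp: it treats a disassembly result as a plain set of instruction start offsets, ignores the "optional" flag in the supplementary column, and uses hypothetical function names.

```python
import bisect
import sqlite3


def load_function_ranges(conn):
    """Return sorted, merged [start, end) ranges of the ground truth functions."""
    # "start" and "end" are double-quoted defensively (END is an SQL keyword).
    rows = conn.execute(
        'SELECT func."start", func."end" FROM func ORDER BY func."start"'
    )
    merged = []
    for start, end in rows:
        if merged and start <= merged[-1][1]:
            # Functions may overlap, so merge overlapping or touching ranges.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def restrict_to_functions(offsets, ranges):
    """Keep only the offsets that fall inside one of the merged ranges."""
    starts = [start for start, _ in ranges]
    kept = set()
    for off in offsets:
        i = bisect.bisect_right(starts, off) - 1
        if i >= 0 and off < ranges[i][1]:
            kept.add(off)
    return kept


def compare(conn, disassembler_offsets):
    """Count matches and mismatches inside ground truth function ranges only."""
    ranges = load_function_ranges(conn)
    all_truth = {row[0] for row in conn.execute('SELECT insn."offset" FROM insn')}
    truth = restrict_to_functions(all_truth, ranges)
    reported = restrict_to_functions(disassembler_offsets, ranges)
    return {
        "true_positives": len(truth & reported),
        "false_positives": len(reported - truth),
        "false_negatives": len(truth - reported),
    }
```

Offsets reported outside every ground truth function, such as instructions in statically linked library code, are dropped before counting, so they are not reported as false positives.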
To compile the projects in our benchmark suite and generate the disassembly ground truth with your own machine, please follow the installation instructions in the following repositories:
- msvc-wine
- disasm-eval-sources
- disasm-gt-generator
Assuming you have successfully installed all these repositories, here are the steps to generate the disassembly ground truth on your own:
1. Read the README in disasm-eval-sources. For each project and configuration you want, call the disasm-eval-generate executable to compile the binary and generate an XZ archive containing the build directory of the binary.
2. Decompress the output XZ archive(s).
3. Read the README in disasm-gt-generator. Run the executables in the installed Docker image to generate the disassembly ground truth on the projects selected above.
4. The ground truth file(s) should now exist under the folder(s) you decompressed in step 2.
Speaking of the reproducibility of this project, a natural question is: by executing the commands we provide, can someone reproduce our evaluation results? The answer to this question is both yes and no:
- Yes: You will be able to regenerate the binaries, the disassembly ground truth, and the disassembly results, and then do a comparison on your own host using our tools. See Rerunning Disassembler Evaluations below.
- No: Your evaluation results may vary because the binaries can change if you build them under different environments. See Expected Differences below.
To rerun the evaluation, first follow the steps in Generating Disassembly Ground Truth to generate the binaries and the ground truth that we used in our publication. You will also need to follow the installation instructions in the following repositories:
- llvmmc-resolver
- disasm-eval-disasms
- disasm-eval-cmp
Assuming that you have generated all the ground truth in YOURPATH and
installed all the Docker images from the above repositories, here are the steps
to compare the generated ground truth with the disassembly results:
1. Read the README of disasm-eval-disasms. Run all disassemblers using their own Docker images on all the binaries in your dataset under YOURPATH. The disassembly results will be written to YOURPATH under each configuration folder.
2. Read the README of disasm-eval-cmp. Run disasm-eval-cmp using the installed Docker image on every project and configuration under YOURPATH in a loop; it prints the comparison results to stdout.
3. Run disasm-gt-chrct in disasm-eval-cmp on every project and configuration in a loop to collect the statistical characteristics of your generated ground truth and print the results to stdout.
Our FEAST 2020 publication uses a dataset that was generated in early July 2020. After publication, we regenerated the dataset using the above steps in late November 2020. In this reproduction test, we installed the repositories on a clean host and regenerated the ground truth for all the projects and configurations that we used in the old version with GCC 5.4.0 (Ubuntu 16.04), GCC 7.5.0 (Ubuntu 18.04), Clang 3.8.0 (Ubuntu 16.04), and Clang 6.0.0 (Ubuntu 18.04). We did not regenerate using MSVC and ICC in this test. We also did not rerun the disassemblers.
The following table categorizes and explains all the changes that we found in the binaries and the generated ground truth between the old and new datasets.
Binary Changed? | .text Changed? | .bss Changed? | Ground Truth Changed? | Instruction Relative Offsets within Function in Ground Truth Changed? | Binaries Involved | Explanations |
---|---|---|---|---|---|---|
F | F | F | F | F | All binaries that are not listed in the cells below | The old binaries and the new binaries are exactly the same and the ground truth generated are the same. |
T | F | T | F | F | lighttpd (all platforms); vim (Ubuntu 16.04); GCC, sshd, oggenc, bzip2, gzip, nginx, pcre2grep, sqlite3, vsftpd (Ubuntu 18.04) | For lighttpd and vim, the reasons for the differences are: (i) There are strings in .bss recording the build date and time. Since the old and the new binaries were built at different times, the strings are different. (ii) vim records compilation commands as strings in .bss. Since the build script of vim uses temp files with randomized names, these strings also change. For the other binaries under Ubuntu 18.04, the reason for the differences is that the default fill bytes of some large uninitialized static variables have changed. These cases are compiler independent and occurred with both GCC 7.5.0 and Clang 6.0.0 with the same patterns of change. |
T | T | T | F | F | vim (Ubuntu 18.04) | The changes in vim under Ubuntu 18.04 include all the conditions explained above. In addition, the ordering of some data variables in .bss has changed. For example, in the old and the new binaries built by GCC 7.5.0 using -O0, the two variables ui_post_balloon and pum_show_popupmenu have their locations in the binary swapped. As a result, the instructions in .text that reference these two variables also change because of the different addresses. However, since the instruction lengths and locations do not change, these changes do not affect our ground truth, which only records the starting offset of each instruction. |
T | T | T | T | F | cstool (all platforms) | Comparing the cstool binaries in the old and the new versions, the order of the functions has changed. For example, in the two versions of the cstool binaries compiled using GCC 5.4.0 with -O1, the orders of the first 17 functions in the ground truth are different. In the old version, function main is the 6th function, and it becomes the 12th in the new version. We have checked the assembly files of the two versions and found that the old and the new assembly files are identical. We have also found that although the function order has changed, the instructions in every function are the same if we only check their offsets relative to the start of their containing functions. |
T | T | T | T | T | exim (all platforms) | In exim binaries compiled at high optimization levels, we found an instruction in the old version that disappeared in the new version. According to our investigation, the instruction comes from an if statement whose condition depends on the __DATE__ constant (src/version.c:44). The condition is true only when the day in the compilation date is a single digit. As a result, at high optimization levels the compiler removes this instruction from the output assembly file if the day in the compilation date has two digits. |
According to the table above, in most cases the produced binaries may change on recompilation, but our ground truth still presents the same information at the function level: the offsets of instructions relative to the start of their containing functions are the same across the old and new datasets. The only exception is exim, which contains code that depends on the date of compilation. However, the differences in exim can be prevented by controlling the system time when compiling.
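To make the notion of "instruction relative offsets within function" concrete, the following Python sketch compares two ground truth databases by that criterion. It is not the procedure we used for the table above, and the function names are hypothetical. It assumes both databases follow the schema described in Using Our Disassembly Ground Truth, and it matches functions by name while allowing a name to appear more than once.

```python
import sqlite3
from collections import Counter, defaultdict

# Offset of each instruction relative to the start of its containing function.
# Keyword-like column names are double-quoted defensively.
QUERY = """
SELECT func.id, func.name, insn."offset" - func."start" AS rel
FROM insn
JOIN func2insns ON insn."offset" = func2insns.insn
JOIN func ON func2insns.fid = func.id
"""


def relative_layout(conn):
    """Return a multiset of (function name, sorted relative offsets)."""
    per_function = defaultdict(list)
    for fid, name, rel in conn.execute(QUERY):
        per_function[(fid, name)].append(rel)
    # A Counter is used because function names are not unique.
    return Counter(
        (name, tuple(sorted(rels))) for (_fid, name), rels in per_function.items()
    )


def same_relative_layout(old_conn, new_conn):
    """Return True if both databases agree on per-function relative offsets."""
    return relative_layout(old_conn) == relative_layout(new_conn)
```

Under these assumptions, a pair of databases whose functions were only reordered, as described for cstool, would compare equal, while a pair in which an instruction disappeared, as described for exim, would not.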