pangine/disasm-benchmark

Bad ground truth optional instructions for x86-32 and x86-64 msvc binaries

aeflores opened this issue · 1 comment

Some of the ground truth provided in https://drive.google.com/file/d/1r7Xa1RY7DAhB58Xz6xSNVVZsM9EW8zJj/view?usp=sharing is wrong: jump tables are sometimes classified as instructions.

Example 1:

pangine-gt-data-20200701/x86-pc-linux-msvc-cl-19.26.28806/%2fO2/bin/7zip-19.00/7zDec.exe

The address range 0x405ce4 - 0x405d04 corresponds to a jump table.

However, the pangine ground truth information records the following:

@SzReadHeader2@28,0x405130,0x405D04,0x405CE2,"{""Optional"":true}"
@SzReadHeader2@28,0x405130,0x405D04,0x405CE4,"{""Optional"":true}"
...
@SzReadHeader2@28,0x405130,0x405D04,0x405D03,"{""Optional"":true}"

How do I know this is a jump table?
The binary contains the following snippet of code:

          4053e9:   cmp ESI,7
          4053ec:   ja 0x4056df
          4053f2:   jmp DWORD PTR [ESI*4+0x405ce4]

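The indirect jump reads the address range 0x405ce4 - 0x405d04 as data: the cmp/ja guard limits ESI to 0 through 7, so the table holds 8 four-byte entries (0x20 bytes).
These instructions are legitimate, since they appear in the pangine ground truth data: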

@SzReadHeader2@28,0x405130,0x405D04,0x4053E9,""
@SzReadHeader2@28,0x405130,0x405D04,0x4053EC,""
@SzReadHeader2@28,0x405130,0x405D04,0x4053F2,""

So the ground truth data is at the very least inconsistent.
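
The same check can be scripted. The following is a minimal sketch, assuming the ground-truth rows are CSV with the columns function name, function start, function end, instruction offset, and JSON metadata (this layout is inferred from the rows quoted above, not from pangine documentation), and that the table holds one four-byte entry per index admitted by the cmp/ja guard. The helper names are hypothetical.

```python
import csv
import io

def jump_table_range(base, max_index, entry_size=4):
    """Half-open byte range [start, end) covered by a jump table."""
    return base, base + (max_index + 1) * entry_size

def offsets_in_range(gt_csv_text, start, end):
    """Return instruction offsets from ground-truth rows that fall in [start, end)."""
    hits = []
    for row in csv.reader(io.StringIO(gt_csv_text)):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        offset = int(row[3], 16)  # assumed column: instruction offset
        if start <= offset < end:
            hits.append(offset)
    return hits

# Rows quoted in this issue (7zDec.exe, x86-32).
ROWS = '''@SzReadHeader2@28,0x405130,0x405D04,0x4053F2,""
@SzReadHeader2@28,0x405130,0x405D04,0x405CE2,"{""Optional"":true}"
@SzReadHeader2@28,0x405130,0x405D04,0x405CE4,"{""Optional"":true}"
@SzReadHeader2@28,0x405130,0x405D04,0x405D03,"{""Optional"":true}"
'''

# Base 0x405ce4 comes from the jmp operand, max index 7 from "cmp ESI,7".
start, end = jump_table_range(0x405CE4, max_index=7)
conflicts = offsets_in_range(ROWS, start, end)
print(hex(start), hex(end), [hex(o) for o in conflicts])
```

Any offset reported in conflicts is a ground-truth "instruction" that lies inside bytes the program reads as a table of jump targets.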

Example 2:

pangine-gt-data-20200701/x86-pc-linux-msvc-cl-19.26.28806/%2fO2/bin/mit-bzip2/bzip2.exe

The address range 0x409030 - 0x40903c is a jump table.
However, the ground truth classifies it as instructions:

_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409030,"{""Optional"":true}"
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409031,"{""Optional"":true}"
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409034,"{""Optional"":true}"
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409039,"{""Optional"":true}"
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x40903C,"{""Optional"":true}"

As in the previous case, I know that this is a jump table because the binary contains the following snippet of code:

          408ea8:   cmp EAX,3
          408eab:   ja 408fa3
          408eb1:   jmp DWORD PTR [EAX*4+409030]

which accesses that address range as data. These instructions are also legitimate, and they appear in the ground truth data:

_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x408EA8,""
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x408EAB,""
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x408EB1,""

Example 3 (x64):

This happens for x64 too.

pangine-gt-data-20200701/x86_64-pc-linux-msvc-cl-19.26.28806/%2fO2/bin/mit-bzip2/bzip2.exe

This binary has a jump table starting at address 0x14000d654, but pangine includes that address as code:

BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000D654,"{""Optional"":true}"

I know this is the beginning of a jump table because of the following snippet:

          14000b128:   lea RDX,QWORD PTR [__ImageBase]
          14000b12f:   cdqe 
          14000b131:   mov ECX,DWORD PTR [RDX+RAX*4+(IMAGEREL $L_14000d654)]
          14000b138:   add RCX,RDX
          14000b13b:   jmp RCX

or, as shown by Ghidra:

       14000b128                 LEA        RDX,[IMAGE_DOS_HEADER_140000000]
       14000b12f                 CDQE
       14000b131                 MOV        ECX,dword ptr [RDX + RAX*0x4 + offset DAT_14000d654]
       14000b138                 ADD        RCX,RDX
       14000b13b                 JMP        RCX

These instructions are present in the pangine ground truth:

BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000B128,""
BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000B12F,""
BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000B131,""
BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000B138,""
BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000B13B,""
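
For reference, the x64 sequence above computes each target as __ImageBase plus a 32-bit entry loaded from the table, so the table holds image-relative offsets rather than absolute addresses. The following is a minimal sketch of decoding such a table, assuming little-endian unsigned 4-byte entries (the mov into ECX zero-extends). The image base is taken from the Ghidra listing; the entry bytes are made up for illustration and were not read from bzip2.exe, and the function name is hypothetical.

```python
import struct

IMAGE_BASE = 0x140000000  # from the Ghidra listing (IMAGE_DOS_HEADER_140000000)

def decode_rva_table(raw, image_base=IMAGE_BASE):
    """Decode 4-byte little-endian image-relative entries into absolute targets."""
    count = len(raw) // 4
    rvas = struct.unpack("<%dI" % count, raw[:count * 4])
    # Mirrors "add RCX,RDX": target = image base + table entry.
    return [image_base + rva for rva in rvas]

# Hypothetical table contents, for illustration only.
sample = struct.pack("<3I", 0xB140, 0xB1A0, 0xB200)
targets = decode_rva_table(sample)
print([hex(t) for t in targets])
```

If the bytes at 0x14000d654 decode this way into addresses inside BZ2_decompress that are also listed as non-optional instructions, that further supports reading them as data.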

Thank you so much for helping verify the quality of the ground truth for us!

The optional instructions are caused by aggressive instruction scanning. My parser may mistakenly treat a control-flow instruction as a non-control-flow instruction, so it keeps scanning at the next offset and records it as an instruction.

A known example of this bug in the published ground truth version is that it does not correctly classify the "hlt" and the "int" instructions. To fix this bug and to add more features, there will be a new version of ground truth published soon.
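
The effect can be illustrated with a toy model. This is only a sketch of the failure mode described above, not the actual parser: the instruction stream is made up, the "db" pseudo-mnemonic stands in for bytes that are really data, and treating every "int" as ending a block is an assumption (whether it does depends on the interrupt).

```python
# Each tuple is (offset, mnemonic, size). A made-up stream, not taken from any binary.
STREAM = [
    (0x1000, "mov", 2),
    (0x1002, "hlt", 1),
    (0x1003, "db", 4),  # bytes that are really data
]

def scan(stream, terminators):
    """Linear scan that stops after the first instruction that cannot fall through."""
    accepted = []
    for offset, mnemonic, _size in stream:
        accepted.append(offset)
        if mnemonic in terminators:
            break  # nothing after a terminator is reachable by fall-through
    return accepted

# Without "hlt" in the terminator set, the scan runs on into the data bytes.
buggy = scan(STREAM, terminators={"jmp", "ret"})
fixed = scan(STREAM, terminators={"jmp", "ret", "hlt", "int"})
print([hex(o) for o in buggy], [hex(o) for o in fixed])
```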

For tests based on the current ground truth, I suggest excluding the instruction offsets marked "optional" for now (on both the ground truth side and the disassembly result side), as these results should not be trusted.
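
A minimal sketch of that filtering, assuming the CSV layout seen in the rows quoted in this issue (function name, function start, function end, instruction offset, JSON metadata) and a disassembler result given as a set of instruction offsets. The function names and the sample result set are hypothetical.

```python
import csv
import io
import json

def split_offsets(gt_csv_text):
    """Return (trusted, optional) sets of instruction offsets from ground-truth rows."""
    trusted, optional = set(), set()
    for row in csv.reader(io.StringIO(gt_csv_text)):
        if len(row) < 5:
            continue  # skip blank or malformed rows
        offset = int(row[3], 16)
        meta = json.loads(row[4]) if row[4] else {}
        (optional if meta.get("Optional") else trusted).add(offset)
    return trusted, optional

def compare(gt_csv_text, disassembled):
    """Compare disassembler offsets with the ground truth, ignoring optional offsets on both sides."""
    trusted, optional = split_offsets(gt_csv_text)
    result = set(disassembled) - optional
    return {"missed": trusted - result, "extra": result - trusted}

# Rows quoted in this issue (bzip2.exe, x86-32).
ROWS = '''_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x408EA8,""
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x408EAB,""
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x408EB1,""
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409030,"{""Optional"":true}"
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409031,"{""Optional"":true}"
'''

# Hypothetical disassembler output that also decoded the start of the jump table.
report = compare(ROWS, {0x408EA8, 0x408EAB, 0x408EB1, 0x409030})
print(report)
```

With the optional offsets dropped from both sides, a tool is neither rewarded nor penalized for how it handles the untrusted ranges.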