Bad ground truth optional instructions for x86-32 and x86-64 msvc binaries
aeflores opened this issue · 1 comments
Some of the ground truth provided in https://drive.google.com/file/d/1r7Xa1RY7DAhB58Xz6xSNVVZsM9EW8zJj/view?usp=sharing is wrong: jump tables are sometimes classified as instructions.
Example 1:
pangine-gt-data-20200701/x86-pc-linux-msvc-cl-19.26.28806/%2fO2/bin/7zip-19.00/7zDec.exe
The address range 0x405ce4 - 0x405d04 corresponds to a jump table.
However, the pangine ground truth information records the following:
@SzReadHeader2@28,0x405130,0x405D04,0x405CE2,"{""Optional"":true}"
@SzReadHeader2@28,0x405130,0x405D04,0x405CE4,"{""Optional"":true}"
...
@SzReadHeader2@28,0x405130,0x405D04,0x405D03,"{""Optional"":true}"
How do I know this is a jump table?
The binary contains the following snippet of code:
4053e9: cmp ESI,7
4053ec: ja 0x4056df
4053f2: jmp DWORD PTR [ESI*4+0x405ce4]
The indirect jump reads the address range 0x405ce4 - 0x405d04 as data.
These instructions are legit since they appear in the pangine ground truth data:
@SzReadHeader2@28,0x405130,0x405D04,0x4053E9,""
@SzReadHeader2@28,0x405130,0x405D04,0x4053EC,""
@SzReadHeader2@28,0x405130,0x405D04,0x4053F2,""
So the ground truth data is at the very least inconsistent.
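The extent of the table can be derived from the bounds check that guards the indirect jump. The following is a minimal sketch, assuming the usual `cmp reg, N` / `ja default` pattern (so valid indices are 0..N) and 4-byte entries; the function name `jump_table_range` is hypothetical and not part of pangine.

```python
def jump_table_range(base, max_index, entry_size=4):
    """Return the half-open byte range [start, end) of a bounds-checked jump table."""
    # `cmp reg, N` followed by `ja default` admits indices 0..N, i.e. N + 1 entries.
    count = max_index + 1
    return base, base + count * entry_size


# Example 1: cmp ESI,7 / jmp DWORD PTR [ESI*4+0x405ce4]
start, end = jump_table_range(0x405CE4, 7)
print(hex(start), hex(end))  # 0x405ce4 0x405d04
```

Under these assumptions, any ground truth offset with `start <= offset < end` lies inside the table and should not be reported as an instruction.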
Example 2:
pangine-gt-data-20200701/x86-pc-linux-msvc-cl-19.26.28806/%2fO2/bin/mit-bzip2/bzip2.exe
The address range 0x409030 - 0x40903c is a jump table.
However, the ground truth classifies it as instructions:
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409030,"{""Optional"":true}"
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409031,"{""Optional"":true}"
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409034,"{""Optional"":true}"
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409039,"{""Optional"":true}"
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x40903C,"{""Optional"":true}"
As in the previous case, I know that this is a jump table because the binary contains the following snippet of code:
408ea8: cmp EAX,3
408eab: ja 408fa3
408eb1: jmp DWORD PTR [EAX*4+409030]
which accesses that address range as data. These instructions are also legit, and they appear in the ground truth data:
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x408EA8,""
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x408EAB,""
_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x408EB1,""
Example 3 (x64):
This happens for x64 too.
pangine-gt-data-20200701/x86_64-pc-linux-msvc-cl-19.26.28806/%2fO2/bin/mit-bzip2/bzip2.exe
It has a jump table starting at address 0x14000d654, but pangine includes that address as code:
BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000D654,"{""Optional"":true}"
I know this is the beginning of a jump table because of the following snippet:
14000b128: lea RDX,QWORD PTR [__ImageBase]
14000b12f: cdqe
14000b131: mov ECX,DWORD PTR [RDX+RAX*4+(IMAGEREL $L_14000d654)]
14000b138: add RCX,RDX
14000b13b: jmp RCX
Or, as shown by Ghidra:
14000b128 LEA RDX,[IMAGE_DOS_HEADER_140000000]
14000b12f CDQE
14000b131 MOV ECX,dword ptr [RDX + RAX*0x4 + offset DAT_14000d654]
14000b138 ADD RCX,RDX
14000b13b JMP RCX
These instructions are present in the pangine ground truth:
BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000B128,""
BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000B12F,""
BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000B131,""
BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000B138,""
BZ2_decompress,0x14000AEF0,0x14000D6F8,0x14000B13B,""
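In the x64 snippet, each table entry is a 32-bit image-relative offset that is added to the image base before the jump. A rough sketch of how such a table could be decoded is shown here; the helper name `decode_rva_table` and the sample table bytes are hypothetical and are not taken from bzip2.exe. The image base 0x140000000 is the one shown in the Ghidra listing.

```python
import struct

IMAGE_BASE = 0x140000000  # base address shown as IMAGE_DOS_HEADER_140000000


def decode_rva_table(raw, count, image_base=IMAGE_BASE):
    """Decode `count` little-endian 32-bit RVAs and rebase them to absolute addresses."""
    rvas = struct.unpack_from(f"<{count}I", raw)
    # Mirrors `mov ECX,[RDX+RAX*4+table]` followed by `add RCX,RDX`.
    return [image_base + rva for rva in rvas]


# Hypothetical table contents, for illustration only.
raw = struct.pack("<2I", 0xB140, 0xB1A0)
targets = decode_rva_table(raw, 2)
print([hex(t) for t in targets])
```

If the bytes at 0x14000d654 decode to addresses inside `BZ2_decompress`, that is further evidence that the range holds data rather than code.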
Thank you so much for helping verify the quality of the ground truth for us!
The optional instructions are caused by aggressive instruction scanning. My parser may mistakenly treat a control-flow instruction as a non-control-flow instruction, so it keeps scanning at the next offset and considers whatever it finds there to be an instruction.
A known example of this bug in the published ground truth is that it does not correctly classify the "hlt" and "int" instructions. To fix this bug and to add more features, a new version of the ground truth will be published soon.
For tests based on the current ground truth, I suggest excluding the instruction offsets marked "Optional" for now (on both the ground truth side and the disassembly result side), as these results should not be trusted.
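One way to apply this suggestion is sketched below. It assumes the row layout seen in the examples above (function name, function start, function end, instruction offset, JSON attributes); the helper names `load_offsets` and `compare` are hypothetical and not part of any pangine tooling.

```python
import csv
import io
import json


def load_offsets(lines):
    """Split ground truth offsets into (required, optional) sets."""
    required, optional = set(), set()
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        offset = int(row[3], 16)
        attrs = json.loads(row[4]) if len(row) > 4 and row[4] else {}
        (optional if attrs.get("Optional") else required).add(offset)
    return required, optional


def compare(required, optional, reported):
    """Compare a disassembler's offsets to the ground truth, ignoring optional ones."""
    reported = set(reported) - optional  # drop optional offsets on the tool side too
    return {"missed": required - reported, "extra": reported - required}


sample = io.StringIO(
    '_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x409030,"{""Optional"":true}"\n'
    '_BZ2_bzWriteClose64@28,0x408DB0,0x409040,0x408EA8,""\n'
)
required, optional = load_offsets(sample)
result = compare(required, optional, {0x408EA8, 0x409030})
```

With this filtering, a disassembler is neither rewarded nor penalized for its decision at an optional offset such as 0x409030.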