m4b/cargo-sym

ARM/Thumb disassembly is wrong

japaric opened this issue · 5 comments

Objdump

$ arm-none-eabi-objdump -Cd target/thumbv7m-none-eabi/debug/app

target/thumbv7m-none-eabi/debug/app:     file format elf32-littlearm


Disassembly of section .text:

00000000 <_reset-0x8>:
   0:   20010000        .word   0x20010000
   4:   00000009        .word   0x00000009

00000008 <_reset>:
   8:   b083            sub     sp, #12
   a:   e7ff            b.n     c <_reset+0x4>
   c:   202a            movs    r0, #42 ; 0x2a
   e:   9001            str     r0, [sp, #4]
  10:   9002            str     r0, [sp, #8]
  12:   e7ff            b.n     14 <_reset+0xc>
  14:   e7fe            b.n     14 <_reset+0xc>

cargo-sym

$ cargo sym -Cd target/thumbv7m-none-eabi/debug/app
Disassembly of section .text

0000000000000009 _reset:
10009:   b0 ff e7 2a                                     bhs #0xffa0fed1
1000d:   20 01 90 02                                     addseq r0, r0, #8
10011:   90 ff e7 fe                                     mcr2 p15, #7, pc, c7, c0, #4

There are some funny things going on here:

  • 0000000000000009 has too many zeroes for a 32-bit hexadecimal. ARMv7-M is a 32-bit architecture.
  • 0x9 is the "THUMB address" of _reset (bit 0 is set to 1), but the disassembly should start at 0x8. It seems that cargo-sym starts disassembling at address 0x9, one byte too late (compare the instruction bytes: objdump reads 83 b0 ff e7 2a ..., cargo-sym reads b0 ff e7 2a ...).
  • The disassembly should show THUMB instructions, which here are all 16-bit. Judging by the mnemonics (bhs, addseq, mcr2), cargo-sym is decoding the bytes as 32-bit ARM instructions instead.
  • In the cargo sym output, the address 10009 appears right below _reset. The address seems to be off by 0x10000.

I will post the binary used in this report in a bit.


Check this gist. The files with extension .thumbv* are ARM binaries. The sources of these binaries are in this repo

m4b commented

fix

fixed some formatting issues, in addition to checking whether the thumb bit is set and disassembling in thumb mode if so.

now generates:

target/debug/cargo-sym sym -Cd 01-qemu.thumbv7m-none-eabi 
Disassembly of section .text

00000008 <_reset>
       8:  b083             sub sp, #0xc
       a:  e7ff             b #0xc
       c:  202a             movs r0, #0x2a
       e:  9001             str r0, [sp, #4]
      10:  9002             str r0, [sp, #8]
      12:  e7ff             b #0x14
      14:  e7fe             b #0x14

and:

target/debug/cargo-sym sym -Cd 04-led.thumbv7em-none-eabihf 
Disassembly of section .text

08000008 <_EXCEPTIONS>
 8000008:  080000ed         stmdaeq r0, {r0, r2, r3, r5, r6, r7}
 800000c:  080000ed         stmdaeq r0, {r0, r2, r3, r5, r6, r7}
 8000010:  080000ed         stmdaeq r0, {r0, r2, r3, r5, r6, r7}
 8000014:  080000ed         stmdaeq r0, {r0, r2, r3, r5, r6, r7}
 8000018:  080000ed         stmdaeq r0, {r0, r2, r3, r5, r6, r7}
 800001c:  00000000         andeq r0, r0, r0
 8000020:  00000000         andeq r0, r0, r0
 8000024:  00000000         andeq r0, r0, r0
 8000028:  00000000         andeq r0, r0, r0
 800002c:  080000ed         stmdaeq r0, {r0, r2, r3, r5, r6, r7}
 8000030:  00000000         andeq r0, r0, r0
 8000034:  00000000         andeq r0, r0, r0
 8000038:  080000ed         stmdaeq r0, {r0, r2, r3, r5, r6, r7}
 800003c:  080000ed         stmdaeq r0, {r0, r2, r3, r5, r6, r7}

08000040 <_reset>
 8000040:  b580             push {r7, lr}
 8000042:  466f             mov r7, sp
 8000044:  b082             sub sp, #8
 8000046:  e7ff             b #0x8000048
 8000048:  f80cf000         bl #0x8000064
 800004c:  e7ff             b #0x800004e
 800004e:  f817f000         bl #0x8000080
 8000052:  e7ff             b #0x8000054
 8000054:  f82bf000         bl #0x80000ae
 8000058:  e7ff             b #0x800005a
 800005a:  f837f000         bl #0x80000cc
 800005e:  e7ff             b #0x8000060
 8000060:  e7ff             b #0x8000062
 8000062:  e7fe             b #0x8000062

08000064 <app::power_on_gpioe::h78af44f23a22f67e>
 8000064:  b082             sub sp, #8
 8000066:  e7ff             b #0x8000068
 8000068:  e7ff             b #0x800006a
 800006a:  0014f241         movw r0, #0x1014
 800006e:  0002f2c4         movt r0, #0x4002
 8000072:  9000             str r0, [sp]
 8000074:  6801             ldr r1, [r0]
 8000076:  1100f441         orr r1, r1, #0x200000
 800007a:  6001             str r1, [r0]
 800007c:  b002             add sp, #8
 800007e:  4770             bx lr

08000080 <app::put_pe9_in_output_mode::h6338568d6b3f648b>
 8000080:  b083             sub sp, #0xc
 8000082:  e7ff             b #0x8000084
 8000084:  e7ff             b #0x8000086
 8000086:  0000f241         movw r0, #0x1000
 800008a:  0000f6c4         movt r0, #0x4800
 800008e:  9002             str r0, [sp, #8]
 8000090:  6800             ldr r0, [r0]
 8000092:  9001             str r0, [sp, #4]
 8000094:  e7ff             b #0x8000096
 8000096:  9801             ldr r0, [sp, #4]
 8000098:  2140f420         bic r1, r0, #0xc0000
 800009c:  9100             str r1, [sp]
 800009e:  e7ff             b #0x80000a0
 80000a0:  9802             ldr r0, [sp, #8]
 80000a2:  9900             ldr r1, [sp]
 80000a4:  2280f441         orr r2, r1, #0x40000
 80000a8:  6002             str r2, [r0]
 80000aa:  b003             add sp, #0xc
 80000ac:  4770             bx lr

080000ae <app::set_pe9_high::h14fcedcfc4b06dbb>
 80000ae:  b082             sub sp, #8
 80000b0:  e7ff             b #0x80000b2
 80000b2:  e7ff             b #0x80000b4
 80000b4:  0018f241         movw r0, #0x1018
 80000b8:  0000f6c4         movt r0, #0x4800
 80000bc:  9000             str r0, [sp]
 80000be:  e7ff             b #0x80000c0
 80000c0:  9800             ldr r0, [sp]
 80000c2:  7100f44f         mov.w r1, #0x200
 80000c6:  6001             str r1, [r0]
 80000c8:  b002             add sp, #8
 80000ca:  4770             bx lr

080000cc <app::set_pe9_low::h5d9c159fa5571658>
 80000cc:  b082             sub sp, #8
 80000ce:  e7ff             b #0x80000d0
 80000d0:  e7ff             b #0x80000d2
 80000d2:  0018f241         movw r0, #0x1018
 80000d6:  0000f6c4         movt r0, #0x4800
 80000da:  9000             str r0, [sp]
 80000dc:  e7ff             b #0x80000de
 80000de:  e7ff             b #0x80000e0
 80000e0:  9800             ldr r0, [sp]
 80000e2:  7100f04f         mov.w r1, #0x2000000
 80000e6:  6001             str r1, [r0]
 80000e8:  b002             add sp, #8
 80000ea:  4770             bx lr

080000ec <app::exception::handler::hac6f2ae6b7dd2702>
 80000ec:  b083             sub sp, #0xc
 80000ee:  e7ff             b #0x80000f0
 80000f0:  be00             bkpt #0
 80000f2:  e7ff             b #0x80000f4
 80000f4:  e7fe             b #0x80000f4

I also added a --dump flag to output the debug format of the binary file it read, which will be nice(r) for bug reports.

Thanks for the detailed bug report(s)!

discussion

The thumb bit flag blew my mind. For anyone reading this, the problem was that I was disassembling using the offset given by the symbol's st_value field. In the above example, this is (correctly/incorrectly) 9 (i.e., if you inspect the raw st_value given by the ELF binary, it is 9, not 8). But because all arm assembly instructions are even, the odd bit in an instruction address was able to be repurposed to signify to the processor (or disassembler in our case) that the instruction is a thumb instruction, and not a regular arm32 instruction. At least, that's the gist of what I understood.

Consequently, one must essentially check whether the address is odd, and if so, switch to thumb disassembly mode and subtract 1 from both the offset and the virtual memory address to disassemble at the right location and get the correct instruction display. crazy! :)
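
For anyone who wants to see the check spelled out, here is a minimal sketch in Rust. It assumes the raw st_value is available as a u64; the Mode enum and the resolve_symbol function are hypothetical names made up for this illustration and are not cargo-sym's actual code.

```rust
/// Instruction set to disassemble with (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode {
    Arm,
    Thumb,
}

/// Split the raw ELF `st_value` of an ARM function symbol into the real
/// address of its first instruction and the mode to decode it in.
fn resolve_symbol(st_value: u64) -> (u64, Mode) {
    if st_value & 1 == 1 {
        // Bit 0 set: Thumb code. Clear the bit to get the real address.
        (st_value & !1, Mode::Thumb)
    } else {
        // Bit 0 clear: regular ARM code, address is used as-is.
        (st_value, Mode::Arm)
    }
}
```

With the _reset symbol from the first example, resolve_symbol(0x9) gives (0x8, Mode::Thumb). The same 1 would also have to be subtracted from the file offset derived from that address.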

misc

Do you like the <> around symbol names? I copied objdump because i could, but I don't know if i like it.

Also it currently incorrectly displays (i think) 4-byte instructions like:

80000e2:  7100f04f         mov.w r1, #0x2000000

which should be rendered:

80000e2:  f04f 7100         mov.w r1, #0x2000000

for whatever reason?
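
A likely explanation: a 32-bit Thumb encoding is stored as two 16-bit halfwords, each little-endian, so reading the four bytes as one little-endian 32-bit word swaps the two halves. Below is a rough Rust sketch of rendering the bytes halfword by halfword. The function names is_thumb32 and render_thumb_bytes are hypothetical, and the sketch assumes a little-endian target and a slice holding exactly one instruction.

```rust
/// True when `first` (the first halfword of a Thumb instruction) starts a
/// 32-bit encoding: bits [15:11] are 0b11101, 0b11110 or 0b11111.
fn is_thumb32(first: u16) -> bool {
    (first >> 11) >= 0b11101
}

/// Render the bytes of one Thumb instruction as halfwords in memory order.
/// Expects 2 bytes (16-bit encoding) or 4 bytes (32-bit encoding).
fn render_thumb_bytes(bytes: &[u8]) -> String {
    bytes
        .chunks_exact(2)
        // Each halfword is decoded as little-endian on its own.
        .map(|pair| format!("{:04x}", u16::from_le_bytes([pair[0], pair[1]])))
        .collect::<Vec<_>>()
        .join(" ")
}
```

For the mov.w above, the bytes in memory are 4f f0 00 71, and render_thumb_bytes(&[0x4f, 0xf0, 0x00, 0x71]) yields "f04f 7100".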

because all arm assembly instructions are even

To be pedantic: pointers to code are actually 4-byte aligned. (See Section 4.1 of the AAPCS). And yes, bit 0 is used to indicate "thumb mode" (that is, the subroutine contains thumb instructions) when it's set to 1.

This output:

08000008 <_EXCEPTIONS>
 8000008:  080000ed         stmdaeq r0, {r0, r2, r3, r5, r6, r7}
 800000c:  080000ed         stmdaeq r0, {r0, r2, r3, r5, r6, r7}

cargo-sym shouldn't show instructions in this case, because this _EXCEPTIONS symbol is actually a static variable (an array of function pointers: [fn(); 14]). It was originally in the .rodata section (I used a linker script to move it to .text), so it's "data", not "code". I don't know how ELF represents this ... maybe the section is marked as not executable in the ELF?

Do you like the <> around symbol names?

Actually, sometimes I use it as an "anchor" when I'm viewing the disassembly with less. I search for >: and it takes me to the next symbol.

m4b commented

To be pedantic: pointers to code are actually 4-byte aligned. (See Section 4.1 of the AAPCS).

To be pedantic back, the pointers are all still even :P

Yea the _EXCEPTIONS is easily fixed; those symbols are usually tagged <LOCAL|GLOBAL> OBJECT. So in the printing routine if it's not tagged as a function, i'll print it as data. (e.g., offset: 4|8 byte chunks of 4 ). Unfortunately if they're not tagged as OBJECT, no way I can know they're data or code without more heavy weight analysis. (blame Von Neumann for this awful state of affairs in binary program analysis ;))
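
Roughly, the tag check could look like the Rust sketch below. The symbol type lives in the low 4 bits of st_info, and STT_FUNC is 2 in the ELF specification (STT_OBJECT is 1). The Render enum and the render_kind function are hypothetical names for this illustration, not the actual printing routine.

```rust
/// ELF symbol type for functions (low 4 bits of `st_info`).
const STT_FUNC: u8 = 2;

/// How a symbol's bytes should be printed (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Render {
    Code,
    Data,
}

/// Extract the symbol type from `st_info`; the high 4 bits hold the binding
/// (LOCAL, GLOBAL, ...), the low 4 bits hold the type.
fn symbol_type(st_info: u8) -> u8 {
    st_info & 0xf
}

/// Disassemble only symbols tagged as functions; print everything else
/// (OBJECT, NOTYPE, ...) as data.
fn render_kind(st_info: u8) -> Render {
    if symbol_type(st_info) == STT_FUNC {
        Render::Code
    } else {
        Render::Data
    }
}
```

A GLOBAL OBJECT symbol has st_info 0x11 and would be printed as data, while a GLOBAL FUNC symbol (0x12) would be disassembled.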

And I'll keep the > then.

m4b commented

@japaric this is fixed in latest git version. will publish a crate version asap, need to publish goblin and fixup the capstone-rs situation, since the PR isn't being merged which it requires :/ may need to publish another crate.

But anyway, the arm printer for objects should be working:

target : "04-led.thumbv7em-none-eabihf"
Disassembly of section .text

08000008 <_EXCEPTIONS>:
 8000008: 080000ed 080000ed 00000000 00000000 ...í...í........
 8000018: 080000ed 00000000 00000000 00000000 ...í............
 8000028: 00000000 080000ed 080000ed 080000ed .......í...í...í
 8000038: 080000ed 080000ed                   ...í...í

i'm sure it has some minor bugs with edge cases. but that code was horrible and i don't feel like messing with it. someone else can work on it if they're bored and feel like writing column-based printer code :P i'm sure they'll do a much better job than me
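
For anyone tempted, a bare-bones starting point might look like this Rust sketch. It only handles the address and word columns (no ASCII column), assumes little-endian 32-bit words, and zero-pads a short trailing word. The name dump_words is hypothetical and this is not the code currently in the repo.

```rust
/// Format `data` starting at `addr` as rows of up to four 32-bit
/// little-endian words, one String per row.
fn dump_words(addr: u32, data: &[u8]) -> Vec<String> {
    data.chunks(16)
        .enumerate()
        .map(|(row, chunk)| {
            let words: Vec<String> = chunk
                .chunks(4)
                .map(|w| {
                    // Pad a short trailing chunk with zeros.
                    let mut buf = [0u8; 4];
                    buf[..w.len()].copy_from_slice(w);
                    format!("{:08x}", u32::from_le_bytes(buf))
                })
                .collect();
            // Address is right-aligned in 8 columns, like the output above.
            format!("{:8x}: {}", addr + (row as u32) * 16, words.join(" "))
        })
        .collect()
}
```

Feeding it the first two entries of _EXCEPTIONS (bytes ed 00 00 08 twice) at address 0x8000008 produces the single row " 8000008: 080000ed 080000ed".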

There will be other bugs, but with the new target api, I think you can be lazier than ever!

Let me know how it goes :)