Infinite backtrace, odd backtrace behavior and "corrupted stack?" in GDB

Question

japaric opened this issue 7 years ago · 25 comments

Steps to reproduce (STR)

$ cargo generate --git ~/rust-embedded/cortex-m-quickstart --name app && cd app

$ cargo add panic-abort

$ # modify examples/panic.rs to use panic-abort

$ # change memory.x to match the memory layout of the STM32F3DISCOVERY

$ # uncomment thumbv7em-none-eabi target and gdb runner in .cargo/config 

$ cargo run --example panic
(gdb) backtrace  # "corrupt stack?"
#0  DefaultPreInit ()
    at /home/japaric/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rt-0.6.5/src/lib.rs:556
#1  0x08000412 in Reset ()
    at /home/japaric/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rt-0.6.5/src/lib.rs:496
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) continue
Breakpoint 4, main () at examples/panic.rs:29
29          panic!("Oops")

(gdb) backtrace  # where did Reset go?
#0  main () at examples/panic.rs:29

(gdb) continue
Breakpoint 3, rust_begin_unwind (_info=0x10001fb4) at (..)
31          unsafe { intrinsics::abort() }

(gdb) backtrace  # infinite backtrace
#0  rust_begin_unwind (_info=0x10001fb4) at (..)
#1  0x080006b0 in core::panicking::panic_fmt () at libcore/panicking.rs:77
#2  0x080006b0 in core::panicking::panic_fmt () at libcore/panicking.rs:77
#3  0x080006b0 in core::panicking::panic_fmt () at libcore/panicking.rs:77
(..)

Metadata

$ rustc -V
rustc 1.31.0-nightly (f99911a4a 2018-10-23)

$ grep cortex-m-rt Cargo.lock
 "cortex-m-rt 0.6.5 (registry+https://github.com/rust-lang/crates.io-index)"

$ arm-none-eabi-gdb --version
GNU gdb (GDB) 8.2

Has anyone experienced these issues? I've seen them with previous releases of cortex-m-rt so I don't think they are related to the recent changes to HardFault. Also, in the above logs all backtrace invocations are before reaching HardFault.

Answer 1 · 2018-10-26T19:29:44.000Z

I did see similar issues when I was running through "Chapter 6 - Hello, world!" of the Discovery book. I took a quick look at the time and seem to remember that the panic-related functions were modifying important registers like LR without first saving them on the stack. I didn't really dig into it since I am a Rust noob and I figured that was the expected behavior for Rust's panic handling code. If you want, I can investigate this. I don't think my HardFault change has anything to do with this, but I can make sure that is the case. I think the two are probably a similar type of issue.

Answer 2 · 2018-10-26T22:38:06.000Z

You are correct. core::panicking::panic modifies LR without first pushing it onto the stack.

$ cd $(rustc --print sysroot)

$ arm-none-eabi-objdump -Cd lib/rustlib/thumbv7m-none-eabi/lib/libcore-*.rlib
(..)
00000000 <core::panicking::panic>:
   0:   b08c            sub     sp, #48 ; 0x30
   2:   e890 1006       ldmia.w r0, {r1, r2, ip}
   6:   e9d0 3e03       ldrd    r3, lr, [r0, #12]
   a:   6940            ldr     r0, [r0, #20]
   c:   e9cd 1206       strd    r1, r2, [sp, #24]
  10:   a906            add     r1, sp, #24
  12:   f240 0200       movw    r2, #0
  16:   9100            str     r1, [sp, #0]
  18:   2101            movs    r1, #1
  1a:   f2c0 0200       movt    r2, #0
  1e:   9101            str     r1, [sp, #4]
  20:   2100            movs    r1, #0
  22:   e9cd 1102       strd    r1, r1, [sp, #8]
  26:   e9cd 2104       strd    r2, r1, [sp, #16]
  2a:   a908            add     r1, sp, #32
  2c:   e9cd c308       strd    ip, r3, [sp, #32]
  30:   e9cd e00a       strd    lr, r0, [sp, #40]       ; 0x28
  34:   4668            mov     r0, sp
  36:   f7ff fffe       bl      0 <core::panicking::panic>
  3a:   defe            udf     #254    ; 0xfe

It seems that all divergent functions (fn(..) -> !) are given the noreturn attribute in LLVM IR, and this gives LLVM permission to clobber the LR register when optimizing the code. See example below:

#![feature(asm)]
#![no_main]
#![no_std]

extern crate panic_abort;

use cortex_m_rt::entry;

#[entry]
fn main() -> ! {
    foo();
    bar();
    baz()
}

#[inline(never)]
fn foo() {
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) :: "volatile");
    }
}

#[inline(never)]
fn bar() {
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) :: "volatile");
    }
}

#[inline(never)]
fn baz() -> ! {
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) :: "volatile");
    }

    quux()
}

#[inline(never)]
fn quux() -> ! {
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) "r"(6) :: "volatile");
    }

    loop {}
}

$ cargo rustc --example asm --release -- --emit=llvm-ir

$ cat $(find -name '*.ll')
; asm::foo
; Function Attrs: noinline nounwind
define internal fastcc void @_ZN3asm3foo17h42aa59da8b4484e3E() unnamed_addr #0 {
start:
  tail call void asm sideeffect "", "r,r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5) #3, !srcloc !513
  ret void
}

; asm::bar
; Function Attrs: noinline nounwind
define internal fastcc void @_ZN3asm3bar17h6e8a9117df8b0663E() unnamed_addr #0 {
start:
  tail call void asm sideeffect "", "r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4) #3, !srcloc !514
  ret void
}

; asm::baz
; Function Attrs: noinline noreturn nounwind
define internal fastcc void @_ZN3asm3baz17h32a1e9952bcf27a1E() unnamed_addr #1 {
start:
  tail call void asm sideeffect "", "r,r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5) #3, !srcloc !515
; call asm::quux
  tail call fastcc void @_ZN3asm4quux17h6aa0bed8e5be8684E()
  unreachable
}

; asm::quux
; Function Attrs: noinline noreturn nounwind
define internal fastcc void @_ZN3asm4quux17h6aa0bed8e5be8684E() unnamed_addr #1 {
start:
  tail call void asm sideeffect "", "r,r,r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6) #3, !srcloc !516
  br label %bb1

bb1:                                              ; preds = %bb1, %start
  br label %bb1
}

$ cargo objdump --example asm --release -- -d -no-show-raw-insn -print-imm-hex
Disassembly of section .text:
asm::foo::hab8803af3cab0482:
 8000400:       push    {r7, lr}
 8000402:       mov.w   r12, #0x0
 8000406:       mov.w   lr, #0x1
 800040a:       movs    r2, #0x2
 800040c:       movs    r3, #0x3
 800040e:       movs    r0, #0x4
 8000410:       movs    r1, #0x5
 8000412:       pop     {r7, pc}

asm::bar::h0d958609aa53168d:
 8000414:       mov.w   r12, #0x0
 8000418:       movs    r1, #0x1
 800041a:       movs    r2, #0x2
 800041c:       movs    r3, #0x3
 800041e:       movs    r0, #0x4
 8000420:       bx      lr

asm::baz::h3d14c7f076829055:
 8000422:       mov.w   r12, #0x0
 8000426:       mov.w   lr, #0x1 ; <-
 800042a:       movs    r2, #0x2
 800042c:       movs    r3, #0x3
 800042e:       movs    r0, #0x4
 8000430:       movs    r1, #0x5
 8000432:       bl      #0x2
 8000436:       trap

asm::quux::h485b4bd2341abdb3:
 8000438:       mov.w   r12, #0x0
 800043c:       mov.w   lr, #0x1 ; <-
 8000440:       movs    r2, #0x2
 8000442:       movs    r3, #0x3
 8000444:       movs    r0, #0x4
 8000446:       movs    r1, #0x5
 8000448:       movs    r4, #0x6
 800044a:       b       #-0x4 <asm::quux::h485b4bd2341abdb3+0x12>

I can't think of any way to solve this issue from our side given that libcore comes pre-compiled.

Answer 3 · 2018-10-27T21:02:33.000Z

Jumping in this thread. I saw this in an Embedded Rust workshop I ran at work the other day.

Answer 4 · 2018-12-10T02:54:53.000Z

It would be nice if the LLVM backend for ARM targets could be told to not trash LR when the noreturn attribute is used so that stack backtraces weren't broken in the debugger. Which is more important: the small speed/space savings from not spilling LR to the stack before modifying it, or a developer actually being able to debug their code when such a function is in the middle of their stack backtrace?

Answer 5 · 2018-12-10T11:01:26.000Z

AFAICT the difference between noreturn and no-noreturn is not small. The former will not stack any registers until all registers (~10) are used up; the latter will stack registers as soon as the scratch registers (3-5) are used up. I think we should not disable this optimization (in LLVM or in cortex-m-rt), or make it opt-in (makes too big of a diff in perf to have people "forget to enable it").

It would be nice if the LLVM backend for ARM targets could be told to not trash LR when the noreturn attribute is used

If LLVM has this feature then I think it should be opt-out (i.e. enabled by default in the compiler but can be disabled via a compiler flag) for the Cortex-M targets. I haven't seen any switch like that though; I have only seen one for the frame pointer (-fno-omit-frame-pointer).

Answer 6 · 2018-12-10T23:37:37.000Z

I think we should not disable this optimization (in LLVM or in cortex-m-rt), or make it opt-in (makes too big of a diff in perf to have people "forget to enable it").

I wonder how big the performance hit really is as these divergent functions aren't going to be called in a hot loop since they never return to be called more than once.

To corrupt the registers so bad that a developer can't get a stack backtrace in the debugger is pretty bad behavior. It is made even worse by the fact that a lot of times these noreturn functions find themselves in the middle of a call stack that a developer would really be interested in looking at (HardFault, panics, etc).

Answer 7 · 2019-01-07T10:04:27.000Z

I will chime in and say that I have seen the same issue, which unfortunately makes postmortem debugging of embedded systems currently impossible and unteachable.

Answer 8 · 2019-01-08T20:15:37.000Z

Update from the WG meeting 2019-01-08

The main issue is that lr (the link register) is not pushed to the stack and is used as a scratch register in divergent functions.
It is an open question whether this is an intentional LLVM optimization.
We should find out whether panic debugging ever worked and, if so, when it stopped working.
Once we have data, we will open an upstream issue to get more input on how to solve it.
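
A rough way to observe this on a target, sketched as GDB commands. The breakpoint and function names are the ones that appear in the logs earlier in this thread; whether they resolve depends on the symbols in your binary, so treat this as a sketch under those assumptions rather than a recipe.

```gdb
# Stop at the point where the backtrace in the original report starts repeating.
break rust_begin_unwind
continue

# Print the link register and the symbol it currently points into, if any.
info registers lr
info symbol $lr

# Disassemble the function of the repeated frame to check whether its
# prologue pushes lr before lr is reused as a scratch register.
disassemble core::panicking::panic_fmt
```

If the disassembly shows lr being overwritten without an earlier push or store of lr, the debugger has no saved return address from which to recover the caller.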

Answer 9 · 2019-01-10T21:20:38.000Z

I was wondering whether, in the case of a secure OS that can load and execute untrusted payloads, it would make sense not to corrupt these registers, so that the exception handlers can decode the data automatically, log it, and kick out or restart the offending payload (depending on the OS policy, a retry count could be added).

Also note that not trashing LR is essential since it might be possible that a hard fault occurs in the SVC handler because of an offending payload, so the OS should be able to go down the stack and identify the caller of the SVC.

This would make for a very good use case, supplemental to the debugging support.

Granted, probably for such a use case there is a need for stable inline asm (sorry if this is not the right name, I am unfamiliar with the feature names), and maybe even some support for custom linker symbols so the decoding can happen at runtime, automatically.

As for the resolution, at Rust code level, I agree, a per-function opt-in attribute for divergent functions would be a very elegant solution. An alternative might be that all exception handlers are, by default, not trashing LR, even if divergent, but that would probably mean some "special knowledge" in the compiler which will probably not be accepted in llvm.

Answer 10 · 2019-01-15T00:03:48.000Z

Noting that this is presumably the same problem as #158

Answer 11 · 2019-01-15T08:49:30.000Z

Yeah, since the HardFault handler is a divergent function, it will not preserve the link register, and any function call (or other use of LR) within the handler will break the backtrace.

Answer 12 · 2019-01-25T15:39:17.000Z

Quick note: if you are using a tool like gdbgui, which automatically tries to query the stack frames on a crash (and ends up stuck repeatedly querying the corrupted stack), GDB supports limiting the maximum backtrace depth, which keeps the debugger from querying indefinitely. Something like the following worked for me:

set backtrace limit 32

Edit: This can be set at command line, or in a batch file like .gdbinit or -x debug.gdb.
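
For reference, a minimal sketch of such a command file. The file name debug.gdb comes from the edit note above; the `show` line is an added illustration and not part of the original comment.

```gdb
# debug.gdb: load with `arm-none-eabi-gdb -x debug.gdb <your ELF file>`
# Cap every backtrace at 32 frames so a repeating frame chain terminates.
set backtrace limit 32
# Print the active limit to confirm that the setting took effect.
show backtrace limit
```

At the GDB prompt, `backtrace 8` caps a single request at the innermost 8 frames without changing the global limit.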

Answer 13 · 2019-01-25T19:19:43.000Z

set backtrace limit 32

Edit: This can be set at command line, or in a batch file like .gdbinit or -x debug.gdb

Would it make sense to have such a setting enabled by default in any bare metal project?

Answer 14 · 2019-07-22T13:19:10.000Z

Isn't it possible to do some linker tricks to avoid the noreturn?

fn panic(info: &PanicInfo) -> ! {
  if trick_the_optimizer() {
    unsafe { asm!{"ret"} }
  }
  loop {}
}

And then replace trick_the_optimizer during linking and maybe disable LTO if it is too good. Or maybe it is enough to mark it as extern "C" to inhibit optimizations.

Answer 15 · 2019-08-05T18:02:23.000Z

Isn't it possible to do some linker tricks to avoid the noreturn?

I tried various methods, including the one you suggested. Unfortunately, this does not seem to help because there are a couple other noreturn functions between fn panic(info: &PanicInfo) -> ! and the panic site, so LR ends up corrupted anyway.

I have found that one can break on core::panicking::panic instead of rust_begin_unwind and still have a consistent stack. You don't get the panic message because it hasn't been constructed yet, but you do get a complete and non-corrupted backtrace, which is much more valuable in my experience. Perhaps the .gdbinit should be changed so that this is the default.
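
A sketch of what that .gdbinit change could look like. Only the two breakpoint locations come from this thread; the rest is assumed. Panics that do not go through core::panicking::panic (for example, formatted panics that call panic_fmt directly) would presumably need their own breakpoints.

```gdb
# Break at the panic entry point, before the chain of noreturn calls
# has had a chance to overwrite lr.
break core::panicking::panic

# Keep the later breakpoint as well if you still want to stop in the handler.
break rust_begin_unwind

# Guard against a repeating backtrace at the second breakpoint.
set backtrace limit 32
```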

Answer 16 · 2019-11-22T22:11:12.000Z

Is there an upstream issue for this?

Answer 17 · 2019-11-22T22:21:45.000Z

There have been discussions about this with one of the teams (at a previous all-hands), but I am not sure whether there is an issue.

Answer 18 · 2020-02-17T10:18:19.000Z

There is a case in which breaking on core::panicking::panic does not help.
My code controls the PWM output for a power inverter. I run the code with only one breakpoint because an accidental stop may destroy the device.

rust_begin_unwind contains code that stops the device safely. I set a breakpoint after this code so that I can inspect registers and memory, but by then the stack is already corrupted!

Removing noreturn from -> ! functions seems to be the right solution for thumbv7m-none-eabi (and also for thumbv6m-none-eabi and others).

Answer 19 · 2020-02-17T10:31:12.000Z

Submitted an upstream bug report: rust-lang/rust#69231

Answer 20 · 2020-02-17T12:54:33.000Z

Has nobody tried building with -Cforce-frame-pointers=true yet? Is there a minimal example showing the issue?

Answer 21 · 2020-02-17T13:29:05.000Z

Has nobody tried building with -Cforce-frame-pointers=true yet? Is there a minimal example showing the issue?

FYI it's -Cforce-frame-pointers=yes/no, but unfortunately it doesn't seem to make a difference to the backtrace.

I wasn't aware of a minimal example, so I put one together based on the quickstart that you can use. https://github.com/MabezDev/cortex-m-quickstart

Answer 22 · 2020-02-17T13:30:06.000Z

Great, thanks!

Answer 23 · 2020-02-17T20:07:40.000Z

Reposting here for anyone not following the upstream issue: rust-lang/rust#69231 (comment)

Forcing frame pointers on fixes the backtrace issue. Unfortunately, the precompiled core is not built with frame pointers forced, so for now you will have to build your own core with xargo or -Z build-std.

Answer 24 · 2020-02-20T12:44:45.000Z

This should be fixed on the current nightly

Answer 25 · 2020-02-26T20:07:57.000Z

Closing as fixed!

Note that according to rust-lang/rust#69231 (comment) there might be a cheaper way of achieving this, so if someone wants to go after that, feel free! (I won't be spending more time on this as we don't have any easy-to-use tooling for assessing changes like this.)