Infinite backtrace, odd backtrace behavior and "corrupted stack?" in GDB
japaric opened this issue · 25 comments
STR
$ cargo generate --git ~/rust-embedded/cortex-m-quickstart --name app && cd app
$ cargo add panic-abort
$ # modify examples/panic.rs to use panic-abort
$ # change memory.x to match the memory layout of the STM32F3DISCOVERY
$ # uncomment thumbv7em-none-eabi target and gdb runner in .cargo/config
$ cargo run --example panic
(gdb) backtrace # "corrupt stack?"
#0 DefaultPreInit ()
at /home/japaric/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rt-0.6.5/src/lib.rs:556
#1 0x08000412 in Reset ()
at /home/japaric/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rt-0.6.5/src/lib.rs:496
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) continue
Breakpoint 4, main () at examples/panic.rs:29
29 panic!("Oops")
(gdb) backtrace # where did Reset go?
#0 main () at examples/panic.rs:29
(gdb) continue
Breakpoint 3, rust_begin_unwind (_info=0x10001fb4) at (..)
31 unsafe { intrinsics::abort() }
(gdb) backtrace # infinite backtrace
#0 rust_begin_unwind (_info=0x10001fb4) at (..)
#1 0x080006b0 in core::panicking::panic_fmt () at libcore/panicking.rs:77
#2 0x080006b0 in core::panicking::panic_fmt () at libcore/panicking.rs:77
#3 0x080006b0 in core::panicking::panic_fmt () at libcore/panicking.rs:77
(..)Metadata
$ rustc -V
rustc 1.31.0-nightly (f99911a4a 2018-10-23)
$ grep cortex-m-rt Cargo.lock
"cortex-m-rt 0.6.5 (registry+https://github.com/rust-lang/crates.io-index)"
$ arm-none-eabi-gdb --version
GNU gdb (GDB) 8.2Has anyone experienced these issues? I've seen them with previous releases of cortex-m-rt so I don't think they are related to the recent changes to HardFault. Also, in the above logs all backtrace invocations are before reaching HardFault.
I did see similar issues when I was running through "Chapter 6 - Hello, world!" of the Discovery book. I took a quick look at the time and seem to remember that the panic related functions were modifying important registers like LR without first saving them on the stack. I didn't really dig into it since I am a Rust noob and I figured that was the expected behavior for Rust's panic handling code. If you want, I can investigate this. I don't think my HardFault change has anything to do with this but I can make sure that is the case and I think they are probably a similar type of issue.
You are correct. core::panicking::panic modifies LR without first pushing it onto the stack.
$ cd $(rustc --print sysroot)
$ arm-none-eabi-objdump -Cd lib/rustlib/thumbv7m-none-eabi/lib/libcore-*.rlib
(..)
00000000 <core::panicking::panic>:
0: b08c sub sp, #48 ; 0x30
2: e890 1006 ldmia.w r0, {r1, r2, ip}
6: e9d0 3e03 ldrd r3, lr, [r0, #12]
a: 6940 ldr r0, [r0, #20]
c: e9cd 1206 strd r1, r2, [sp, #24]
10: a906 add r1, sp, #24
12: f240 0200 movw r2, #0
16: 9100 str r1, [sp, #0]
18: 2101 movs r1, #1
1a: f2c0 0200 movt r2, #0
1e: 9101 str r1, [sp, #4]
20: 2100 movs r1, #0
22: e9cd 1102 strd r1, r1, [sp, #8]
26: e9cd 2104 strd r2, r1, [sp, #16]
2a: a908 add r1, sp, #32
2c: e9cd c308 strd ip, r3, [sp, #32]
30: e9cd e00a strd lr, r0, [sp, #40] ; 0x28
34: 4668 mov r0, sp
36: f7ff fffe bl 0 <core::panicking::panic>
3a: defe udf #254 ; 0xfeIt seems that all divergent functions (fn(..) -> !) are given the noreturn attribute in LLVM-IR and this gives LLVM permission to thrash the LR register when optimizing the code. See example below:
#![feature(asm)]
#![no_main]
#![no_std]
extern crate panic_abort;
use cortex_m_rt::entry;
#[entry]
fn main() -> ! {
foo();
bar();
baz()
}
#[inline(never)]
fn foo() {
unsafe {
asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) :: "volatile");
}
}
#[inline(never)]
fn bar() {
unsafe {
asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) :: "volatile");
}
}
#[inline(never)]
fn baz() -> ! {
unsafe {
asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) :: "volatile");
}
quux()
}
#[inline(never)]
fn quux() -> ! {
unsafe {
asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) "r"(6) :: "volatile");
}
loop {}
}$ cargo rustc --example asm --release -- --emit=llvm-ir
$ cat $(find -name '*.ll')
; asm::foo
; Function Attrs: noinline nounwind
define internal fastcc void @_ZN3asm3foo17h42aa59da8b4484e3E() unnamed_addr #0 {
start:
tail call void asm sideeffect "", "r,r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5) #3, !srcloc !513
ret void
}
; asm::bar
; Function Attrs: noinline nounwind
define internal fastcc void @_ZN3asm3bar17h6e8a9117df8b0663E() unnamed_addr #0 {
start:
tail call void asm sideeffect "", "r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4) #3, !srcloc !514
ret void
}
; asm::baz
; Function Attrs: noinline noreturn nounwind
define internal fastcc void @_ZN3asm3baz17h32a1e9952bcf27a1E() unnamed_addr #1 {
start:
tail call void asm sideeffect "", "r,r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5) #3, !srcloc !515
; call asm::quux
tail call fastcc void @_ZN3asm4quux17h6aa0bed8e5be8684E()
unreachable
}
; asm::quux
; Function Attrs: noinline noreturn nounwind
define internal fastcc void @_ZN3asm4quux17h6aa0bed8e5be8684E() unnamed_addr #1 {
start:
tail call void asm sideeffect "", "r,r,r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6) #3, !srcloc !516
br label %bb1
bb1: ; preds = %bb1, %start
br label %bb1
}
$ cargo objdump --example asm --release -- -d -no-show-raw-insn -print-imm-hex
Disassembly of section .text:
asm::foo::hab8803af3cab0482:
8000400: push {r7, lr}
8000402: mov.w r12, #0x0
8000406: mov.w lr, #0x1
800040a: movs r2, #0x2
800040c: movs r3, #0x3
800040e: movs r0, #0x4
8000410: movs r1, #0x5
8000412: pop {r7, pc}
asm::bar::h0d958609aa53168d:
8000414: mov.w r12, #0x0
8000418: movs r1, #0x1
800041a: movs r2, #0x2
800041c: movs r3, #0x3
800041e: movs r0, #0x4
8000420: bx lr
asm::baz::h3d14c7f076829055:
8000422: mov.w r12, #0x0
8000426: mov.w lr, #0x1 ; <-
800042a: movs r2, #0x2
800042c: movs r3, #0x3
800042e: movs r0, #0x4
8000430: movs r1, #0x5
8000432: bl #0x2
8000436: trap
asm::quux::h485b4bd2341abdb3:
8000438: mov.w r12, #0x0
800043c: mov.w lr, #0x1 ; <-
8000440: movs r2, #0x2
8000442: movs r3, #0x3
8000444: movs r0, #0x4
8000446: movs r1, #0x5
8000448: movs r4, #0x6
800044a: b #-0x4 <asm::quux::h485b4bd2341abdb3+0x12>I can't think of any way to solve this issue from our side given that libcore comes pre-compiled.
Jumping in this thread. I saw this in an Embedded Rust workshop I ran at work the other day.
It would be nice if the LLVM backend for ARM targets could be told to not trash LR when the noreturn attribute is used so that stack backtraces weren't broken in the debugger. What's more important, the little bit of speed/space savings from not spilling LR to the stack before modification or allowing a developer to actually be able to debug their code when such code is in the middle of their stack backtrace?
AFAICT the difference between noreturn and no-noreturn is not small. The former will not stack any registers until all registers (~10) are used up; the latter will stack registers as soon as the scratch registers (3-5) are used up. I think we should not disable this optimization (in LLVM or in cortex-m-rt), or make it opt-in (makes too big of a diff in perf to have people "forget to enable it").
It would be nice if the LLVM backend for ARM targets could be told to not trash LR when the noreturn
attribute is used
If LLVM has this feature then I think it should be opt-out (i.e. enabled by default in the compiler but can be disabled via a compiler flag) for the Cortex-M targets. I haven't seen any switch like that though; I have only seen one for the frame pointer (-fno-omit-frame-pointer).
I think we should not disable this optimization (in LLVM or in cortex-m-rt), or make it opt-in (makes too big of a diff in perf to have people "forget to enable it").
I wonder how big the performance hit really is as these divergent functions aren't going to be called in a hot loop since they never return to be called more than once.
To corrupt the registers so bad that a developer can't get a stack backtrace in the debugger is pretty bad behavior. It is made even worse by the fact that a lot of times these noreturn functions find themselves in the middle of a call stack that a developer would really be interested in looking at (HardFault, panics, etc).
I will chime in and say that I have seen the same issue, which unfortunately makes postmortem debugging of embedded systems currently impossible and unteachable.
Update from the WG meeting 2019-01-08
- The main issue with this is that the
lr(link register) is not pushed to the stack, and used as a scratch register for divergent functions - There is a question if this is an intentional LLVM optimization
- We should collect if panic debugging worked, and when it stopped working
- When we have data, we will make an upstream issue to get more input on how to solve it
I was wondering that in the case of a secure OS which has the ability to load and execute untrusted payloads it would make sense not to corrupt these so the exception handlers can decode the data automatically, log it and kick out or restart the offending payload (depending on the OS policy a retry count could be added).
Also note that not trashing LR is essential since it might be possible that a hard fault occurs in the SVC handler because of an offending payload, so the OS should be able to go down the stack and identify the caller of the SVC.
This would make for a very good use case, supplemental to the debugging support.
Granted, probably for such a use case there is a need for stable inline asm (sorry if this is not the right name, I am unfamiliar with the feature names), and maybe even some support for custom linker symbols so the decoding can happen at runtime, automatically.
As for the resolution, at Rust code level, I agree, a per-function opt-in attribute for divergent functions would be a very elegant solution. An alternative might be that all exception handlers are, by default, not trashing LR, even if divergent, but that would probably mean some "special knowledge" in the compiler which will probably not be accepted in llvm.
Yeah, with the HardfaultHandler is a divergent function it will not protect the link register, and any function call (or other use of LR) within the handler will break backtrace..
Quick note if you are using a tool like gdbgui which automatically tries to query the stack frame on a crash (and ends up stuck recursively trying to query the corrupted stack frame), gdb does support limiting the max backtrace depth, which will keep the debugger from infinitely querying. Something like the following worked for me:
set backtrace limit 32
Edit: This can be set at command line, or in a batch file like .gdbinit or -x debug.gdb.
set backtrace limit 32
Edit: This can be set at command line, or in a batch file like .gdbinit or -x debug.gdb
Would it make sense to have such a setting enabled by default in any bare metal project?
Isn't it possible to do some linker tricks to avoid the noreturn?
fn panic(info: &PanicInfo) -> ! {
if trick_the_optimizer() {
unsafe { asm!{"ret"} }
}
loop {}
}And then replace trick_the_optimizer during linking and maybe disable LTO if it is too good. Or maybe it is enough to mark it as extern "C" to inhibit optimizations.
Isn't it possible to do some linker tricks to avoid the
noreturn?
I tried various methods, including the one you suggested. Unfortunately, this does not seem to help because there are a couple other noreturn functions between fn panic(info: &PanicInfo) -> ! and the panic site, so LR ends up corrupted anyway.
I have found that one can break on core::panicking::panic instead of rust_begin_unwind and still have a consistent stack. You don't get the panic message because it hasn't been constructed yet, but you do get a complete and non-corrupted backtrace, which is much more valuable in my experience. Perhaps the .gdbinit should be changed so that this is the default.
Is there an upstream issue for this?
There have been discussions about this with one of the teams (at previous all hands), not sure if there is an issue
There is a case in which break on core::panicking::panic does not help.
My code controls the PWM output for a power inverter. I run the code with only one breakpoint, as in case of an accidental stop, the device may be destructed.
The rust_begin_unwind contains code to correct stoping of device. After this code I set breakpoint and want to watch registers and memory. But now stack already corrupted!
Removing noreturn from a -> ! function seems to be the right solution for thumbv7m-none-eabi (also thumbv6m-none-eabi and et.al.)
Submitted an upstream bug report: rust-lang/rust#69231
Has nobody tried building with -Cforce-frame-pointers=true yet? Is there a minimal example showing the issue?
Has nobody tried building with -Cforce-frame-pointers=true yet? Is there a minimal example showing the issue?
FYI it's -Cforce-frame-pointers=yes/no, but unfortunately it doesn't seem to make a difference to the backtrace.
I wasn't aware of a minimal example, so I put one together based on the quickstart that you can use. https://github.com/MabezDev/cortex-m-quickstart
Great, thanks!
Reposting here for anyone not following the upstream issue: rust-lang/rust#69231 (comment)
Forcing framepointers on fixes the backtrace issue, unfortunately the precompiled core doesn't force framepointers hence you will have to build your own with xargo or -Z build-std at the moment.
This should be fixed on the current nightly
Closing as fixed!
Note that according to rust-lang/rust#69231 (comment) there might be a cheaper way of achieving this, so if someone wants to go after that, feel free! (I won't be spending more time on this as we don't have any easy to use tooling for assessing changes like this)