mvduin/arm-signal-unwind

Integrating minimally and without -fnon-call-exceptions

Closed this issue · 14 comments

We have our own toolchain and flags for an old ARMv5 processor, and we have our own signal handler that doesn't expect C++ exceptions to be thrown for synchronous signals. We want to compile with different flags than the ones in your Makefile, and we really only want to get ::backtrace() to see frames before glibc's internal signal handler.

So my question is, what changes would we have to make to build/use your code, and which flags in your Makefile are 100% necessary?

Also, is there a version of GCC/glibc that fixes this issue?

mvduin commented

I've not tested it on older CPUs, but I guess it should work on any arm-linux-gnueabi system. Instead of using my Makefile you can just embed the source files (excluding test.cc) into your project. The fix itself does not depend on any particular compiler flags (I think). The problem is that while my fix allows unwinding the signal handler stack frame, the next stack frame (the code interrupted by the signal) will usually be non-unwindable, even for synchronous signals, unless that code has been compiled with -fnon-call-exceptions. Whether it's for a backtrace or for throwing a C++ exception is unimportant; the unwinding mechanism used is the same. Additionally, I'm using the -rdynamic linker flag to improve the output of backtrace().
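For readers following along, here is a minimal sketch of the use case being discussed: calling glibc's backtrace() from inside a signal handler. It is not code from this repository. The names `on_signal`, `capture_depth_from_signal` and `kMaxFrames` are made up for illustration, and the signal is delivered with raise() so the example stays safe to run. A real crash handler would be installed for SIGSEGV or similar.

```cpp
#include <cassert>
#include <execinfo.h>   // glibc-specific: backtrace(), backtrace_symbols_fd()
#include <signal.h>
#include <unistd.h>

namespace {

constexpr int kMaxFrames = 64;           // hypothetical limit for this sketch
void* g_frames[kMaxFrames];              // return addresses captured by the handler
volatile sig_atomic_t g_depth = 0;       // number of frames captured

// Signal handler: record the call stack and print it to stderr.
// backtrace_symbols_fd() is used because it does not call malloc().
// Note: backtrace() is not formally async-signal-safe, and its first call
// may load libgcc dynamically, so treat this as a debugging aid only.
void on_signal(int) {
    int n = backtrace(g_frames, kMaxFrames);
    backtrace_symbols_fd(g_frames, n, STDERR_FILENO);
    g_depth = n;
}

}  // namespace

// Installs the handler for `signo`, raises the signal synchronously, and
// returns how many frames the handler managed to capture.
int capture_depth_from_signal(int signo) {
    struct sigaction sa {};
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(signo, &sa, nullptr);
    assert(rc == 0);
    (void)rc;
    g_depth = 0;
    raise(signo);    // the handler runs before raise() returns
    return g_depth;
}
```

Calling `capture_depth_from_signal(SIGUSR1)` prints the captured frames to stderr. Function names only show up in that output when the program is linked with -rdynamic, which is the reason for that flag mentioned above.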

To my knowledge the bug has never been fixed. There have been posts about it a few times, and I've posted my code on the glibc mailing list to be used as a guide on how to fix it in glibc itself (it's not a copy-paste patch). It would just require a small bit of effort and, more importantly, a willingness to jump through legal paperwork hoops (copyright assignment) to have the privilege of being allowed to submit a bugfix to a GNU project.

Thanks for your response. I'll give it a shot. Would you mind briefly explaining what the actual bug is that you have fixed?

Also, I just noticed that using -fnon-call-exceptions seems to fix our stack traces by itself, but I assume your patch handles more cases?

mvduin commented

I did some testing and you're right it does seem like backtrace() manages to work even without my fix... interesting, I don't think that was always the case? Perhaps I'm wrong about that, I don't remember for sure.

It's definitely puzzling why backtrace() would work while exceptions still don't, since they both rely on the same unwinding infrastructure... unless they hacked a special-case workaround into backtrace() to deal with signal stack frames instead of just actually fixing their unwinding? That would be incredibly dumb but I'm not sure how else to explain this. I guess it's possible it's using some kind of stack-scanning as a fallback, in which case using my fix would still be recommended since stack-scanning is inherently unreliable.

I suppose a more detailed explanation of the glibc bug and my fix would be a nice addition to the README... I'll see if I can find the time to add one.

Keep in mind, with our other compiler flags, but without non-call-exceptions, backtrace would work much of the time but sometimes fail and return the same address over-and-over again (which might have been fixed by this, not sure). Now that I have added that flag, my test case is fixed, but then again, moving the crashing code around slightly has broken the backtrace in the past, so it could be possible we are both seeing sporadic successes and your patch really is important.

mvduin commented

I'm no longer puzzled why backtrace() works without my fix. I took another look at glibc's __default_sa_restorer which I mistakenly remembered as being just entirely non-unwindable, but it turned out to actually have enough information to support "virtual unwinding" which is all that's needed for backtraces.

My fix is therefore only needed if you want to throw C++ exceptions from signal handlers (which requires real unwinding, not merely virtual unwinding).
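To illustrate the distinction, the sketch below walks the stack with libgcc's `_Unwind_Backtrace`, which only visits each frame and reports its program counter. No destructors or catch handlers run and the real stack is left alone, which is the "virtual" kind of unwinding a backtrace needs. The names `TraceState`, `record_frame` and `virtual_unwind` are hypothetical, and this is not necessarily how glibc's backtrace() is implemented on every target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unwind.h>   // libgcc unwinder interface

namespace {

struct TraceState {
    std::uintptr_t* out;   // destination buffer for program counters
    std::size_t cap;       // capacity of the buffer
    std::size_t len;       // frames recorded so far
};

// Called by the unwinder once per frame. It only inspects the frame;
// nothing is popped from the real stack.
_Unwind_Reason_Code record_frame(struct _Unwind_Context* ctx, void* arg) {
    auto* st = static_cast<TraceState*>(arg);
    if (st->len == st->cap)
        return _URC_END_OF_STACK;   // buffer full: ask the unwinder to stop
    st->out[st->len++] = static_cast<std::uintptr_t>(_Unwind_GetIP(ctx));
    return _URC_NO_REASON;          // keep walking towards the outermost frame
}

}  // namespace

// Records up to `cap` program counters of the current call stack and
// returns how many were written.
std::size_t virtual_unwind(std::uintptr_t* out, std::size_t cap) {
    TraceState st{out, cap, 0};
    _Unwind_Backtrace(record_frame, &st);
    return st.len;
}
```

Throwing an exception, by contrast, has to actually restore registers and transfer control into a landing pad for each frame it leaves, so it needs complete unwind information for every frame it crosses, including the signal frame.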

It sounds like the issue you were having has two parts:

  1. your signal handler interrupted non-unwindable code, thus causing backtrace() to hit a non-unwindable frame below the signal handler frame, and
  2. backtrace() failed to terminate when it hit the non-unwindable frame and instead ended up in an infinite loop, which is the libgcc bug you linked to.
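Until the library can be updated, one defensive option is to post-process the captured addresses and cut the trace when the same address repeats suspiciously often. This is only a sketch of a workaround, not a fix for the underlying bug. The function name `trim_repeated_frames` and the `max_repeat` threshold are made up here.

```cpp
#include <cassert>
#include <cstddef>

// Returns the number of leading frames worth keeping. If one address occurs
// more than `max_repeat` times in a row, the trace is cut after the first
// `max_repeat` copies. Direct recursion legitimately repeats a return
// address, so `max_repeat` should not be set too low.
std::size_t trim_repeated_frames(void* const* frames, std::size_t n,
                                 std::size_t max_repeat) {
    std::size_t run = 0;   // length of the current run of identical addresses
    for (std::size_t i = 0; i < n; ++i) {
        if (i > 0 && frames[i] == frames[i - 1])
            ++run;
        else
            run = 1;
        if (run > max_repeat)
            return i;      // keep frames[0..i)
    }
    return n;
}
```

The buffer filled by backtrace() can be passed straight in, and the returned count used in place of backtrace()'s own return value when printing.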

Compiling with -fnon-call-exceptions should suffice to ensure code that may cause synchronous signals (e.g. memory accesses) is unwindable, thus hopefully this suffices to reliably fix the problem you were having. (An even stronger option would have been -fasynchronous-unwind-tables to allow unwinding literally anywhere, even from asynchronous signals, but unfortunately this option is not implemented on 32-bit ARM.)

Of course it's still always possible to encounter a non-unwindable stack frame somewhere in the call chain, e.g. C code that has been compiled with -fno-exceptions (though note that -fexceptions is the default on 32-bit ARM even for C code) or assembly code that hasn't been manually annotated with unwinding information. It is therefore still a good idea to ensure you're using a sufficiently recent libgcc that includes this bugfix.

Thank you so much! Two final questions. First, when you say "it is therefore still a good idea to ensure you're using a sufficiently recent libgcc that includes this bugfix," are you referring to the bug I linked? Because I thought that was a bug in libstdc++ (libsupc++). Second, I noticed that omitting either -mapcs-frame or -fno-omit-frame-pointer (while including -fnon-call-exceptions) broke traces again. Is this b/c of an old version of the libraries, or are both of these required? Note that the crashes I am testing with are in my own code compiled with -fnon-call-exceptions.

mvduin commented

Ah I saw the bug was tagged with "Component: libgcc" which makes sense since libgcc is where the unwinder code lives, but you're right I see the actual patch was done in libstdc++.

-fno-omit-frame-pointer ? -mapcs-frame ?? Wait are you still using OABI (the legacy ARM ABI) or something? Those options are definitely not relevant for unwinding on ARM EABI.

No, but our processor is really old. We are still building for ARMv5 of all things (arm9ej-s specifically). Those flags definitely change the output enough to cause traces to fail for crashes in the same function. If they shouldn't be relevant, then I can only imagine that it's an issue with the installed versions of libc/libstdc++, etc. I have upgraded our toolchains, but I haven't been able to update the installed libraries on the remote machines. Those were built back in 2015. Interestingly, building statically usually causes the traces to be worse, but maybe that's b/c when I built gcc from source I didn't build it with -fnon-call-exceptions? At any rate, I will have a chance to rebuild our toolchain and installed libraries soon.

mvduin commented

ARMv5 doesn't imply OABI, e.g. Debian 5 (Lenny) introduced an ARM EABI port for v4T/v5T/v6 and the old OABI port was discontinued in the next release since the new EABI is just better. The debian wiki lists a few ways to tell which ABI a binary has been compiled for, e.g. on armhf I get:

$ objdump -x /bin/true | grep 'private flags'
private flags = 5000400: [Version5 EABI] [hard-float ABI]
$ readelf -h /bin/true | grep Flags
  Flags:                             0x5000400, Version5 EABI, hard-float ABI

Anyway... if you are using OABI then I can't really help you, I've never worked with it and I have no idea how unwinding worked in those prehistoric times. Everything I've said in this thread is applicable to ARM EABI, but I have no idea whether any of it applies to OABI.
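For reference, the flags word printed by readelf above can be decoded by hand. The sketch below does so for the bits relevant here, using the constant values that elf.h defines for ARM (redefined locally so it builds on any host). The function name `describe_arm_eflags` is made up, and only the EABI version and the v5 float-ABI bits are handled.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Values as defined for ARM in elf.h, redefined here for portability.
constexpr std::uint32_t kEabiMask  = 0xFF000000u;  // EF_ARM_EABIMASK
constexpr std::uint32_t kFloatSoft = 0x00000200u;  // EF_ARM_ABI_FLOAT_SOFT
constexpr std::uint32_t kFloatHard = 0x00000400u;  // EF_ARM_ABI_FLOAT_HARD

// Describes the EABI version and, for EABI version 5, the float ABI
// encoded in the e_flags field of an ARM ELF header.
std::string describe_arm_eflags(std::uint32_t e_flags) {
    const std::uint32_t version = (e_flags & kEabiMask) >> 24;
    if (version == 0)
        return "no EABI version recorded";   // typical for legacy (OABI) objects
    std::string s = "Version" + std::to_string(version) + " EABI";
    if (version == 5) {
        // These bits only have the float-ABI meaning in EABI version 5.
        if (e_flags & kFloatHard)
            s += ", hard-float ABI";
        else if (e_flags & kFloatSoft)
            s += ", soft-float ABI";
    }
    return s;
}
```

With the value from the armhf example, `describe_arm_eflags(0x5000400)` yields "Version5 EABI, hard-float ABI", matching what readelf printed.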

Oh yeah, sorry. I edited my post. It's EABI.

mvduin commented

Then what you're saying makes no sense... -mapcs-frame is some obsolete option and frame pointers are not relevant for unwinding on ARM EABI.

And only the code that contains the actual cpu instruction that crashes (and thus gets interrupted by the signal) needs to be compiled with -fnon-call-exceptions, remaining code only needs -fexceptions which should be enabled by default on ARM EABI.

Haha. Well, I'm just reporting what's happening. If I remove one of those flags, I can get a bad/repeating stack trace for the same crash location. -fno-omit-frame-pointer kept the trace from breaking with -O2, and the other kept the trace from breaking just in general. But as I said before, I accepted the possibility that those were just sporadic failures due to changes in the compiled code, with the actual problem being more fundamental.

So, the idea is that I shouldn't need either of those flags, and therefore the unwind implementation I am linking with is buggy, and linking with your implementation could fix it in the meantime?

mvduin commented

My fix is not an unwind implementation, it just assists libgcc's unwind implementation with unwinding the signal stack frame (the stack frame created by the kernel for signal delivery) and, like I've already explained a few comments ago, is only needed if you want to throw an exception out of a signal handler. Since you only want to backtrace, this repository is not useful for you.

The stack trace going into an infinite loop is presumably the bug you mentioned, hence it should be fixable by updating libstdc++ (if you don't want to update the library on the target you could statically link to a newer version). But I don't really know what might be the cause of your problems. I am, however, inclined to agree with your hypothesis that changing compiler flags is probably just affecting the outcome by randomly perturbing the generated code.

If you were using a newer core I'd be inclined to question whether your signals are actually synchronous, but I don't think ARM9EJ-S has any asynchronous exceptions whatsoever. So I guess it's possible you've got a buggy unwinder or a buggy compiler? I really don't know.

Ah well. I've learned a lot nonetheless. I'll close this for now. If I ever find out what is wrong, I'll come add another comment here in case someone else finds this thread. Thanks a lot.