bombela/backward-cpp

backward-cpp is not signal safe and can dead-lock (resolved: known issue, won't be fixed)

mmurrian opened this issue ยท 15 comments

Nonreentrant functions should not be used in signal handlers. https://www.gnu.org/software/libc/manual/html_node/Nonreentrancy.html

I notice that StackTrace uses a std::vector<void *> for storage that is dynamically allocated inside the signal handler (when load_here/load_from is called in handleSignal). Generally, storage in a signal handler should be allocated on the stack.

How do you properly handle reentrancy? It's not clear to me that you do.

Yes it will usually deadlock when a signal interrupted the process/thread while it was within the malloc functions. In practice this happens when there is memory corruption messing around with the malloc structures and indirectly triggering a page fault.

At this point your program is already done for anyways.

With that said, of course it would be better to be signal safe for those cases. Which as you pointed out, is mostly about avoiding malloc. All of the libraries used to resolve the traces are not signal safe though. So it would require to write one that never allocates.

In practice it seems to be an acceptable trade of.

See also #127

Understood. Thanks for confirming.

I need my "crash handler" to never, itself, crash. So this isn't an acceptable trade-off for me. But that's not your problem.

Overall, this is a well-written project. Thanks again.

You are already crashed past the point of no return if your signal comes from a page fault as a result of memory corruption. You lost all guarantee about the state of the program.

And since this was a memory corruption, even if the signal doesn't crash, the stacktrace will never tell you where the original corruption came from.

If you want to avoid memory corruptions in the first place; with the same control over the machine as C++; I strongly recommend the safer programming language https://www.rust-lang.org/.

Deadlocking is an unacceptable trade-off to any other possible outcome for me. That's pretty straightforward. shrug

My point was that with a memory corruption, anything can happen. Including deadlock with or without backward-cpp.

I am pretty sure the chance of deadlock because of a memory corruption is lower without backward-cpp trying to print a stack trace after a crash in malloc. But it can still happen.

I presume that using C++ instead of a safer language is usually for performances and/or interoperability with C. Rust does that too, with the added benefit of much stronger gaurantees (and auditable) memory safety if that is really important to you.

If course I also don't know your problem domain. Maybe a lower probability of deadlock is all you need to meet your requirements.

I truly don't understand why you're trying to sell me on Rust right now. That's just obnoxious.

Because of asking for guarantees. Rust is nowadays one of the best practical ways to get some guarantees without having any kind of language runtime or a vm. (It is not panacea, sure. Java software also gets CVEs despite eliminating whole classes of bugs that C and C++ programs tend to suffer from.)

Increasing the likelihood of deadlocks is truly unfortunate, though. The obvious ideas of what to do here would be

  1. Set a dead-person switch to kill the process in 5s after signal comes, whatever else might happen. This way the deadlock will only last 5 seconds at most.
  2. Shunt the processing out of the signal handler as soon as possible. Something like https://docs.rs/signal-hook/latest/signal_hook/. Have a waiting thread ready to do the dumping, signal handler will fill some global data with what was captured and flip atomic variable to let the dumper thread do its work https://docs.rs/signal-hook/latest/signal_hook/low_level/channel/index.html#how-does-it-work. Threads do take memory, but it is not too bad.
  3. Accept deadlocks and rely on k8s health checks, etc. to kill what misbehaves.

(and yes, I am now reading some of the other issues and PRs with people smarter and more knowledgeable than me are proposing still more such things.

edit: and putting together a pitch for other folks on project, saying "yes, it will cause some deadlocks, but on balance it will be real good.)

One more obvious idea here would be to fix the bug. That is fundamentally what we're talking about here... it's just a bug.

It is technically possible to implement stack walking and symbols resolving fully signal safe. But this is annoyingly difficult. You have to allocate memory at the start of the program and then write all the code in the stack walker, debug symbol parser/decoder/interpreter and demangling of symbols without ever touching anything but the initially reserved block of memory. Basically, you have to rewrites all the libraries that backward-cpp uses. The amount of work, and the fact that it is easy to miss something means that nobody bothers doing that.

A common alternative is to start a monitor process at the start of the main program. The monitor process will behave like a debugger, and do all the work in its own address space. I think I remember Google Chrome would do something like that. Though when I read the code, they also have an embedded stack printer, which is also not signal safe (it can deadlock via malloc during signal de-mangling): https://chromium.googlesource.com/chromium/src/base/+/refs/heads/main/debug/stack_trace_posix.cc

A common alternative is to start a monitor process at the start of the main program.

I'm approaching this from the point of an application running in OpenShift, which may not have SYS_PTRACE capability. I am not yet entirely clear for what operations this is necessary and for which it isn't, but for example leak sanitizer needs it. google/sanitizers#916

edit: or maybe it doesn't, I can get lsan to work with docker run --cap-drop ALL. ๐Ÿ˜•

Yes, coding can be challenging. The Google code you mentioned has a few comments specifically addressing this. BLUF: They don't have this bug.

Their comment before the signal-unsafe function in question:
// Note: code in this function is NOT async-signal safe (std::string uses // malloc internally).

And here's where they don't call it in a signal handler.
// Below part is async-signal unsafe (uses malloc), so execute it only // when we are not executing the signal handler.

https://chromium.googlesource.com/chromium/src/base/+/refs/heads/main/debug/stack_trace_posix.cc#242

You might also notice the WarmUpBacktrace() function that they use to get some internal glibc memory allocations out of the way before entering the signal context. Their code has provisions throughout specifically to avoid malloc in a signal handler.

Yes, coding can be challenging.

Simplicity is also a feature. A good 90 % solution has a right to exist on its own. Trying to make it 100% will kill some of its good parts. Any attempt to "fix the bug" the "good old safe coding way going by the book and never allocating" should be done as its own forked project, IMO.

"perfect is the enemy of good" https://yarchive.net/comp/linux/tuning_parameters.html

edit: and I do understand that there are those situations where "if you are one bit off, you are completely wrong as a result", and this is, in a sense, one of those. But still, I tend to say: ๐Ÿคท๐Ÿฝ.

They don't have this bug.

If so, good for em! But their backtrace is not so fancy like backward-cpp can do.

Yes they don't have the bug because... they don't resolve the trace :) It merely prints the list of memory addresses and the raw symbols. You can do that trivially yourself, you do not need backward-cpp for that. You can then use addr2line to resolve the location in the source code (of course you must make sure the source code matches the version of the binary).

The other traditional approach is to save a coredump to analyze offline with a debugger later.

I would be interested in a facility like boost::stacktrace::safe_dump_to and boost::stacktrace::stacktrace::from_dump that allows me to write the trace in a async-signal-safe way and then when the program restarts, reads any saved dumps and resolves and parses the stack trace at that point. This can be useful in a lot of services that get automatically restarted on crash.