spfs-monitor crashing due to "thread caused non-unwinding panic. aborting."
Closed this issue · 3 comments
Short version:
spfs-monitor
is silently stopping because the cnproc
library doesn't detect or properly handle some netlink message cases. The result in a misaligned pointer dereference and a crash. netlink is inherently unreliable for message delivery, but spfs-monitor
probably shouldn't be.
Longer version:
I noticed spfs-monitor
silently stopping shortly after launching a shell with spfs run
or spfs shell
. This stops the runtime data being cleaned up when the process using /spfs exits. I did some digging with an instrumented spfs-monitor
and found a message behind it stopping: thread caused non-unwinding panic. aborting.
The panic came from a variety of similar errors:
thread 'tokio-runtime-worker' panicked at 'misaligned pointer dereference: address must be a multiple of 0x8
but is 0x7f55bc000984', [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:213:23
or
thread 'tokio-runtime-worker' panicked at 'misaligned pointer dereference: address must be a multiple of 0x8
but is 0x7fa920001f54', [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:218:23
or
thread 'tokio-runtime-worker' panicked at 'misaligned pointer dereference: address must be a multiple of 0x8
but is 0x7fda580009c4', [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:222:23
These are all in cnproc's parse_msg
function. It is called from a recv()
call in task spawned in spfs-monitor
to read and (re)send events. It's in the monitor.rs file and contains while let Some(event) = monitor.recv() {
where monitor
was created by cnproc::PidMonitor::from_id(0)
.
Here are the relevant bits of the stacktrace:
...
19: 0x55bb67794923 - core::panicking::panic_cannot_unwind::hfaa00866a3a05433
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:188:5
20: 0x55bb68264778 - cnproc::parse_msg::h0915fef7ca2e1dab
at [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:205:1
21: 0x55bb682644b2 - cnproc::PidMonitor::get_events::h9c420d1eb93a7f0d
at [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:175:58
22: 0x55bb68264635 - cnproc::PidMonitor::recv::h1ac8ffffc5f8ec79
at [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:195:19
23: 0x55bb67811848 - spfs::monitor::wait_for_empty_runtime::{{closure}}::{{closure}}::hfe6ec0a2be1d22de
at [myhomedir]/github/spk/crates/spfs/src/monitor.rs:178:41
...
and parse_msg
is:
unsafe fn parse_msg(header: *const nlmsghdr) -> Option<PidEvent> {
let msg = (header as usize + nlmsg_length(0)) as *const cn_msg;
if (*msg).id.idx != binding::CN_IDX_PROC || (*msg).id.val != binding::CN_VAL_PROC {
return None;
};
let proc_ev = (*msg).data.as_ptr() as *const binding::proc_event;
match (*proc_ev).what {
binding::PROC_EVENT_FORK => {
let pid = (*proc_ev).event_data.fork.child_pid; <---- this in line 213
let parent = (*proc_ev).event_data.fork.parent_pid;
Some(PidEvent::Fork { parent, pid })
}
binding::PROC_EVENT_EXEC => {
let pid = (*proc_ev).event_data.exec.process_pid; <---- this is line 218
Some(PidEvent::Exec(pid))
}
binding::PROC_EVENT_EXIT => {
let pid = (*proc_ev).event_data.exit.process_pid; <---- this is line 222
Some(PidEvent::Exit(pid))
}
binding::PROC_EVENT_COREDUMP => {
let pid = (*proc_ev).event_data.coredump.process_pid;
Some(PidEvent::Coredump(pid))
}
_ => None,
}
}
I used this, with a modified spfs-monitor
binary already installed (from the attached branch) to replicate the crashes and see the error:
> env SPFS_MONITOR_FOREGROUND_LOGGING=1 spfs shell --runtime-name test ""
The shell/process starts up and it perfectly usable, but after a few seconds spfs-monitor
is not longer running. You can verify this via spfs runtime list
and watch the test
runtime change to status=unmonitored
, or you'll see the crash error output.
@rydrman my vote is to ditch cnproc since it doesn't (can't) provide 100% reliable notifications and we have a working alternative already.
I'm okay with ditching cnproc - it was an attempt to address concerns with scanning the whole process tree but if it's not reliable then it's not helpful
I'll take a look at removing it.