spkenv/spk

spfs-monitor crashing due to "thread caused non-unwinding panic. aborting."

Closed this issue · 3 comments

Short version:

spfs-monitor is silently stopping because the cnproc library doesn't detect or properly handle some netlink message cases. The result in a misaligned pointer dereference and a crash. netlink is inherently unreliable for message delivery, but spfs-monitor probably shouldn't be.

Longer version:

I noticed spfs-monitor silently stopping shortly after launching a shell with spfs run or spfs shell. This stops the runtime data being cleaned up when the process using /spfs exits. I did some digging with an instrumented spfs-monitor and found a message behind it stopping: thread caused non-unwinding panic. aborting.

The panic came from a variety of similar errors:

thread 'tokio-runtime-worker' panicked at 'misaligned pointer dereference: address must be a multiple of 0x8 
but is 0x7f55bc000984', [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:213:23

or

thread 'tokio-runtime-worker' panicked at 'misaligned pointer dereference: address must be a multiple of 0x8 
but is 0x7fa920001f54', [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:218:23

or

thread 'tokio-runtime-worker' panicked at 'misaligned pointer dereference: address must be a multiple of 0x8 
but is 0x7fda580009c4', [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:222:23

These are all in cnproc's parse_msg function. It is called from a recv() call in task spawned in spfs-monitor to read and (re)send events. It's in the monitor.rs file and contains while let Some(event) = monitor.recv() { where monitor was created by cnproc::PidMonitor::from_id(0).

Here are the relevant bits of the stacktrace:

...
  19:     0x55bb67794923 - core::panicking::panic_cannot_unwind::hfaa00866a3a05433
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:188:5
  20:     0x55bb68264778 - cnproc::parse_msg::h0915fef7ca2e1dab
                               at [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:205:1
  21:     0x55bb682644b2 - cnproc::PidMonitor::get_events::h9c420d1eb93a7f0d
                               at [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:175:58
  22:     0x55bb68264635 - cnproc::PidMonitor::recv::h1ac8ffffc5f8ec79
                               at [myhomedir]/.cargo/git/checkouts/cnproc-rs-c25ca6aca5cb1b9d/07184c8/src/lib.rs:195:19
  23:     0x55bb67811848 - spfs::monitor::wait_for_empty_runtime::{{closure}}::{{closure}}::hfe6ec0a2be1d22de
                               at [myhomedir]/github/spk/crates/spfs/src/monitor.rs:178:41
...                               

and parse_msg is:

unsafe fn parse_msg(header: *const nlmsghdr) -> Option<PidEvent> {
    let msg = (header as usize + nlmsg_length(0)) as *const cn_msg;
    if (*msg).id.idx != binding::CN_IDX_PROC || (*msg).id.val != binding::CN_VAL_PROC {
        return None;
    };
    let proc_ev = (*msg).data.as_ptr() as *const binding::proc_event;
    match (*proc_ev).what {
        binding::PROC_EVENT_FORK => {
            let pid = (*proc_ev).event_data.fork.child_pid;           <---- this in line 213
            let parent = (*proc_ev).event_data.fork.parent_pid;
            Some(PidEvent::Fork { parent, pid })
        }
        binding::PROC_EVENT_EXEC => {
            let pid = (*proc_ev).event_data.exec.process_pid;         <---- this is line 218
            Some(PidEvent::Exec(pid))
        }
        binding::PROC_EVENT_EXIT => {
            let pid = (*proc_ev).event_data.exit.process_pid;         <---- this is line 222
            Some(PidEvent::Exit(pid))
        }
        binding::PROC_EVENT_COREDUMP => {
            let pid = (*proc_ev).event_data.coredump.process_pid;
            Some(PidEvent::Coredump(pid))
        }
        _ => None,
    }
}

I used this, with a modified spfs-monitor binary already installed (from the attached branch) to replicate the crashes and see the error:

> env SPFS_MONITOR_FOREGROUND_LOGGING=1 spfs shell --runtime-name test ""

The shell/process starts up and it perfectly usable, but after a few seconds spfs-monitor is not longer running. You can verify this via spfs runtime list and watch the test runtime change to status=unmonitored, or you'll see the crash error output.

jrray commented

@rydrman my vote is to ditch cnproc since it doesn't (can't) provide 100% reliable notifications and we have a working alternative already.

I'm okay with ditching cnproc - it was an attempt to address concerns with scanning the whole process tree but if it's not reliable then it's not helpful

I'll take a look at removing it.