oxidecomputer/propolis

accessor Node::poison deadlocks with racing drop

Closed this issue · 0 comments

I had a propolis-standalone instance deadlock while destroying a VM during halt. The offending stack is as follows:

stack pointer for thread 39 [state loop]: fffffc7fe39ff8e0
[ fffffc7fe39ff8e0 libc.so.1`__lwp_park+0x17() ]
  fffffc7fe39ff930 libc.so.1`mutex_lock_impl+0x113(1bcd790, 0)
  fffffc7fe39ff950 libc.so.1`mutex_lock+0x13(1bcd790)
  fffffc7fe39ffb80 <propolis::accessors::Node<T> as core::ops::drop::Drop>::drop::h4f759e432fa68c42+0x90()
  fffffc7fe39ffba0 alloc::sync::Arc<T,A>::drop_slow::h7e635989f14d1904+0x12()
  fffffc7fe39ffca0 propolis::accessors::Accessor<T>::poison::h4c27da6f9d986cc6+0x563()
  fffffc7fe39ffcc0 propolis::vmm::machine::Machine::do_destroy::h4bf1bb9dda81db12+0x24()
  fffffc7fe39ffd00 <propolis::vmm::machine::Machine as core::ops::drop::Drop>::drop::hf798d7b8b32cf6f4+0x14()
  fffffc7fe39ffe70 propolis_standalone::Instance::state_loop::h9bb359840508caaa+0xecf()
  fffffc7fe39ffed0 std::sys_common::backtrace::__rust_begin_short_backtrace::h55883990d859e2f8 +0x3e()
  fffffc7fe39fff60 core::ops::function::FnOnce::call_once{{vtable.shim}}::h1f25fcf57c71bbd5+0x85()
  fffffc7fe39fffb0 std::sys::unix::thread::Thread::new::thread_start::ha7be144980000e90+0x29()
  fffffc7fe39fffe0 libc.so.1`_thrp_setup+0x77(fffffc7feecb2a40)
  fffffc7fe39ffff0 libc.so.1`_lwp_start()

This is in poison:

    fn poison(&self) -> Option<Arc<T>> {
        self.lock_tree(|mut tguard, ent| {
            let id = ent.id;
            drop(ent);
            if id != ID_ROOT {
                drop(tguard);
                panic!("tree poisoning only allowed at root");
            }

            // Remove the resource from the tree...
            let top_res = tguard.resource_root.take();
            // ... and poison all nodes too
            if top_res.is_some() {
                tguard.for_each(ID_ROOT, |_id, tnode| {
                    if let Some(node) = tnode.node_ref.upgrade() {
                        node.0.lock().unwrap().resource.take();
                    }
                });
            }

            top_res
        })

When take()-ing the resource from nodes in the tree during the for_each(), if the node in question is dropped by its holder, then when the local copy we got from the upgrade() call is dropped, we'll find ourself in Node::drop which will attempt to lock the tree (which is already held) and deadlock.