Luminarys/synapse

Unwrapping None while attempting a clean shutdown

Opened this issue · 9 comments

After ctrl-c, I get caught on this unwrap:

let r = self.resources.remove(&id).unwrap();

Stack trace below. Any ideas on how to track this bug down? Or is it just an unwrap that should be an if let Some(r) = ...?

05/24/20 21:50:51 [synapse::init:128] I: Shutting down cleanly. Interrupt again to shut down immediately.
thread 'rpc' panicked at 'called `Option::unwrap()` on a `None` value', src/rpc/processor.rs:624:33
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1069
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1504
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:198
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:218
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:511
  11: rust_begin_unwind
             at src/libstd/panicking.rs:419
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:111
  13: core::panicking::panic
             at src/libcore/panicking.rs:54
  14: core::option::Option<T>::unwrap
             at /rustc/b2e36e6c2d229126b59e892c9147fbb68115d292/src/libcore/macros/mod.rs:34
  15: synapse::rpc::processor::Processor::handle_ctl
             at src/rpc/processor.rs:624
  16: synapse::rpc::RPC::handle_ctl
             at src/rpc/mod.rs:259
  17: synapse::rpc::RPC::run
             at src/rpc/mod.rs:239
  18: synapse::rpc::RPC::start::{{closure}}
             at src/rpc/mod.rs:207
  19: synapse::handle::Handle<I,O>::run::{{closure}}
             at src/handle.rs:39
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
05/24/20 21:50:52 [synapse::init:41] E: Unclean shutdown detected, terminating

I think it's useful to keep that unwrap there, or at least report an error if it's handled via if-let.
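For reference, a minimal sketch of what that could look like (the error! macro from the log crate is only an example and may not match what's actually in scope there):

// Sketch only: handle the missing resource without panicking, but still
// surface the anomaly so the underlying bug stays visible.
if let Some(r) = self.resources.remove(&id) {
    // ... existing handling of the removed resource ...
} else {
    error!("RPC removal requested for unknown resource id {:?}", id);
}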

Tracking this down isn't so simple, since it requires examining every codepath that adds a resource to the RPC handler and then thinking about how a shutdown could trigger an out-of-order delete before the add (or just a failure to add). Is this a bug you see occurring frequently?

It happens every time currently. It must be serialized in a cache file then? I'll see about printing more information in the None case.

That helps with tracking things down then. Maybe try printing the ID, and then in all of the rpc_id-generating functions match on that ID to see where it's coming from.
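For example, something along these lines at the remove site would do (just a sketch; the exact field and type names may differ from the real code):

// Debug-only sketch: log which id is missing and the current table size,
// so the missing ids can be matched against wherever they were generated.
match self.resources.remove(&id) {
    Some(r) => {
        // ... existing removal handling ...
    }
    None => {
        eprintln!("DBG: id not found: {:?}", id);
        eprintln!("DBG: self.resources.len() {}", self.resources.len());
    }
}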

Ok I have more information now that I'm printing the non-present ids:

05/25/20 21:51:36 [synapse::init:128] I: Shutting down cleanly. Interrupt again to shut down immediately.
DBG: id not found: "D3EB14FAF8EADF95AB1E00C588F5A9E47A384CEC"
DBG: self.resources.len() 416
DBG: id not found: "7041EEC6BA6F687BDE3F2641D50097F1E2B7F0AE"
DBG: self.resources.len() 416
DBG: id not found: "2966614EDB1EF630E251664D52BDBA7EB6A522DD"
DBG: self.resources.len() 416
DBG: id not found: "26D3E02B33079FE14825D6BC5773B210FB589876"
DBG: self.resources.len() 216
DBG: id not found: "2DE4731620D5E44F55DA77849A069EEC21AEC575"
DBG: self.resources.len() 216
DBG: id not found: "3B2A920A1EF7664A08AEB8299AFE449832DCC527"
DBG: self.resources.len() 216
DBG: id not found: "5B15FFC79AED3C26B650747C6BFC6B1E99523AE0"
DBG: self.resources.len() 31
DBG: id not found: "DA8AFC017D7A1BDB0F128259991DCC78FB3073CF"
DBG: self.resources.len() 31
DBG: id not found: "07098728BD8EB60DAD3773D0EA2BBF4C6D5D9A5A"
DBG: self.resources.len() 31
DBG: id not found: "A006F7F78A5CAA5EBB13561454DF7C29692F14FD"
DBG: self.resources.len() 22
DBG: id not found: "06C3E555F55C295948E788DC6F55274EBCFA0045"
DBG: self.resources.len() 22
DBG: id not found: "D71C9FBB0BDEBAF020371C9BD48616E6F8E7C60F"
DBG: self.resources.len() 22
05/25/20 21:51:39 [synapse::init:45] I: Shutdown complete

Looks like there's always 3 failures in a row. The first failure in each set of 3 is a valid torrent that I have seeding. I don't know what the other two are.

Also, I get a different set of IDs each run. This seems like a race condition of some sort, where perhaps another thread has already dumped the same resource.

Here's the call that is sending the Removed request:

self.cio.msg_rpc(rpc::CtlMessage::Removed(r));

Looking into this, it seems like either a file or a tracker wasn't properly added. Can you add some tracing to the file_rpc_id/tracker_rpc_id functions and see if they're called with those hashes? My guess is that it's files, since there's a conditional clause around magnets for it, but looking through that logic I don't see what's wrong. Can you try to identify the torrent that causes this and give some basic information about its state?
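For the tracing, something as simple as this right before file_rpc_id / tracker_rpc_id return would confirm it (sketch only; id stands for whatever local holds the generated value, and the hash is just one of the missing ids from your log):

// Sketch: flag when a generated rpc id matches one of the "id not found"
// hashes above, so the code path producing it can be identified.
const SUSPECT: &str = "D3EB14FAF8EADF95AB1E00C588F5A9E47A384CEC";
if id == SUSPECT {
    eprintln!("DBG: suspect rpc id generated here: {}", id);
}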

I added a trace to send_rpc_removal to see what it was requesting to remove, and it is sending a removal request twice. See this trace: this set of 3 was removed successfully, then about 20 requests later it was sent again.

DBG: sending rpc msg: removing ["1B78D8A59F99F137E76CD57DFCCB3263D366FC1C", "78A81F7C24D41FFEF8553B57BA071C98523AE4A2", "1CD1D6C1B28D6F4A7E1BBDF6DB28074353EA89C0"]
<snip>
DBG: sending rpc msg: removing ["1B78D8A59F99F137E76CD57DFCCB3263D366FC1C", "78A81F7C24D41FFEF8553B57BA071C98523AE4A2", "1CD1D6C1B28D6F4A7E1BBDF6DB28074353EA89C0"]
DBG: id not found: "1B78D8A59F99F137E76CD57DFCCB3263D366FC1C"
DBG: self.resources.len() 138
DBG: id not found: "78A81F7C24D41FFEF8553B57BA071C98523AE4A2"
DBG: self.resources.len() 138
DBG: id not found: "1CD1D6C1B28D6F4A7E1BBDF6DB28074353EA89C0"
DBG: self.resources.len() 138

I wonder how it's possible that this is being dropped twice, since that's the only place this code is called. I'll have to do some more thinking, but I recommend adding some more tracing around the object to get more context on how this is happening.
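If it helps, capturing a backtrace at the point the removal is queued would show exactly who requests it the second time (sketch; assumes the backtrace crate, which is already somewhere in the dependency tree judging by the panic output, and ids stands for whatever list of ids send_rpc_removal is about to send):

// Sketch: inside send_rpc_removal, log the ids together with a captured
// backtrace so the two removal requests can be traced back to their callers.
eprintln!("DBG: sending rpc msg: removing {:?}", ids);
eprintln!("DBG: removal requested from:\n{:?}", backtrace::Backtrace::new());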

So I was looking at this again yesterday, and couldn't reproduce the issue. Now I'm really not sure how to proceed!

It could have been induced by a tracker not giving the expected input, or by a peer, or by anything else.

At any rate, I'm not having unclean shutdown issues anymore.