rust-lang/rust

Panic in Receiver::recv()

jamespharaoh opened this issue · 61 comments

I'm having a problem using a channel to send notifications to a background thread... As I understand it, this shouldn't be possible? Receiver::recv() is not supposed to panic, is it? I'm not doing anything unsafe...

Mentioned this on reddit and burntsushi has confirmed that this looks like a bug. I will try to produce some simpler code to reproduce this but don't have time at the moment. (Edit - please nag me if I don't produce an example; I have a vague idea of what will do it.)

I've tried this with stable (1.14) and nightly and get the same result.

A copy of the code which generates the error is available here:

https://github.com/jamespharaoh/rust-output/tree/channel_recv_panic

thread '<unnamed>' panicked at 'internal error: entered unreachable code', ../src/libstd/sync/mpsc/mod.rs:879
stack backtrace:
   1:     0x55ab04830c1a - std::sys::imp::backtrace::tracing::imp::write::h917062bce4ff48c3
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
   2:     0x55ab0483672f - std::panicking::default_hook::{{closure}}::h0bacac31b5ed1870
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:247
   3:     0x55ab04834666 - std::panicking::default_hook::h5897799da33ece67
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:263
   4:     0x55ab04834d17 - std::panicking::rust_panic_with_hook::h109e116a3a861224
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:451
   5:     0x55ab045b4d73 - std::panicking::begin_panic::h634e2b37a96f78d4
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:413
   6:     0x55ab045b9ce0 - <std::sync::mpsc::Receiver<T>>::recv::h59b94a6df5881f84
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/macros.rs:44
   7:     0x55ab045e5ff3 - output::output_state::OutputState::background_thread::h8aaba65c8ed499f1
                        at /home/james/projects/storage/rust-output/src/output_state.rs:261
   8:     0x55ab045ea743 - output::output_state::OutputState::new::{{closure}}::hf69a0505c40cdaf3
                        at /home/james/projects/storage/rust-output/src/output_state.rs:79
   9:     0x55ab045dfeda - <std::panic::AssertUnwindSafe<F> as core::ops::FnOnce<()>>::call_once::hb24d824e514ccbad
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/panic.rs:295
  10:     0x55ab045b53d7 - std::panicking::try::do_call::h3a6460a838b6cf70
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:356
  11:     0x55ab0483e20a - __rust_maybe_catch_panic
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libpanic_unwind/lib.rs:97
  12:     0x55ab045b4eb6 - std::panicking::try::h3133aaaba181f2ff
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:332
  13:     0x55ab045b389e - std::panic::catch_unwind::h0d4cf58b5fb1c352
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/panic.rs:351
  14:     0x55ab045ea000 - std::thread::Builder::spawn::{{closure}}::h4fae90249c97a5c1
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/thread/mod.rs:287
  15:     0x55ab045ce9fe - <F as alloc::boxed::FnBox<A>>::call_box::hef839d7658cc4ef9
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/liballoc/boxed.rs:595
  16:     0x55ab04833d64 - std::sys::imp::thread::Thread::new::thread_start::ha102a6120fc52763
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/liballoc/boxed.rs:605
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/sys_common/thread.rs:21
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/obj/../src/libstd/sys/unix/thread.rs:84
  17:     0x7fc0c43506b9 - start_thread
  18:     0x7fc0c3e7082c - clone
  19:                0x0 - <unknown>

@jamespharaoh do you have a standalone example, for example one that could be cargo run'd, to reproduce? The code in that repository seems to just be a library?

I've attempted to recreate this in a simple piece of code but haven't had any success. I will keep trying, but perhaps someone can give me some advice... I have a feeling this bug might be in some other synchronization that's going on, since I'm only creating one Sender and one Receiver, then the sender is passed around behind an Arc<Mutex>. These are being created in the main thread and passed to worker threads.

So it seems to me that if whatever machinery does that is not correctly synchronizing its state as it is cloned, locked, unlocked, and sent between threads, then it could certainly produce incorrect behaviour in the Sender behind it. Does that sound reasonable?

I am also using various C libraries etc., but it seems unlikely that a problem with those would so reliably cause an error in the same place. Does this also seem reasonable?
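
To make the setup concrete, here is a minimal sketch of that shape (not the actual rust-output code; the message type and names are made up): one background thread owning the Receiver, and the single Sender shared between worker threads behind an Arc<Mutex>.

use std::sync::{Arc, Mutex};
use std::sync::mpsc::channel;
use std::thread;

fn main() {
    let (tx, rx) = channel::<String>();

    // Background thread: owns the single Receiver and drains notifications.
    let background = thread::spawn(move || {
        while let Ok(msg) = rx.recv() {
            println!("notification: {}", msg);
        }
    });

    // The single Sender, shared between workers behind an Arc<Mutex>.
    let shared_tx = Arc::new(Mutex::new(tx));

    let workers: Vec<_> = (0..4).map(|id| {
        let shared_tx = Arc::clone(&shared_tx);
        thread::spawn(move || {
            let guard = shared_tx.lock().unwrap();
            guard.send(format!("hello from worker {}", id)).unwrap();
        })
    }).collect();

    for worker in workers {
        worker.join().unwrap();
    }

    // Dropping the last Arc drops the only Sender, so recv() returns Err
    // and the background thread exits.
    drop(shared_tx);
    background.join().unwrap();
}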

Yeah if there's other unsafe code that may be something to take a look at as well, but otherwise we haven't seen a segfault in channels in a very long time so nothing jumps to mind unfortunately :(

The other unsafe code is just compression libraries linked in, well trusted and heavily used, and accessed in a very threadsafe way. And despite lots of concurrency they work flawlessly - the only error I am seeing is always this same receiver issue.

My next experiment to reproduce it will add in another element from the app where I see the problem, which is your futures_cpupool library as a way of passing the Sender around. I will be back with results...

Note that #40156 is similar to this so that makes me pretty confident it's not the fault of your local unsafe code. @jamespharaoh did you get anywhere with more investigation?

I haven't had a chance but will try soon.

I have a similar problem: recv_timeout panics.
Sadly, I can't predict when it happens: the bug doesn't appear every time.

                               at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1:     0x56287ebb9a54 - std::sys_common::backtrace::_print::hd8a1b72dcf3955ef
                               at /checkout/src/libstd/sys_common/backtrace.rs:71
   2:     0x56287ebc149c - std::panicking::default_hook::{{closure}}::h5ff605bba7612658
                               at /checkout/src/libstd/sys_common/backtrace.rs:60
                               at /checkout/src/libstd/panicking.rs:355
   3:     0x56287ebc1064 - std::panicking::default_hook::h9bc4f6dfee57d6bd
                               at /checkout/src/libstd/panicking.rs:371
   4:     0x56287ebc18eb - std::panicking::rust_panic_with_hook::hdc01585dc2bf7122
                               at /checkout/src/libstd/panicking.rs:549
   5:     0x56287ebc17c4 - std::panicking::begin_panic::hf84f4975d9f9b642
                               at /checkout/src/libstd/panicking.rs:511
   6:     0x56287ebc16f9 - std::panicking::begin_panic_fmt::hcc3f360b2ba80419
                               at /checkout/src/libstd/panicking.rs:495
   7:     0x56287e3cbbc6 - <std::sync::mpsc::shared::Packet<T>>::decrement::he4fa9520181c5c85
                               at /checkout/src/libstd/macros.rs:51
   8:     0x56287e3c806f - <std::sync::mpsc::shared::Packet<T>>::recv::h3c95f5bc336537aa
                               at /checkout/src/libstd/sync/mpsc/shared.rs:232
   9:     0x56287e3ad379 - <std::sync::mpsc::Receiver<T>>::recv_max_until::h950909094e0767d9
                               at /checkout/src/libstd/sync/mpsc/mod.rs:966
  10:     0x56287e3acc85 - <std::sync::mpsc::Receiver<T>>::recv_timeout::hf72a64a0530efaa1
                               at /checkout/src/libstd/sync/mpsc/mod.rs:940
  11:     0x56287e466841 - <mould_extension::ConnectExtensionWorker as mould::worker::Worker<T>>::realize::hf8b7190433e70336
                               at /home/denis/vxrevenue/cloud/sub/mould-extension/src/lib.rs:129
  12:     0x56287e41cd77 - mould::server::process_session::{{closure}}::h0572e63ea7bd3be9
                               at /home/denis/vxrevenue/cloud/sub/mould/src/server.rs:83
  13:     0x56287e41b69a - mould::server::process_session::h54610d99cf99088f
                               at /home/denis/vxrevenue/cloud/sub/mould/src/server.rs:44
  14:     0x56287e41f76b - mould::server::wsmould::start::{{closure}}::hb36d2fb80ee5ded6
                               at /home/denis/vxrevenue/cloud/sub/mould/src/server.rs:272
  15:     0x56287e469345 - <std::panic::AssertUnwindSafe<F> as core::ops::FnOnce<()>>::call_once::h8ef904bc75108aeb
                               at /checkout/src/libstd/panic.rs:296
  16:     0x56287e3a2e7a - std::panicking::try::do_call::h0979d3031b45f486
                               at /checkout/src/libstd/panicking.rs:454
  17:     0x56287ebc897a - __rust_maybe_catch_panic
                               at /checkout/src/libpanic_unwind/lib.rs:98
  18:     0x56287e3a26de - std::panicking::try::h42b9334978084c46
                               at /checkout/src/libstd/panicking.rs:433
  19:     0x56287e39d4a3 - std::panic::catch_unwind::h5ea213ef0eb7edd1
                               at /checkout/src/libstd/panic.rs:361
  20:     0x56287e3a1a86 - std::thread::Builder::spawn::{{closure}}::h1288ffa1c4d83635
                               at /checkout/src/libstd/thread/mod.rs:360
  21:     0x56287e3fca66 - <F as alloc::boxed::FnBox<A>>::call_box::h1b125a486a246990
                               at /checkout/src/liballoc/boxed.rs:640
  22:     0x56287ebc0714 - std::sys::imp::thread::Thread::new::thread_start::h75b208405df6dcf1
                               at /checkout/src/liballoc/boxed.rs:650
                               at /checkout/src/libstd/sys_common/thread.rs:21
                               at /checkout/src/libstd/sys/unix/thread.rs:84
  23:     0x7f7d554096c9 - start_thread
  24:     0x7f7d54f2cf7e - clone
  25:                0x0 - <unknown>

Maybe it's important:

thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
(left: `140664964694432`, right: `0`)', /checkout/src/libstd/sync/mpsc/shared.rs:503

https://doc.rust-lang.org/src/std/sync/mpsc/shared.rs.html#503

impl<T> Drop for Packet<T> {
    fn drop(&mut self) {
        // Note that this load is not only an assert for correctness about
        // disconnection, but also a proper fence before the read of
        // `to_wake`, so this assert cannot be removed with also removing
        // the `to_wake` assert.
        assert_eq!(self.cnt.load(Ordering::SeqCst), DISCONNECTED);
/*503*/ assert_eq!(self.to_wake.load(Ordering::SeqCst), 0); 
        assert_eq!(self.channels.load(Ordering::SeqCst), 0);
    }
}

It isn't only one thread that panics. Another thread panics here:

    fn decrement(&self, token: SignalToken) -> StartResult {
        unsafe {
/*253*/     assert_eq!(self.to_wake.load(Ordering::SeqCst), 0);
            let ptr = token.cast_to_usize();
            self.to_wake.store(ptr, Ordering::SeqCst);

Hi @alexcrichton, I've made an example which often fails and sometimes panics on receive:
https://github.com/DenisKolodin/rfail
It uses a lot of short-lived channels. It works the following way (a simplified sketch of the structure follows the list):

  1. spawn a thread with a hash map that inserts or removes senders (like a registration broker)
  2. spawn a thread which waits for a registration, sends 2 requests and waits for the responses
  3. also spawn a thread which registers itself and waits for the 2 requests in order to respond to them
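
A hypothetical sketch of that structure (not the actual rfail code; names, IDs, and message types are made up):

use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread;

enum Brokered {
    Register(u64, Sender<String>),
    Unregister(u64),
    Forward(u64, String),
}

fn main() {
    let (broker_tx, broker_rx) = channel::<Brokered>();

    // 1. Broker thread: owns the HashMap of senders.
    let broker = thread::spawn(move || {
        let mut registry: HashMap<u64, Sender<String>> = HashMap::new();
        while let Ok(msg) = broker_rx.recv() {
            match msg {
                Brokered::Register(id, tx) => { registry.insert(id, tx); }
                Brokered::Unregister(id) => { registry.remove(&id); }
                Brokered::Forward(id, payload) => {
                    if let Some(tx) = registry.get(&id) {
                        let _ = tx.send(payload);
                    }
                }
            }
        }
    });

    // 2./3. A worker registers its own short-lived channel: the only Sender
    // for that channel is moved into the broker's HashMap, and requests are
    // then forwarded through it.
    let (worker_tx, worker_rx) = channel::<String>();
    broker_tx.send(Brokered::Register(1, worker_tx)).unwrap();
    broker_tx.send(Brokered::Forward(1, "request".to_string())).unwrap();
    println!("worker got: {}", worker_rx.recv().unwrap());

    broker_tx.send(Brokered::Unregister(1)).unwrap();
    drop(broker_tx);
    broker.join().unwrap();
}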

To show that it panics, I include a trace I've taken:

thread '<unnamed>' panicked at 'assertion failed: `(left == right)` (left: `140671335870496`, right: `0`)', /checkout/src/libstd/sync/mpsc/shared.rs:253
note: Run with `RUST_BACKTRACE=1` for a backtrace.
thread '<unnamed>' panicked at 'sending request: "SendError(..)"', /checkout/src/libcore/result.rs:859
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any', /checkout/src/libcore/result.rs:859

Sometimes it even loses messages and can block.
I don't claim that my code doesn't contain bugs, but recv() must not panic internally.

I hope it helps...

I've improved the logging to make sure that the messages are sent correctly. I noticed a pattern: it breaks when the sender instance is moved (ownership of the only Sender instance is transferred to the HashMap). Could the shared state be in an intermediate state during the ownership transfer and lose its internal state?

DEBUG:rfail: - - - - - - - - - - - - - - - - - - - -
DEBUG:rfail: >>>>> 716 >>>>>
DEBUG:rfail: 716 - found : 0
DEBUG:rfail: 716 - request -> 0
DEBUG:rfail: 716 - response <- 0
DEBUG:rfail: 716 - found : 1
DEBUG:rfail: 716 - request -> 1
DEBUG:rfail: 716 - response <- 1
DEBUG:rfail: 716 - found : 2
DEBUG:rfail: 716 - request -> 2
DEBUG:rfail: 716 - response <- 2
DEBUG:rfail: <<<<< 716 <<<<<
DEBUG:rfail: - - - - - - - - - - - - - - - - - - - -
DEBUG:rfail: >>>>> 717 >>>>>
DEBUG:rfail: 717 - found : 0
thread '<unnamed>' panicked at 'assertion failed: `(left == right)` (left: `139883544888960`, right: `0`)', /checkout/src/libstd/sync/mpsc/shared.rs:253
note: Run with `RUST_BACKTRACE=1` for a backtrace.
thread '<unnamed>' panicked at 'receive result: RecvError', /checkout/src/libcore/result.rs:859
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any', /checkout/src/libcore/result.rs:859

Any ideas how I can detect, catch and debug it?

@deniskolodin awesome thanks so much for the recent investigation into this!

The failure here, I believe, is specifically happening with finder and rbroker (the original channel). That's the only one which would have promoted itself to the shared.rs implementation. This is also pretty likely to be related to recv_timeout, but can you confirm that?

Locally I was unable to reproduce without RUST_LOG originally, but with that env var set I've now been able to reproduce. Additionally, I was also unable to reproduce once I locally replaced recv_timeout with recv. To me that implies that the bug is likely somewhere in abort_selection. I'll point out that selection over channels (a long-unstable feature) hasn't been as thoroughly tested as send and recv, and the timeout infrastructure uses the same pieces as selection, so that may explain the instability!
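
Spelled out, the substitution is just replacing the timed loop with a plain blocking loop (a minimal sketch modelled on the reproducers in this thread, not the rfail code itself):

use std::sync::mpsc::channel;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = channel::<u32>();

    let consumer = thread::spawn(move || {
        // Diagnostic variant: a plain blocking recv() loop instead of a
        // recv_timeout() loop. With this substitution the panic could no
        // longer be reproduced locally, which points at the timeout /
        // selection path (abort_selection) rather than plain recv.
        while let Ok(n) = rx.recv() {
            println!("got {}", n);
        }
    });

    thread::sleep(Duration::from_millis(50));
    let tx2 = tx.clone(); // upgrade to a shared channel while the receiver blocks
    tx2.send(1).unwrap();
    drop(tx);
    drop(tx2);
    consumer.join().unwrap();
}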

It'd be great if you could help out investigating libstd in this regard, but I'll try to get around to taking a look too soon.

@alexcrichton I've checked it again with your suggestions. You are absolutely right: this bug appeared with recv_timeout only. I'll try to find it in shared.rs module...

FWIW, I've reduced @deniskolodin's example to a simpler one:

use std::thread;
use std::sync::Arc;
use std::time::Duration;
use std::sync::mpsc::channel;
use std::sync::atomic::AtomicUsize;
use std::sync::atomic::Ordering;

struct Barrier2 {
    c: Arc<AtomicUsize>,
}

impl Barrier2 {
    fn new() -> (Barrier2, Barrier2) {
        let a = Arc::new(AtomicUsize::new(0));
        (Barrier2 { c: a.clone() }, Barrier2 { c: a })
    }

    fn wait(&self) {
        self.c.fetch_add(1, Ordering::SeqCst);
        while self.c.load(Ordering::SeqCst) != 2 {
        }
    }
}

fn main() {
    for a in 0.. {
        println!("iter {}", a);

        let (a, b) = Barrier2::new();

        let (tx, rx) = channel();
        
        let th = thread::spawn(move || {
            a.wait();
            loop {
                match rx.recv_timeout(Duration::from_millis(1)) {
                    Ok(_) => {
                        break;
                    },
                    Err(_) => {
                    },
                }
            }
        });

        b.wait();
        thread::sleep(Duration::from_millis(1));
        tx.clone().send(()).expect("send");
        th.join().unwrap();
    }
}

In a debug build it usually trips one of the two assertions within 1000 iterations on my MacBook.

I ran into the same problem. Here's an even easier example of how to trigger the bug:

use std::sync::mpsc::channel;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = channel::<String>();

    thread::spawn(move || {
        let d = Duration::from_millis(10);
        loop {
            println!("recv");
            let _r = rx.recv_timeout(d);
        }
    });

    thread::sleep(Duration::from_millis(100));
    let _c1 = tx.clone();


    thread::sleep(Duration::from_secs(1));
}

Like @imnotbad, I ran into the same issue, with a GTK project. I reduced the problem and investigated it. My reproduction case can be found in this playpen.

The problem is the following:
We start off with a oneshot or stream channel (it's the same for both). Now we recv_timeout on it. While that receive is running, the channel is upgraded to a shared channel. Because there is a (possibly blocking) receiver, the signal_token from the oneshot channel is moved over to the new shared channel (with inherit_blocker):

let sleeper = match p.upgrade(rx) {
    oneshot::UpSuccess |
    oneshot::UpDisconnected => None,
    oneshot::UpWoke(task) => Some(task),
};
a.inherit_blocker(sleeper, guard);

This results in the wake_token being removed from the oneshot and put into the shared channel. If it were a blocking receive, this would be fine, because the next send on the shared channel is going to wake up the receiver, which will upgrade itself and receive the element on the shared receiver.
With recv_timeout, the receive call will time out and call abort_selection, which would have removed the signal_token if it hadn't been moved already. The next receive (blocking or not) after the timeout assumes that the signal_token is 0, because there can only ever be a single consumer. That is not the case due to the reasons above, resulting in this assertion panicking:

fn decrement(&self, token: SignalToken) -> StartResult {
    unsafe {
        assert_eq!(self.to_wake.load(Ordering::SeqCst), 0);

My idea for a solution to the problem would be to remove the inheritance of the wake_token from the sender side. Instead, after the sender has performed the upgrade, it should wake up the receiver, which cleans the signal_token, upgrades itself and registers a new signal_token.
While the current approach of not waking up the receiver might be slightly faster, I think that this solution is easier to implement. I would have tried to create a PR, but I'm not sure how to correctly initialize the shared channel, which is currently done like this when inheriting:

pub fn inherit_blocker(&self,
                       token: Option<SignalToken>,
                       guard: MutexGuard<()>) {
    token.map(|token| {
        assert_eq!(self.cnt.load(Ordering::SeqCst), 0);
        assert_eq!(self.to_wake.load(Ordering::SeqCst), 0);
        self.to_wake.store(unsafe { token.cast_to_usize() }, Ordering::SeqCst);
        self.cnt.store(-1, Ordering::SeqCst);
        // This store is a little sketchy. What's happening here is that
        // we're transferring a blocker from a oneshot or stream channel to
        // this shared channel. In doing so, we never spuriously wake them
        // up and rather only wake them up at the appropriate time. This
        // implementation of shared channels assumes that any blocking
        // recv() will undo the increment of steals performed in try_recv()
        // once the recv is complete. This thread that we're inheriting,
        // however, is not in the middle of recv. Hence, the first time we
        // wake them up, they're going to wake up from their old port, move
        // on to the upgraded port, and then call the block recv() function.
        //
        // When calling this function, they'll find there's data immediately
        // available, counting it as a steal. This in fact wasn't a steal
        // because we appropriately blocked them waiting for data.
        //
        // To offset this bad increment, we initially set the steal count to
        // -1. You'll find some special code in abort_selection() as well to
        // ensure that this -1 steal count doesn't escape too far.
        unsafe { *self.steals.get() = -1; }
    });

Looks like the fix PR for this issue was closed, and there hasn't been activity on this issue for a bit now. Just wanted to chime in that this is still an outstanding issue. (And I can't help but note that it is not very confidence-inspiring, safety-wise, for would-be newcomers to find crashes in the stdlib left unfixed for a long time...)

(I'm coming here from rust-windowing/winit#459 (comment) and have only skimmed this issue.)

@alexcrichton do you have any news regarding the stability of recv_timeout? If there is a reproducible bug with that, would it make sense to mention it in https://doc.rust-lang.org/std/sync/mpsc/struct.Receiver.html#method.recv_timeout?

@felixrabe this is a pretty longstanding bug by this point and no one's really had a chance to dig in and solve it. It may be worth it, yes, at this point to mention it in the documentation.

I can still reproduce this issue with the example above on rustc 1.29.0-nightly (54628c8ea 2018-07-30) on macOS 10.13.6. I'm working on a PR (#52936) to document this bug as a known issue in https://github.com/rust-lang/rust/blob/master/src/libstd/sync/mpsc/mod.rs.

As you probably know, I tried to fix this issue once (by rewriting the implementation and simplifying it), but by the time I got feedback about the lack of explanation, I had no time to work on it again (because I was in the middle of changing jobs).

But now I can revive my patch again if anyone is willing to help review it (e.g. by asking questions and pointing out badly documented parts of the code).

And a side note: the issue is particularly hard to patch because of selectors. If we could drop selector support (which is unstable now, feature mpsc_select, and will likely never be stabilized), the patch would be safer/easier. @alexcrichton, can we drop selectors now?

I'd totally be down for removing selection support! It sounds like crossbeam is definitely in a good enough place to replace it, no?

I see, servo still has 3 uses of select!, and either the crossbeam patch to servo will be merged soon, or it seems relatively easy to patch servo to use std::sync::mpsc without select!.

@stepancheg Out of interest: why is it hard to incorporate the selection mechanism / to fix the race condition with it in place? From my point of view it shouldn't be much more than an additional field containing a SignalToken on each selected channel, used instead of the single blocking receiver, with information about which channel has woken it. I can imagine the stealing logic being much more complicated.

Another general question: from what I've heard and seen about / in crossbeam-channel, it seems to do everything std::mpsc does and more, and parts of it are even more efficient. For example, crossbeam-channel was the only bounded spsc implementation I found which is lock-free when data is available. Would it be possible to replace the current mpsc implementation in std with crossbeam-channel?

@oberien

Why is it hard to incorporate the selection mechanism / to fix the race condition with it in place?

I think I did a patch which fixes the race condition and preserves the selection mechanism. It's doable. But proving (and understanding) why it is correct is harder than without select.

Would it be possible to replace the current mpsc implementation in std with crossbeam-channel?

Crossbeam may have its own bugs, may have slightly different semantics, and has parts not needed in the stdlib API. That said, I think it's possible.

But generally, I don't think that select! is a universally right way to work with multiple sources. For example, because select! does not guarantee ordering (send to channel A, send to channel B, receive from both channels, and A must come before B).

Moreover, selecting from multiple sources may be less performant than using a single channel, so select! is not a universal solution.

In my opinion, std::sync::mpsc should solve 95% of use cases (but not 100%) and be reliable. The rest is up to third-party libraries. That, again, is only my opinion.

It seems #54267 is the same issue @deniskolodin describes. I'm hitting this bug using the channel from mpsc::channel() and then running recv_timeout on it without doing anything unusual.

What's interesting: it seems I'm able to work around this bug by making sure the first call to recv_timeout doesn't time out, by sending a dummy message directly after creating the channel. This somehow prevents the channel from panicking, even if it is hitting timeouts afterwards.

As you seem to have investigated the issue quite a lot, you might be interested in my technical summary here (#39364 (comment)) if you haven't seen that already.

My guess as to why you can work around the issue by sending a message is that the Oneshot is upgraded to a Stream once the second message is sent. By sending (and receiving) a message at the very beginning, the (second) message may, in your case, be sent within the timeframe before the receiver calls recv_timeout the second time. If it called that function while the channel was upgraded (before the second message is sent), it should result in the panic. To avoid the issue (assuming you never upgrade to a shared channel), you could send two messages before calling receive for the first time. That should ensure that your "solution" is not just shotgun debugging but actually works on all machines.
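
Spelled out, that workaround looks roughly like this (a sketch only; as later comments show, it narrows the window but does not remove the race entirely):

use std::sync::mpsc::channel;
use std::time::Duration;

fn main() {
    let (tx, rx) = channel::<u32>();

    // Workaround sketch: send (and then drain) two dummy messages right
    // after creating the channel. The second send upgrades the internal
    // oneshot channel to a stream channel, so that upgrade can no longer
    // happen while a recv_timeout() call is in flight.
    tx.send(0).unwrap();
    tx.send(0).unwrap();
    let _ = rx.recv();
    let _ = rx.recv();

    // Normal use afterwards. Note this only helps as long as the channel is
    // never upgraded to a shared channel later (the Sender is never cloned).
    let _ = rx.recv_timeout(Duration::from_millis(10));
}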

@oberien thanks for the hint. I still got occasional panics when sending 1 dummy message; I'm now sending 2 dummy messages and hopefully that works around the bug.

I still hope somebody with the right skills is able to fix it; my code looks a bit awkward right now.

Sending two messages was still causing issues, so I've refactored my code base to use crossbeam-channel instead: https://docs.rs/crossbeam-channel/0.2.6/crossbeam_channel/#how-to-try-sendingreceiving-a-message-with-a-timeout

So far I haven't had any issues; I would recommend this as a solution to everybody else who runs into this bug.
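
For anyone else migrating, the change is close to a drop-in replacement (a minimal sketch assuming crossbeam-channel is added as a dependency; it provides its own recv_timeout):

use std::thread;
use std::time::Duration;

use crossbeam_channel::{unbounded, RecvTimeoutError};

fn main() {
    let (tx, rx) = unbounded::<String>();

    let consumer = thread::spawn(move || loop {
        match rx.recv_timeout(Duration::from_millis(10)) {
            Ok(msg) => {
                println!("got {}", msg);
                break;
            }
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => break,
        }
    });

    thread::sleep(Duration::from_millis(100));
    // Cloning senders here does not race with any oneshot/stream/shared
    // upgrade machinery, because crossbeam-channel has none.
    let tx2 = tx.clone();
    tx2.send("hello".to_string()).unwrap();
    consumer.join().unwrap();
}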

I'm not sure if this is the same issue or if I should open a new one, but I'm getting an extremely intermittent panic: thread 'tokio-runtime-worker-1' panicked at 'upgrading again', libstd/sync/mpsc/oneshot.rs:225:22, which happens on a clone (instead of on recv, as above).

My basic structure: the tx of a channel is cloned for each incoming request in an async hyper service, then moved along as part of a context for handling the request. A dedicated thread loops reading from the rx (Receiver::recv(), not recv_timeout), doing stuff with the sent data. I haven't been able to get a simple reproduction of the error unfortunately, and can't consistently trigger it in my service, but figured I'd comment here in case someone else comes across the same or a similar issue.
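
In simplified, synchronous form (the real code is an async hyper service on tokio; the names here are made up), the shape is roughly:

use std::sync::mpsc::channel;
use std::thread;

fn main() {
    let (tx, rx) = channel::<String>();

    // Dedicated consumer thread: owns the Receiver and drains it with a
    // blocking recv() loop.
    let consumer = thread::spawn(move || {
        while let Ok(event) = rx.recv() {
            println!("processing {}", event);
        }
    });

    // One clone of the Sender per incoming request, moved into the
    // request-handling context.
    let handlers: Vec<_> = (0..8).map(|req_id| {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(format!("event for request {}", req_id)).unwrap();
        })
    }).collect();

    for handler in handlers {
        handler.join().unwrap();
    }
    drop(tx); // last Sender gone, the consumer's recv() returns Err and it exits
    consumer.join().unwrap();
}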

Googling around I found viperscape/oyashio#3, which appears to have the same symptoms, but it's harder for me to know for sure it's the same issue, given it's in a library that wraps mpsc. For what it's worth, the reproduction in that issue (https://github.com/rohitjoshi/oyashio_test/) panics consistently.

I'm happy to include the full stacktrace if that's helpful.

@stearnsc

I'm happy to include the full stacktrace if that's helpful.

please do :)

@Centril here ya go :)

thread 'tokio-runtime-worker-1' panicked at 'upgrading again', libstd/sync/mpsc/oneshot.rs:225:22
note: Run with `RUST_BACKTRACE=1` for a backtrace.
thread 'tokio-runtime-worker-1' panicked at 'assertion failed: `(left == right)`
  left: `2`,
 right: `0`', libstd/sync/mpsc/shared.rs:504:9
stack backtrace:
   0:        0x10b1c1c9f - std::sys::unix::backtrace::tracing::imp::unwind_backtrace::hdfdb4ba75a1dcb88
                               at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1:        0x10b1af29a - std::sys_common::backtrace::print::h5823bdde4d2f8639
                               at libstd/sys_common/backtrace.rs:71
                               at libstd/sys_common/backtrace.rs:59
   2:        0x10b1c4ff3 - std::panicking::default_hook::{{closure}}::h06fc925781e26cfe
                               at libstd/panicking.rs:211
   3:        0x10b1c4d7c - std::panicking::default_hook::hdfdcc6f72e7aec0a
                               at libstd/panicking.rs:227
   4:        0x10b1c56e7 - <std::panicking::begin_panic::PanicPayload<A> as core::panic::BoxMeUp>::get::h7799216912757d62
                               at libstd/panicking.rs:475
   5:        0x10b1c528c - std::panicking::continue_panic_fmt::h2b83096170ce8c01
                               at libstd/panicking.rs:390
   6:        0x10b1c51e0 - std::panicking::try::do_call::hdceefc76f62a2c0e
                               at libstd/panicking.rs:345
   7:        0x10aa42bd9 - std::sys_common::poison::Flag::borrow::hdc33e01fc10736c1
                               at /Users/travis/build/rust-lang/rust/src/libstd/macros.rs:78
   8:        0x10aa1a714 - core::ptr::drop_in_place::h305b501f7759d2b1
                               at /Users/travis/build/rust-lang/rust/src/libcore/ptr.rs:59
   9:        0x10aa39e78 - core::fmt::Arguments::new_v1::he05aa87c2ad77613
                               at /Users/travis/build/rust-lang/rust/src/liballoc/sync.rs:528
  10:        0x10aa3a481 - core::fmt::Arguments::new_v1::he05aa87c2ad77613
                               at /Users/travis/build/rust-lang/rust/src/liballoc/sync.rs:981
  11:        0x10aa19bb4 - core::ptr::drop_in_place::h10c70cb8488cae16
                               at /Users/travis/build/rust-lang/rust/src/libcore/ptr.rs:59
  12:        0x10aa2ab4d - <core::time::Duration as core::cmp::PartialOrd>::gt::hf55976aad1a96054
                               at /Users/travis/build/rust-lang/rust/src/libstd/sync/mpsc/mod.rs:879
  13:        0x10aa11182 - <alloc::string::String as core::ops::deref::Deref>::deref::hc1da3ce597b47e6c
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/rustracing-0.1.7/src/span.rs:569
  14:        0x10aa446c0 - antidote::Condvar::new::h94f0f259c4a01b2f
                               at /Users/cstearns/workspace/backend-sdk/src/context.rs:220
  15:        0x10a5ca316 - backend::context::HeadersTracer::extract::heed8c7d5593220ee
                               at /Users/cstearns/workspace/backend-sdk/src/context.rs:193
  16:        0x10a6a43dd - <&'static geo_proto::server::ServiceWrapper<T> as hyper::service::service::Service>::call::h4088fe8e2004c32b
                               at /Users/cstearns/workspace/geo/proto/src/server.rs:789
  17:        0x10a42c78a - <hyper::proto::h1::dispatch::Server<S> as hyper::proto::h1::dispatch::Dispatch>::recv_msg::he8ce4282fe68ccae
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.12.13/src/proto/h1/dispatch.rs:405
  18:        0x10a43027f - <hyper::proto::h1::dispatch::Dispatcher<D, Bs, I, T>>::poll_read_head::ha735d16af68f7403
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.12.13/src/proto/h1/dispatch.rs:213
  19:        0x10a430dbb - <hyper::proto::h1::dispatch::Dispatcher<D, Bs, I, T>>::poll_read::h804f6c968588bc08
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.12.13/src/proto/h1/dispatch.rs:136
  20:        0x10a42d35d - <hyper::proto::h1::dispatch::Dispatcher<D, Bs, I, T>>::poll_inner::h45d5c57642442667
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.12.13/src/proto/h1/dispatch.rs:100
  21:        0x10a42ced5 - <hyper::proto::h1::dispatch::Dispatcher<D, Bs, I, T>>::poll_catch::h7d5901add4493160
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.12.13/src/proto/h1/dispatch.rs:86
  22:        0x10a42cbd0 - <hyper::proto::h1::dispatch::Dispatcher<D, Bs, I, T> as futures::future::Future>::poll::hf29c3a2a139550c0
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.12.13/src/proto/h1/dispatch.rs:348
  23:        0x10a6a2bf3 - <futures::future::either::Either<A, B> as futures::future::Future>::poll::hdbe024dc67d69328
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.23/src/future/either.rs:35
  24:        0x10a5be535 - futures::future::option::<impl futures::future::Future for core::option::Option<F>>::poll::hd40ade923f4904f9
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.23/src/future/option.rs:12
  25:        0x10a4ec976 - <hyper::server::conn::upgrades::UpgradeableConnection<I, S, E> as futures::future::Future>::poll::h6cfcedd0edcfcbb0
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.12.13/src/server/conn.rs:844
  26:        0x10a60f724 - <hyper::server::conn::spawn_all::NewSvcTask<I, N, S, E, W> as futures::future::Future>::poll::h606e46fb5ec0706b
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.12.13/src/server/conn.rs:784
  27:        0x10b1371b4 - core::num::<impl usize>::max_value::hc9bd7cea570a51dd
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.23/src/future/mod.rs:113
  28:        0x10af82244 - <bool as core::fmt::Debug>::fmt::h2884039bf412b812
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.23/src/task_impl/mod.rs:289
  29:        0x10af824ff - <bool as core::fmt::Debug>::fmt::h2884039bf412b812
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.23/src/task_impl/mod.rs:363
  30:        0x10af92168 - futures::task_impl::std::CURRENT_THREAD_NOTIFY::__init::h5c9531a18358d35e
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.23/src/task_impl/std/mod.rs:78
  31:        0x10af82387 - <bool as core::fmt::Debug>::fmt::h2884039bf412b812
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.23/src/task_impl/mod.rs:363
  32:        0x10af82151 - <bool as core::fmt::Debug>::fmt::h2884039bf412b812
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.23/src/task_impl/mod.rs:289
  33:        0x10afa7a90 - tokio_threadpool::task::TaskFuture::poll::hafc1ba2a07fcd66f
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/task/mod.rs:292
  34:        0x10afa7613 - tokio_threadpool::task::Task::run::{{closure}}::h863b8c6f21050706
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/task/mod.rs:165
  35:        0x10af87f3c - core::ops::function::FnOnce::call_once::hc850d76ffce273ed
                               at /Users/travis/build/rust-lang/rust/src/libcore/ops/function.rs:223
  36:        0x10afa52ff - <tokio_threadpool::worker::state::Lifecycle as core::cmp::PartialEq>::eq::h27cd1ed0969097cf
                               at /Users/travis/build/rust-lang/rust/src/libstd/panic.rs:313
  37:        0x10afa104c - tokio_threadpool::builder::Builder::new::{{closure}}::h71fbc3e5b855cd87
                               at /Users/travis/build/rust-lang/rust/src/libstd/panicking.rs:310
  38:        0x10b1d111e - panic_unwind::dwarf::eh::read_encoded_pointer::hee47015c59adccfd
                               at libpanic_unwind/lib.rs:105
  39:        0x10afa0e65 - tokio_threadpool::builder::Builder::new::{{closure}}::h71fbc3e5b855cd87
                               at /Users/travis/build/rust-lang/rust/src/libstd/panicking.rs:289
  40:        0x10afa5496 - crossbeam_epoch::epoch::Epoch::starting::h672ae80797b046f0
                               at /Users/travis/build/rust-lang/rust/src/libstd/panic.rs:392
  41:        0x10afa70ae - tokio_threadpool::task::Task::run::hc1949961ab52a308
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/task/mod.rs:151
  42:        0x10af803e1 - tokio_threadpool::worker::Worker::run_task2::h131d9bedd9d64183
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/worker/mod.rs:545
  43:        0x10af7febf - tokio_threadpool::worker::Worker::run_task::h082f7483774f3664
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/worker/mod.rs:445
  44:        0x10af7f774 - tokio_threadpool::worker::Worker::try_run_owned_task::h48a23cb6cf525764
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/worker/mod.rs:385
  45:        0x10af7f1e4 - tokio_threadpool::worker::Worker::try_run_task::hfccad35dab3d62a2
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/worker/mod.rs:293
  46:        0x10af7f037 - tokio_threadpool::worker::Worker::with_current::{{closure}}::h18ffed76810985e1
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/worker/mod.rs:239
  47:        0x10af41adb - tokio::runtime::builder::Builder::build::{{closure}}::{{closure}}::{{closure}}::{{closure}}::hbd11d6d37d1fbcd9
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.1.7/src/runtime/builder.rs:125
  48:        0x10af2d0d0 - std::sys::unix::mutex::Mutex::destroy::he4ed2d2d4d3fe4e9
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.5/src/timer/handle.rs:65
  49:        0x10af363f1 - <tokio::executor::current_thread::scheduler::Scheduler<U>>::tick::{{closure}}::hd0e309670d28eb9a
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/local.rs:294
  50:        0x10af3539f - <tokio::executor::current_thread::scheduler::Scheduler<U>>::tick::{{closure}}::hd0e309670d28eb9a
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/local.rs:248
  51:        0x10af2cf9b - std::sys::unix::mutex::Mutex::destroy::he4ed2d2d4d3fe4e9
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.5/src/timer/handle.rs:57
  52:        0x10af41b1c - tokio::runtime::builder::Builder::build::{{closure}}::{{closure}}::{{closure}}::h787a87c411c8ccd0
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.1.7/src/runtime/builder.rs:124
  53:        0x10af2eb7a - futures::task_impl::core::get_ptr::h447d1a909e0b1e6c
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.5/src/clock/clock.rs:136
  54:        0x10af355db - <tokio::executor::current_thread::scheduler::Scheduler<U>>::tick::{{closure}}::hd0e309670d28eb9a
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/local.rs:294
  55:        0x10af35347 - <tokio::executor::current_thread::scheduler::Scheduler<U>>::tick::{{closure}}::hd0e309670d28eb9a
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/local.rs:248
  56:        0x10af2ea9a - futures::task_impl::core::get_ptr::h447d1a909e0b1e6c
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.5/src/clock/clock.rs:119
  57:        0x10af41b65 - tokio::runtime::builder::Builder::build::{{closure}}::{{closure}}::hfbb7e55fcd9aa796
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.1.7/src/runtime/builder.rs:123
  58:        0x10af3fd4a - tokio::runtime::Runtime::inner::h9d32ac1a93f7975c
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-reactor-0.1.3/src/lib.rs:231
  59:        0x10af35dd0 - <tokio::executor::current_thread::scheduler::Scheduler<U>>::tick::{{closure}}::hd0e309670d28eb9a
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/local.rs:294
  60:        0x10af352ea - <tokio::executor::current_thread::scheduler::Scheduler<U>>::tick::{{closure}}::hd0e309670d28eb9a
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/local.rs:248
  61:        0x10af3fb56 - tokio::runtime::Runtime::inner::h9d32ac1a93f7975c
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-reactor-0.1.3/src/lib.rs:214
  62:        0x10af41c86 - tokio::runtime::builder::Builder::build::{{closure}}::h16b5f9a94ee84d9c
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.1.7/src/runtime/builder.rs:122
  63:        0x10afa6765 - tokio_threadpool::callback::Callback::call::hdcb6a35b1280c01c
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/callback.rs:21
  64:        0x10af7ecf7 - tokio_threadpool::worker::Worker::do_run::{{closure}}::{{closure}}::hc83871484e2a061f
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/worker/mod.rs:121
  65:        0x10af91764 - tokio_threadpool::thread_pool::ThreadPool::inner::he3d97c8cc1567558
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-executor-0.1.3/src/global.rs:193
  66:        0x10af8620d - <tokio_threadpool::task::blocking::State as core::cmp::PartialEq>::eq::h76e27fcd622f5037
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/local.rs:294
  67:        0x10af85d7f - <tokio_threadpool::task::blocking::State as core::cmp::PartialEq>::eq::h76e27fcd622f5037
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/local.rs:248
  68:        0x10af9165e - tokio_threadpool::thread_pool::ThreadPool::inner::he3d97c8cc1567558
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-executor-0.1.3/src/global.rs:163
  69:        0x10af7ed96 - tokio_threadpool::worker::Worker::do_run::{{closure}}::hca0272d3164bf83f
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/worker/mod.rs:119
  70:        0x10af86802 - <tokio_threadpool::task::blocking::State as core::cmp::PartialEq>::eq::h76e27fcd622f5037
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/local.rs:294
  71:        0x10af85c6c - <tokio_threadpool::task::blocking::State as core::cmp::PartialEq>::eq::h76e27fcd622f5037
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/local.rs:248
  72:        0x10af7ebe6 - tokio_threadpool::worker::Worker::do_run::h739664813d9dab9f
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/worker/mod.rs:110
  73:        0x10af7d677 - tokio_threadpool::pool::Pool::spawn_thread::{{closure}}::h4beb3ef769cda01a
                               at /Users/cstearns/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-threadpool-0.1.5/src/pool/mod.rs:417
  74:        0x10af87d77 - crossbeam_epoch::collector::Handle::is_pinned::h1a5cea1088f22d7e
                               at /Users/travis/build/rust-lang/rust/src/libstd/sys_common/backtrace.rs:136
  75:        0x10af85867 - <tokio_threadpool::task::blocking::State as core::cmp::PartialEq>::eq::h76e27fcd622f5037
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/mod.rs:409
  76:        0x10afa52c7 - <tokio_threadpool::worker::state::Lifecycle as core::cmp::PartialEq>::eq::h27cd1ed0969097cf
                               at /Users/travis/build/rust-lang/rust/src/libstd/panic.rs:313
  77:        0x10afa10c4 - tokio_threadpool::builder::Builder::new::{{closure}}::h71fbc3e5b855cd87
                               at /Users/travis/build/rust-lang/rust/src/libstd/panicking.rs:310
  78:        0x10b1d111e - panic_unwind::dwarf::eh::read_encoded_pointer::hee47015c59adccfd
                               at libpanic_unwind/lib.rs:105
  79:        0x10afa0f64 - tokio_threadpool::builder::Builder::new::{{closure}}::h71fbc3e5b855cd87
                               at /Users/travis/build/rust-lang/rust/src/libstd/panicking.rs:289
  80:        0x10afa5447 - crossbeam_epoch::epoch::Epoch::starting::h672ae80797b046f0
                               at /Users/travis/build/rust-lang/rust/src/libstd/panic.rs:392
  81:        0x10af856af - <tokio_threadpool::task::blocking::State as core::cmp::PartialEq>::eq::h76e27fcd622f5037
                               at /Users/travis/build/rust-lang/rust/src/libstd/thread/mod.rs:408
  82:        0x10af86e6f - <tokio_threadpool::task::blocking::State as core::cmp::PartialEq>::eq::h76e27fcd622f5037
                               at /Users/travis/build/rust-lang/rust/src/liballoc/boxed.rs:642
  83:        0x10b1c4837 - std::sys_common::thread::start_thread::hffea5e9f9f11b206
                               at /Users/travis/build/rust-lang/rust/src/liballoc/boxed.rs:652
                               at libstd/sys_common/thread.rs:24
  84:        0x10b1ae2c8 - std::sys::unix::thread::Thread::new::thread_start::hce603054d9b5b6fc
                               at libstd/sys/unix/thread.rs:90
  85:     0x7fff6a04e660 - _pthread_body
  86:     0x7fff6a04e50c - _pthread_start
thread panicked while panicking. aborting.
Illegal instruction: 4

I have looked at this just a little using rr, and the problem seems to arise as a result of upgrading the oneshot channel to shared.
The inherit_blocker() function writes self.to_wake with a pointer to a token:
self.to_wake.store(unsafe { token.cast_to_usize() }, Ordering::SeqCst); - This happens when the sender is cloned.

Then later, when recv is called, decrement() sees the token that was stored and panics.

@snakehand You are correct, here is a description of the cause of this error: #39364 (comment)

@oberien I am wondering if maybe it would be simpler and safe to just set to_wake to 0 here:
if let Some(new_port) = port_or_empty { unsafe { mem::swap(self.inner_mut(), new_port.inner_mut()); } }
inside recv_deadline() - this is where the RX thread detects the upgrade, and it should not be sleeping, so I am guessing it would be safe to clear to_wake here? (Not sure if it also needs to be dropped.)

I made a quick patch here: https://pastebin.com/xWNaLR5H
Setting cnt to 0 was also required to avoid another assert in decrement() - but this needs to be reviewed, as it is pretty much a fudge. With these changes the short code example runs without any panics or leaks.

Edit: Tested the longer example (22 Jun 2017) - this will frequently hang, since there is a race condition that can cause a channel drop() to hang from inconsistencies in the cnt value.

I have a fix / workaround that is more convoluted.
When recv_timeout detects that an upgrade has happened, it sets a flag in the shared channel that an upgrade has happened.
The recv function now checks for this bool, and performs a "downgrade" that fixes the to_wake / cnt / steals value before it does anything else.
This is pretty robust, even under fire from all the races that the first example throws at the code. (Dropping RX and TX at the same time as adjusting these values causes innumerable race conditions, so doing the "downgrade" during recv removes the drop of RX from the equation.)

Sample patch here: https://pastebin.com/7tuRhY87 - but it needs a cleanup / write-up.

All is still not rosy, since some new race conditions can rear their heads with, among other things, "impossible asserts" in inherit_blocker - after some 10000s of iterations.

It could be that "let sleeper = match p.upgrade(rx)" triggers a pointer swap in the RX thread that modifies the newly created shared channel before inherit_blocker gets a chance to do the asserts. This kind of makes me question the soundness of the whole mpsc lib :-(

Is this being worked on? It is quite hard to build a clean solution for gracefully shutting down multiple threads when the channels used to communicate between them randomly panic and close. I run into this issue more often than not when I start my project, but because the main thread survives, I can't even let systemd automatically restart it until it works, since the process is still going.

@hnrklssn as a workaround you can use a crossbeam channel.

@hnrklssn There is also a workaround: you can clone() the tx before doing the recv_timeout().
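
Spelled out (a sketch; the idea is that the clone forces the upgrade to a shared channel before any recv_timeout is in flight, so inherit_blocker never has a blocked receiver's token to inherit):

use std::sync::mpsc::channel;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = channel::<String>();

    // Workaround: clone the Sender once, up front, and keep the clone alive.
    // The channel is already shared before the first recv_timeout() runs.
    let _tx_keepalive = tx.clone();

    let consumer = thread::spawn(move || loop {
        match rx.recv_timeout(Duration::from_millis(10)) {
            Ok(msg) => {
                println!("got {}", msg);
                break;
            }
            // Timeout and disconnection are lumped together for brevity.
            Err(_) => continue,
        }
    });

    thread::sleep(Duration::from_millis(100));
    tx.send("hello".to_string()).unwrap();
    consumer.join().unwrap();
}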

@lnicola Yeah I just thought it would be nice for people new to Rust, like me, to be able to rely on the stdlib. The number of dependencies explodes pretty quickly when pulling new ones into a project.

@snakehand Thanks, that fixed it! I had to change the structure of my program slightly, which is fine but I do find it a bit uglier than before.

You know, this has been open for a really, really long time, and seemingly nobody is doing anything about it. Personally I feel I lack the skills to fix it myself, but perhaps some attention being brought to this issue would be nice? I realize just saying "pls fix" is not helpful, but it would be nice for the people running into this issue not to be confused about it and waste time debugging things that are breaking due to the stdlib and not their code.

Following up on the last comment, just wondering if there are any plans to address this?

I don't think so. If I recall correctly, there were some talks about adding crossbeam-channel to the standard library as an additional channel implementation and deprecating std::sync::mpsc. I believe the interfaces of the two are sufficiently incompatible that it is not possible to back the current channel interface with crossbeam-channel. In addition, I believe someone said that there are mistakes in the interface of std::sync::mpsc, or something like that. In any case, crossbeam-channel should probably be used instead of std::sync::mpsc.

Would it be worthwhile to add a note to the stdlib docs that says something to the effect of "Consider using the crossbeam crate if you are concerned about running into this bug"?

Or is the link to this issue sufficient?

Pointing to this issue seems reasonable, and then this issue can point to the most up-to-date information.

And yeah, the right answer seems like referring people to another crate, and possibly figuring out if that other crate could become the backend for mpsc. flume, crossbeam, there are a few possible options.

Could we not just fix this? Maybe it will break some stuff, but we can do it carefully, and do a lot of testing. At the end of the day, Rust is known for being safe, and perhaps reliable, and this is anything but.

This bug is mentioned in the std docs, after all...

https://doc.rust-lang.org/nightly/std/sync/mpsc/struct.Receiver.html#known-issues

"There is currently a known issue (see #39364) that causes recv_timeout to panic unexpectedly with the following example: ..."

I think people know how to fix it, and I think we just need to make a decision and do it, at this point...

#39364 (comment) supposedly has a fix for this specific bug, but it either introduces another bug of the same kind or that other bug already exists as far as I understand.

All is still not rosy, since some new race conditions can rear their heads with, among other things, "impossible asserts" in inherit_blocker - after some 10000s of iterations.

It could be that "let sleeper = match p.upgrade(rx)" triggers a pointer swap in the RX thread that modifies the newly created shared channel before inherit_blocker gets a chance to do the asserts. This kind of makes me question the soundness of the whole mpsc lib :-(

So sure, but are we happy having a big problem with something in std? Surely this is the kind of thing that needs action to fix. As I mentioned before, Rust's reputation comes from being (or striving to be) correct. And this bit isn't.

I understand there might not be an easy fix. Also, honestly, this isn't a problem for me any more. But it doesn't paint a great picture for people who are new to the language, ecosystem, community, etc...

I absolutely agree that we should fix it; my suggestion is that if at all possible we should do so by making mpsc a compatibility wrapper around some established channel library.

If that's not possible, then we should deprecate the API, and point to established channel libraries.

n1000 commented

#39364 (comment) supposedly has a fix for this specific bug, but it either introduces another bug of the same kind or that other bug already exists as far as I understand.

There was also this pull request that apparently fixed the issue (although I did not try it): #42883, but I think the changes were quite extensive and could not be fully validated.

If that's not possible, then we should deprecate the API, and point to established channel libraries.

In that case maybe we should remove all performance optimizations like those for oneshot channels to clean up the code enough that this bug will likely be fixed in the process? Or maybe even replace with a fully safe albeit much slower Mutex<VecDeque<T>>?
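
For what it's worth, that "fully safe but slower" fallback would look something like this (a sketch only: no disconnection tracking, no try_recv or timeouts):

use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

// A trivially safe channel: a Mutex<VecDeque<T>> plus a Condvar.
struct SimpleChannel<T> {
    queue: Mutex<VecDeque<T>>,
    ready: Condvar,
}

impl<T> SimpleChannel<T> {
    fn new() -> Arc<Self> {
        Arc::new(SimpleChannel {
            queue: Mutex::new(VecDeque::new()),
            ready: Condvar::new(),
        })
    }

    fn send(&self, value: T) {
        self.queue.lock().unwrap().push_back(value);
        self.ready.notify_one();
    }

    fn recv(&self) -> T {
        let mut queue = self.queue.lock().unwrap();
        loop {
            if let Some(value) = queue.pop_front() {
                return value;
            }
            queue = self.ready.wait(queue).unwrap();
        }
    }
}

fn main() {
    let chan = SimpleChannel::new();
    let tx = Arc::clone(&chan);
    std::thread::spawn(move || tx.send(42)).join().unwrap();
    println!("{}", chan.recv());
}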

Or maybe even replace with a fully safe albeit much slower Mutex<VecDeque<T>>?

That seems like a really good idea, if we can't just build the API we have atop an established channel crate.

Deprecating std::sync::mpsc sounds like a good idea.

I hope std::sync::mpsc will be deprecated only after a suitable replacement is added to std.

I hope std::sync::mpsc will be deprecated only after a suitable replacement is added to std.

...but there's precedent for not doing so in std::env::home_dir.

Deprecated since 1.29.0: This function's behavior is unexpected and probably not what you want. Consider using a crate from crates.io instead.

I hope std::sync::mpsc will be deprecated only after a suitable replacement is added to std.

It can surely be deprecated, perhaps with a link to something better as suggested in the next comment. But deprecation doesn't mean deletion - it just means that it is no longer recommended.

I think it makes sense to deprecate as soon as possible, and to add suggested workarounds as soon as possible afterwards.

In any case, the docs for this already include a link to this github issue....

If anyone is looking at this anew the description from @oberien in #39364 (comment) and their deterministic reproducer is probably the best start:

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    std::env::set_var("RUST_BACKTRACE", "full");

    let (tx, rx) = mpsc::channel::<()>();
    // same issue happens on upgrade from stream to shared
    //tx.send(());
    //tx.send(());
    //rx.recv();
    //rx.recv();
    
    let t = thread::spawn(move || {
        std::thread::sleep(Duration::from_millis(300));
        
        println!("upgrade to shared while recv_timeout on oneshot");
        let _ = tx.clone();
        // don't drop sender, otherwise second read_timeout exits early
        // with disconnected
        std::mem::forget(tx);
    });
    
    println!("first read, on oneshot");
    let _ = rx.recv_timeout(Duration::from_millis(500));
    
    t.join().unwrap();
    
    println!("second read, on shared");
    let _ = rx.recv();
}

I opened #93563, which should resolve this issue.

daira commented

Reading through this ticket, I strongly recommend that std::sync::mpsc be deprecated with a prominent warning in the API doc - probably suggesting to use crossbeam-channel. I only found this ticket by luck browsing through the docs for a function we didn't propose to use (recv_timeout), even though as far as I can see, the potential race conditions aren't limited to that function at all. There is an easily reproducible panic with recv_timeout, but the problems are with the entire library and are not avoidable.

If #93563 is going to be merged instead then fine, but there hasn't been any discussion of that since February.

8573 commented

One alternative to crossbeam-channel could be zesterer/flume#18