rust-lang/rust

std `thread::Thread::unpark` test deadlocked when run under miri

Closed this issue · 9 comments

Failed in #145384 (comment)

./x test miri --stage 2 library/std --target aarch64-apple-darwin,i686-pc-windows-msvc -- time:: sync:: thread:: env::
failures:

---- library/std/src/thread/mod.rs - thread::Thread::unpark (line 1494) stdout ----
Test executable failed (exit status: 1).

stdout:
Unpark the thread
Parking thread

stderr:
error: the evaluated program deadlocked
    --> /checkout/library/std/src/sys/pal/unix/sync/condvar.rs:47:73
     |
  47 |         let r = unsafe { libc::pthread_cond_wait(self.raw(), mutex.raw()) };
     |                                                                         ^ this thread got stuck here
     |
     = note: BACKTRACE on thread `unnamed-1`:
     = note: inside `std::sys::pal::unix::sync::condvar::Condvar::wait` at /checkout/library/std/src/sys/pal/unix/sync/condvar.rs:47:73: 47:74
note: inside `std::sys::sync::thread_parking::pthread::Parker::park`
    --> /checkout/library/std/src/sys/sync/thread_parking/pthread.rs:79:13
     |
  79 |             self.cvar().wait(self.lock());
     |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `std::thread::Thread::park`
    --> /checkout/library/std/src/thread/mod.rs:1472:18
     |
1472 |         unsafe { self.inner.as_ref().parker().park() }
     |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `std::thread::park`
    --> /checkout/library/std/src/thread/mod.rs:1102:9
     |
1102 |         current().park();
     |         ^^^^^^^^^^^^^^^^
note: inside closure
    --> library/std/src/thread/mod.rs:1501:9
     |
  10 |         thread::park();
     |         ^^^^^^^^^^^^^^

error: the evaluated program deadlocked
    --> /checkout/library/std/src/sys/pal/unix/thread.rs:393:66
     |
 393 |         let ret = unsafe { libc::pthread_join(id, ptr::null_mut()) };
     |                                                                  ^ this thread got stuck here
     |
     = note: BACKTRACE:
     = note: inside `std::sys::pal::unix::thread::Thread::join` at /checkout/library/std/src/sys/pal/unix/thread.rs:393:66: 393:67
note: inside `std::thread::JoinInner::<'_, ()>::join`
    --> /checkout/library/std/src/thread/mod.rs:1769:9
     |
1769 |         self.native.join();
     |         ^^^^^^^^^^^^^^^^^^
note: inside `std::thread::JoinHandle::<()>::join`
    --> /checkout/library/std/src/thread/mod.rs:1924:9
     |
1924 |         self.0.join()
     |         ^^^^^^^^^^^^^
note: inside `main::_doctest_main_library_std_src_thread_mod_rs_1494_0`
    --> library/std/src/thread/mod.rs:1512:1
     |
  21 | parked_thread.join().unwrap();
     | ^^^^^^^^^^^^^^^^^^^^
note: inside `main`
    --> library/std/src/thread/mod.rs:1513:3
     |
  22 | } _doctest_main_library_std_src_thread_mod_rs_1494_0() }
     |   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `<fn() as std::ops::FnOnce<()>>::call_once - shim(fn())`
    --> /checkout/library/core/src/ops/function.rs:253:5
     |
 253 |     extern "rust-call" fn call_once(self, args: Args) -> Self::Output;
     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

error: aborting due to 2 previous errors




failures:
    library/std/src/thread/mod.rs - thread::Thread::unpark (line 1494)

test result: FAILED. 235 passed; 1 failed; 2 ignored; 0 measured; 1093 filtered out; finished in 39.81s

all doctests ran in 39.94s; merged doctests compilation took 0.08s
error: doctest failed, to rerun pass `-p std --doc`
Bootstrap failed while executing `miri --stage 2 library/std --target aarch64-apple-darwin,i686-pc-windows-msvc -- time:: sync:: thread:: env::`
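
For reference, the failing doctest is the example in `Thread::unpark`'s documentation. Reconstructed approximately from the std docs (the line numbers in the backtrace above differ because of the merged-doctest wrapper), it is:

use std::thread;
use std::time::Duration;

let parked_thread = thread::Builder::new()
    .spawn(|| {
        println!("Parking thread");
        thread::park();
        println!("Thread unparked");
    })
    .unwrap();

// Let some time pass for the thread to be spawned.
thread::sleep(Duration::from_millis(10));

println!("Unpark the thread");
parked_thread.thread().unpark();

parked_thread.join().unwrap();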

cc @RalfJung in case you have any ideas whether this is a Miri problem or a std problem, or if it's only a problem when run under Miri

> in case you have any ideas whether this is a Miri problem or a std problem, or if it's only a problem when run under Miri

This is almost certainly a bug in std that only manifests when run by Miri. So... both?

Oh wait, I see apple in the target list, so I wonder if this is somehow related to the macOS RwLock: #121950

Oh! Plausible, I didn't quite make that connection.

I don't see how that could cause a deadlock though, usually it can only cause a UB error.

Also, this doesn't even seem to involve an rwlock, only a condvar, so the problematic queue isn't even involved.

I tried 10000 seeds each on rustc 125ff8a and f6d2341 and got no failure. No idea how to reproduce this problem...

I was actually able to reproduce this with -Zmiri-disable-isolation -Zmiri-many-seeds=0..10000. Seems like we need to use a "real" clock to trigger the error, not Miri's isolated fake clock. Very strange...

Of course this means the "reproducer" is entirely non-deterministic since it depends on the timing of that sleep call...

I managed to catch a trace of a stuck execution. There is a "fun" interaction with std::sys::sync::once::queue::wait: the unpark issued by the main thread gets "eaten" by the "wrong" park in the other thread. Specifically, the 2nd thread is parking while doing something with the stdout OnceLock (I think -- I don't see line numbers, but I see a method name like std::sync::OnceLock::<std::sync::ReentrantLock<std::cell::RefCell<std::io::LineWriter<std::io::stdio::StdoutRaw>>>>::initialize::<{closure@std::sync::OnceLock<std::sync::ReentrantLock<std::cell::RefCell<std::io::LineWriter<std::io::stdio::StdoutRaw>>>>::get_or_init<{closure@std::io::stdout::{closure#0}}>::{closure#0}}, !>).

The order of events seems to be something like the following (see the distilled sketch after the list):

  • 2nd thread parks (somewhere inside the stdout printing logic)
  • 1st thread calls unpark (from its stdout printing logic)
  • before the 2nd thread gets the chance to take any step, 1st thread calls unpark again (from main), which is a NOP since the state is still NOTIFIED
  • 2nd thread wakes up, and sets the state from NOTIFIED to EMPTY, fully consuming both wakeups
  • 2nd thread parks again; since the state is now EMPTY it blocks, and with no further unpark coming it will never wake up again
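
Distilled out of the stdout machinery, that interleaving can be sketched as a standalone program (illustrative only, not the actual doctest; whether it deadlocks depends on scheduling, just like the real failure):

use std::thread;

fn main() {
    let t = thread::spawn(|| {
        // 1. parks for some unrelated reason (in the real failure: inside
        //    the once-queue that initializes stdout)
        thread::park();
        // 4. wakes up and resets the state from NOTIFIED to EMPTY, thereby
        //    consuming *both* unparks below (the token is a flag, not a counter)
        thread::park(); // 5. the state is EMPTY, so this blocks forever
    });

    // 2. the unpark intended for the park in step 1
    t.thread().unpark();
    // 3. if this runs while the state is still NOTIFIED, it is a NOP
    t.thread().unpark();

    t.join().unwrap(); // deadlocks when the steps interleave as numbered
}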

I first thought this might be a bug in the unparker, but really, isn't this just a bug in the example? If you call park, you must ensure that you are in some sort of queue where someone will call unpark on you later, or someone just called unpark on you "very recently" (i.e., definitely after the most recent time you were parked). The example performs no such coordination. The docs say "unpark followed by park will result in the second call returning immediately", but crucially this is only true if there's no other park still running when that unpark runs. In the doctest, when the main thread calls unpark, it can be the case that the 2nd thread is still inside the park of some other unrelated data structure, and then the unpark gets lost.

Compare that with this code:

// Try to slide in the node at the head of the linked list, making sure
// that another thread didn't just replace the head of the linked list.
if let Err(new) = state_and_queue.compare_exchange_weak(
    current,
    ptr::from_ref(node).wrapping_byte_add(state) as StateAndQueue,
    Release,
    Acquire,
) {
    current = new;
    continue;
}

// We have enqueued ourselves, now lets wait.
// It is important not to return before being signaled, otherwise we
// would drop our `Waiter` node and leave a hole in the linked list
// (and a dangling reference). Guard against spurious wakeups by
// reparking ourselves until we are signaled.
while !node.signaled.load(Acquire) {
    // If the managing thread happens to signal and unpark us before we
    // can park ourselves, the result could be this thread never gets
    // unparked. Luckily `park` comes with the guarantee that if it got
    // an `unpark` just before on an unparked thread it does not park.

Here, crucially, the unpark that the comment refers to must have happened between us entering the queue and us calling park, which means there's no other park that could have consumed it.
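
Distilled into a standalone sketch (this is not the std-internal code, just the same pattern): the waiter re-checks a flag and re-parks until it is set, and the signaler publishes the flag before unparking. Crucially, nothing between the flag check and the park can itself park:

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

fn main() {
    let signaled = Arc::new(AtomicBool::new(false));
    let signaled2 = Arc::clone(&signaled);

    let waiter = thread::spawn(move || {
        // The condition is re-checked before every park, and nothing in
        // between can park on its own, so the unpark below cannot be
        // consumed elsewhere; spurious wakeups just loop around.
        while !signaled2.load(Ordering::Acquire) {
            thread::park();
        }
    });

    // Publish the condition *before* unparking: either the waiter's flag
    // check sees it, or the token is still pending when the waiter parks.
    signaled.store(true, Ordering::Release);
    waiter.thread().unpark();

    waiter.join().unwrap();
}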

(The docs are also worded strangely: "Because the token is initially absent, unpark followed by park will result in the second call returning immediately." But this has nothing to do with whether the token is initially absent or present.)

Cc @m-ou-se @joboet

The example in fn park has the same issue, I think: nothing guarantees that the unpark in the main thread doesn't get consumed by some other park in the 2nd thread (as the sketch below illustrates).
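
To make that concrete, here is a variant of that example (reconstructed approximately from the docs; the extra println! on the main thread is added here to make the race explicit). The println! between the flag check and the park can itself park inside stdout's once-queue, and an unpark arriving during that window is consumed there:

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

fn main() {
    let flag = Arc::new(AtomicBool::new(false));
    let flag2 = Arc::clone(&flag);

    let parked_thread = thread::spawn(move || {
        while !flag2.load(Ordering::Relaxed) {
            // If this println! races with the one on the main thread, it
            // can park inside stdout's once-queue; an unpark arriving in
            // that window is consumed right here...
            println!("Parking thread");
            // ...so this park can block forever even though the flag is
            // already set: the flag was only checked *before* the window.
            thread::park();
            println!("Thread unparked");
        }
    });

    println!("Setting the flag"); // may race with the println! above
    flag.store(true, Ordering::Relaxed);
    parked_thread.thread().unpark();
    parked_thread.join().unwrap();
}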