std `thread::Thread::unpark` test deadlocked when run under miri
Closed this issue · 9 comments
Failed in #145384 (comment)
./x test miri --stage 2 library/std --target aarch64-apple-darwin,i686-pc-windows-msvc -- time:: sync:: thread:: env::
failures:
---- library/std/src/thread/mod.rs - thread::Thread::unpark (line 1494) stdout ----
Test executable failed (exit status: 1).
stdout:
Unpark the thread
Parking thread
stderr:
error: the evaluated program deadlocked
--> /checkout/library/std/src/sys/pal/unix/sync/condvar.rs:47:73
|
47 | let r = unsafe { libc::pthread_cond_wait(self.raw(), mutex.raw()) };
| ^ this thread got stuck here
|
= note: BACKTRACE on thread `unnamed-1`:
= note: inside `std::sys::pal::unix::sync::condvar::Condvar::wait` at /checkout/library/std/src/sys/pal/unix/sync/condvar.rs:47:73: 47:74
note: inside `std::sys::sync::thread_parking::pthread::Parker::park`
--> /checkout/library/std/src/sys/sync/thread_parking/pthread.rs:79:13
|
79 | self.cvar().wait(self.lock());
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `std::thread::Thread::park`
--> /checkout/library/std/src/thread/mod.rs:1472:18
|
1472 | unsafe { self.inner.as_ref().parker().park() }
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `std::thread::park`
--> /checkout/library/std/src/thread/mod.rs:1102:9
|
1102 | current().park();
| ^^^^^^^^^^^^^^^^
note: inside closure
--> library/std/src/thread/mod.rs:1501:9
|
10 | thread::park();
| ^^^^^^^^^^^^^^
error: the evaluated program deadlocked
--> /checkout/library/std/src/sys/pal/unix/thread.rs:393:66
|
393 | let ret = unsafe { libc::pthread_join(id, ptr::null_mut()) };
| ^ this thread got stuck here
|
= note: BACKTRACE:
= note: inside `std::sys::pal::unix::thread::Thread::join` at /checkout/library/std/src/sys/pal/unix/thread.rs:393:66: 393:67
note: inside `std::thread::JoinInner::<'_, ()>::join`
--> /checkout/library/std/src/thread/mod.rs:1769:9
|
1769 | self.native.join();
| ^^^^^^^^^^^^^^^^^^
note: inside `std::thread::JoinHandle::<()>::join`
--> /checkout/library/std/src/thread/mod.rs:1924:9
|
1924 | self.0.join()
| ^^^^^^^^^^^^^
note: inside `main::_doctest_main_library_std_src_thread_mod_rs_1494_0`
--> library/std/src/thread/mod.rs:1512:1
|
21 | parked_thread.join().unwrap();
| ^^^^^^^^^^^^^^^^^^^^
note: inside `main`
--> library/std/src/thread/mod.rs:1513:3
|
22 | } _doctest_main_library_std_src_thread_mod_rs_1494_0() }
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `<fn() as std::ops::FnOnce<()>>::call_once - shim(fn())`
--> /checkout/library/core/src/ops/function.rs:253:5
|
253 | extern "rust-call" fn call_once(self, args: Args) -> Self::Output;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace
error: aborting due to 2 previous errors
failures:
library/std/src/thread/mod.rs - thread::Thread::unpark (line 1494)
test result: FAILED. 235 passed; 1 failed; 2 ignored; 0 measured; 1093 filtered out; finished in 39.81s
all doctests ran in 39.94s; merged doctests compilation took 0.08s
error: doctest failed, to rerun pass `-p std --doc`
Bootstrap failed while executing `miri --stage 2 library/std --target aarch64-apple-darwin,i686-pc-windows-msvc -- time:: sync:: thread:: env::`
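For reference, the failing doctest (from the `Thread::unpark` docs around line 1494 of `thread/mod.rs`) looks roughly like this; minor details may differ, but the structure matches the output and backtrace above:

```rust
use std::thread;
use std::time::Duration;

fn main() {
    let parked_thread = thread::Builder::new()
        .spawn(|| {
            println!("Parking thread");
            thread::park();
            println!("Thread unparked");
        })
        .unwrap();

    // Let some time pass for the thread to be spawned.
    thread::sleep(Duration::from_millis(10));

    println!("Unpark the thread");
    parked_thread.thread().unpark();

    parked_thread.join().unwrap();
}
```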
cc @RalfJung in case you have any ideas whether this is a Miri problem or a std problem, or if it's only a problem when run under Miri
This is almost certainly a bug in std that only manifests when run by Miri. So... both?
Oh wait, I see Apple, so I wonder if this is somehow related to the macOS RwLock: #121950
Oh! Plausible, I didn't quite make that connection.
I don't see how that could cause a deadlock though, usually it can only cause a UB error.
Also, this doesn't even seem to involve an rwlock, only a condvar, so the problematic queue isn't even involved.
I was actually able to reproduce this with -Zmiri-disable-isolation -Zmiri-many-seeds=0..10000. Seems like we need to use a "real" clock to trigger the error, not Miri's isolated fake clock. Very strange...
Of course this means the "reproducer" is entirely non-deterministic since it depends on the timing of that sleep call...
I managed to catch a trace of a stuck execution. There is a "fun" interaction with std::sys::sync::once::queue::wait: the unpark of the main thread gets "eaten" by the "wrong" park of the other thread. Specifically the 2nd thread is parking while doing something with the STDOUT OnceLock (I think -- I don't see line numbers, but I see a method name like std::sync::OnceLock::<std::sync::ReentrantLock<std::cell::RefCell<std::io::LineWriter<std::io::stdio::StdoutRaw>>>>::initialize::<{closure@std::sync::OnceLock<std::sync::ReentrantLock<std::cell::RefCell<std::io::LineWriter<std::io::stdio::StdoutRaw>>>>::get_or_init<{closure@std::io::stdout::{closure#0}}>::{closure#0}}, !>).
The order of events seems to be something like
- 2nd thread parks (somewhere inside the stdout printing logic)
- 1st thread calls unpark (from its stdout printing logic)
- before the 2nd thread gets the chance to take any step, 1st thread calls unpark again (from main), which is a NOP since the state is still NOTIFIED
- 2nd thread wakes up, and sets the state from NOTIFIED to EMPTY, fully consuming both wakeups
- 2nd thread parks again; since the state is now EMPTY and nobody will call unpark anymore, it'll never wake up again (see the sketch after this list)
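Here is a minimal, timing-based sketch of that state machine (not the std test itself; the sleeps are only there to make the problematic interleaving likely): two unparks delivered before any park collapse into a single token, so the second park blocks forever.

```rust
use std::thread;
use std::time::Duration;

fn main() {
    let t = thread::spawn(|| {
        // Give main plenty of time to issue both unparks first.
        thread::sleep(Duration::from_millis(100));
        thread::park(); // token is NOTIFIED, so this returns immediately
        println!("first park returned");
        thread::park(); // token already consumed, no further unpark: blocks forever
        println!("never printed");
    });

    t.thread().unpark();
    t.thread().unpark(); // NOP: the token is already set

    // Don't join (the spawned thread never finishes); just show that it is stuck.
    thread::sleep(Duration::from_millis(500));
    println!("spawned thread is still stuck in its second park");
}
```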
I first thought this might be a bug in the unparker, but really, isn't this just a bug in the example? If you call park, you must ensure that you are in some sort of queue where someone will call unpark on you later, or someone just called unpark on you "very recently" (i.e., definitely after the most recent time you were parked). The example performs no such coordination. The docs say "unpark followed by park will result in the second call returning immediately", but crucially this is only true if there's no other park still running when that unpark runs. In the doctest, when the main thread calls unpark, it can be the case that the 2nd thread is still inside the park of some other unrelated data structure, and then the unpark gets lost.
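As a sketch of the kind of coordination meant here (invented names, not the actual doctest): the unparking side sets a flag before calling unpark, and the parking side parks in a loop that re-checks the flag, so a consumed or spurious wakeup is harmless.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let go = Arc::new(AtomicBool::new(false));
    let go2 = Arc::clone(&go);

    let parked_thread = thread::spawn(move || {
        // Keep parking until the flag is set; if our unpark token was consumed
        // by unrelated code (or the wakeup was spurious), we simply park again.
        while !go2.load(Ordering::Acquire) {
            thread::park();
        }
        println!("flag observed, proceeding");
    });

    go.store(true, Ordering::Release); // publish the condition first...
    parked_thread.thread().unpark();   // ...then wake the thread
    parked_thread.join().unwrap();
}
```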
Compare that with this code:
rust/library/std/src/sys/sync/once/queue.rs, lines 258 to 279 in f041962
Here, crucially, the unpark that the comment refers to must have happened between us entering the queue and us calling park, which means there's no other park that could have consumed it.
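Roughly the same discipline as a self-contained toy (invented structure, not the real queue.rs code): the waiter enqueues itself before parking and re-checks its own flag after every wakeup, while the signaller sets that flag before unparking, so the only unpark the waiter can receive was issued after it enqueued.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

struct Waiter {
    signaled: AtomicBool,
    thread: thread::Thread,
}

fn main() {
    let queue: Arc<Mutex<Vec<Arc<Waiter>>>> = Arc::new(Mutex::new(Vec::new()));

    let q = Arc::clone(&queue);
    let waiter = thread::spawn(move || {
        let me = Arc::new(Waiter {
            signaled: AtomicBool::new(false),
            thread: thread::current(),
        });
        q.lock().unwrap().push(Arc::clone(&me)); // enqueue *before* parking
        while !me.signaled.load(Ordering::Acquire) {
            thread::park(); // stolen or spurious wakeups just loop and re-check
        }
        println!("waiter signaled");
    });

    // Signaller: wait until the waiter has enqueued, then set its flag and unpark it.
    loop {
        if let Some(w) = queue.lock().unwrap().pop() {
            w.signaled.store(true, Ordering::Release); // set the flag first...
            w.thread.unpark();                         // ...then unpark
            break;
        }
        thread::yield_now();
    }

    waiter.join().unwrap();
}
```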
(The docs are also worded strangely: "Because the token is initially absent, unpark followed by park will result in the second call returning immediately." But this has nothing to do with whether the token is initially absent or present.)