Soundness of std::sync::Once
pitdicker opened this issue · 10 comments
While working on a refactor of std::sync::Once in #65719 I stumbled across multiple soundness issues. I now believe the current algorithm that queues waiting threads in a linked list with nodes an the stack of each waiting thread is not worth the complexity.
Thread may be parked forever
Edit: this is not a valid concern, see #65796 (comment).
Code:
while !node.signaled.load(Ordering::SeqCst) {
// HERE
thread::park();
}The thread managing the waiter queue (thread 1) sets node.signaled and unparkes a thread.
The thread that wants to wait (thread 2) checks node.signaled before parking itself.
But at HERE thread 1 may set node.signaled and unpark thread 2 (which is not parked yet). Afterwards thread 2 will park itself, and never receive an unpark again.
This can be solved by using park_timeout. It does seems suboptimal to me though.
Aliasing of a mutable reference
let me = &mut node as *mut Waiter as usize;
/* ... */
// the mutable reference to me is send to another thread
let old = self.state.compare_and_swap(state, me | RUNNING, Ordering::SeqCst);
/* ... */
// We are at the same time checking `node` ourselves
while !node.signaled.load(Ordering::SeqCst) { /* ... */ }This can be solved by using shared references and interior mutability.
Use of a potentially dangling shared reference (#55005)
(*queue).signaled.store(true, Ordering::SeqCst);The waiting thread can free queue after signaled is set after a spurious wakeup. At this point store can in theory still hold a dangling reference.
There is not much the implementation of OnceCell can do here, but I suppose if will be solved in the same way as #55005.
This reason std::sync::Once does not go with the obvious implementation that uses a mutex, is because Mutex::new is not const. This is the explanation in the comments:
// So to implement this, one might first reach for a `Mutex`, but those cannot
// be put into a `static`. It also gets a lot harder with poisoning to figure
// out when the mutex needs to be deallocated because it's not after the closure
// finishes, but after the first successful closure finishes.
//
// All in all, this is instead implemented with atomics and lock-free
// operations! Whee! Each `Once` has one word of atomic state, and this state is
// CAS'd on to determine what to do.
The talk about atomics and lock-free make it sound like this implementation might be more optimal then using a mutex. But the park/unpark machinery currently relies on a mutex+condvar per thread. So instead of using one mutex, it is using as many as there are waiting threads.
Alternative implementation
While also a bit tricky, it is still possible to just use a mutex: encode a reference to an Arc<Mutex> in Once.state. The reference will be created when the Once starts running, and not be dropped until the Once is dropped.
I'd like to include the fixes in #65719, and make unother PR with an alternative implementation using a mutex. Also a thing to remember is that if #65719 ever lands, the standard library can just use the Once implementation from parking_lot.
Thread may be parked forever
Heh, I remember almost filing an issue about this as well, but I believe the code is correct. Specifically, if a thread is unparked before it is parked, it will not block. Docs
The unpark method on a Thread atomically makes the token available if it wasn't already. Because the token is initially absent, unpark followed by park will result in the second call returning immediately.
Ah, good! Ok, that crosses off one from the list, and the other is already taken care of in #65719.
Thanks for the report here @pitdicker! I'll try to give my thoughts on the problems here
Thread may be parked forever
As previously mentioned I think this isn't a bug, it's just how park/unpark works.
Aliasing of a mutable reference
This seems fine to me to tweak, I don't really know what the current state of the rules for mutable references and such are. I don't believe it's actually UB today but if we need to future-proof ourselves somehow seems reasonable!
Use of a potentially dangling shared reference (#55005)
I agree that this looks like a rough duplicate of #55005 if I understand that right. To make sure I understand the problem you're thinking about, the moment after we store true into that pointer it's now invalid memory and can no longer be touched. We still, however, technically have &AtomicBool and &Arc<AtomicBool> lying around. While "safe" today because we don't use them, the compiler doesn't know that and various aliasing rules could be trivially violated because &T doesn't actually point to valid memory.
Does that sounds like an accurate summary?
But the park/unpark machinery currently relies on a mutex+condvar per thread.
FWIW we've always designed park/unpark to literally use futexes eventually, we've just never had the need to actually change the implementation at this point.
Alternative implementation
I think the only really important thing with Once is that the fast path needs to be a quick atomic, but everything after that I think can be refactored and changed at will!
Thread may be parked forever
Please amend the issue text to mention that this was an incorrect observation -- I just lost a bunch of time re-discovering the correctness of this code before reading on here. ;)
Also, we should likely add a clarifying comment in the code. Since you are refactoring that anyway, could you include such a comment in your PR?
Aliasing of a mutable reference
That looks dangerous indeed! The question is, what is that me pointer used for in the mean time? Looks like it is shared with other threads, which fields to they access? Once the line with node.signaled is executed, that takes away the permission of me to access the signaled field.
Notice that we already have interior mutability, right? signaled is an AtomicBool.
Use of a potentially dangling shared reference
Confirmed, this is the same problem as in Arc and currently there's nothing a library can do.
Does that sounds like an accurate summary?
Yes.
Please amend the issue text to mention that this was an incorrect observation -- I just lost a bunch of time re-discovering the correctness of this code before reading on here. ;)
Also, we should likely add a clarifying comment in the code. Since you are refactoring that anyway, could you include such a comment in your PR?
Apologies! I added a commit with such a comment.
That looks dangerous indeed! The question is, what is that me pointer used for in the mean time?
It passes an Option<std::thread::Thread> to the other thread, which takes it out. So it does a write through the reference. I could change it to make a clone instead, but using a Cell seemed neater.
Things should be fixed with #65719, but I thought to create this issue to get a little more attention because that one started out with the innocent "refactor" in the title.
It passes an Optionstd::thread::Thread to the other thread, which takes it out. So it does a write through the reference. I could change it to make a clone instead, but using a Cell seemed neater.
But only the other thread accesses the thread field, so I don't see an alias violation here.
The actual problem is the next field: the enqueuing thread does node.next = , asserting uniqueness of the node local variable in terms of access to this field, but the other thread might read it through me!
Basically, all accesses in the enqueuing thread should be changed to go through a raw pointer; the same raw pointer that is also sent to the other thread:
let mut node = Waiter {
thread: Some(thread::current()),
signaled: AtomicBool::new(false),
next: ptr::null_mut(),
};
let node = &mut node as *mut Waiter; // henceforth only use thisusing a Cell seemed neater.
Please don't; there is no need for interior mutability here and it also doesn't help -- the issue is about using node, a "unique" binding, to access things; interior mutability only affects shared accesses. EDIT: Looking at your code, I realize you also changed the &mut node to &node. That could actually work but only if you also do it for the next field.