Tracking issue for improving std::sync::{Mutex, RwLock, Condvar}
m-ou-se opened this issue · 117 comments
A few weeks ago, on January 12th, the @rust-lang/libs team discussed a plan to improve `std::sync::Mutex`, `std::sync::Condvar`, and `std::sync::RwLock`.
On most platforms, these structures are currently wrappers around their pthread equivalents, such as `pthread_mutex_t`. These types are not movable, however, forcing us to wrap them in a `Box`, resulting in an allocation and indirection for our lock types. This also gets in the way of a `const` constructor for these types, which makes `static` locks more complicated than necessary.
@Amanieu presented his library, `parking_lot`, which implements the 'parking lot' structure from WebKit. Rather than letting the operating system handle any waiting queues, it manages the queues in user space in a global hash table indexed by the address of the lock. This allows for flexible 1-byte mutexes with almost no restrictions.
There has been an attempt at merging `parking_lot` into std, as a replacement for the implementation of std's locking primitives. However, many different blockers came up and that PR did not get merged. (See also my RustConf talk for some of this history.) Many of these blockers have been resolved since.
One of the problems with replacing std's lock implementations with `parking_lot` is that `parking_lot` allocates memory for its global hash table. A Rust program can define its own custom allocator, and such a custom allocator will likely use the standard library's locks, creating a cyclic dependency problem where you can't allocate memory without locking, but you can't lock without first allocating the hash table.
A possible solution is to use a `static` table with a fixed number of entries, which does not scale with the number of threads. This could work well in many situations, but can become problematic in highly parallel programs and systems. Another possible solution is to skip the global allocator and directly allocate this hash table with a native system API such as `mmap` on Unix. (See this thread for some discussion.)
In our meeting, we discussed what's natively available on various operating systems/platforms:
- On Windows, we have already switched to Windows SRW locks, which do not require allocation as they can be moved. They are small (32-bit), efficient, and `const` constructible. For Windows alone, there does not seem to be much advantage to switching to `parking_lot`, although there might be some (small) performance difference.
- On both Linux and some BSDs there's a native `futex()` syscall, which provides functionality similar to (but more primitive than) a parking lot. Nearly equivalent functionality is available in Wasm. While not trivial, it's possible to implement our synchronization primitives directly based on such a futex. This makes all the primitives small (32-bit), efficient, and `const` constructible.
- On macOS and some other Unixes there doesn't seem to be anything similar to a futex syscall. (`parking_lot` uses `pthread` for its thread parking operations on those platforms.)
- Some smaller platforms that are (partially) supported by the standard library have their own non-trivial implementations of the standard locks, which are hard to maintain and have not gotten much validation.
After some discussion, the consensus was to provide the locks as the 'thinnest possible wrapper' around the native lock APIs as long as they are still small, efficient, and `const` constructible. This means SRW locks on Windows, and futex-based locks on Linux, some BSDs, and Wasm. And for other platforms without suitable native APIs, a `parking_lot`-based implementation using one of the possible workarounds for the allocation problem.
This means that on platforms like Linux and Windows, the operating system will be responsible for managing the waiting queues of the locks, such that any kernel improvements and features like debugging facilities in this area are directly available for Rust programs.
It also means that porting the standard library to a new platform will be easier, and that maintenance of the std implementation for the less common platforms will become simpler. These platforms will now only have to implement a thread parker and can rely on a performant and correct implementation of the standard locks on top of that.
Update: We've decided to not directly use parking lot's one-byte (two-bit) locks, but instead use the equivalent of its internal `WordLock` or Windows' SRW locks, which are one pointer in size and require no global state in the program. That solves the allocation problem.
Update 2: To maintain priority inheritance of the native locks (e.g. on macOS), we've kept using pthread locks on non-futex Unix platforms, at least for now. Using #97647, we've made it possible for the locks to have a `const fn new`.
Primary goals:
- Efficient non-allocating locks
  - Windows
  - Linux
  - macOS
    - We can't avoid pthread if we want to keep features of the OS' native locks, like priority inheritance.
  - {Free, Open}BSD
- Turn `std::sync::{Mutex, Condvar, RwLock}::new` into `const` functions: #97791
- Resolve some bugs around pthreads
Possible later goals:
- Add a `Condvar::wait_rwlock` to make `Condvar` usable with `RwLock`s?
- Allow `Send`ing `MutexGuard`s to other threads?
To do:
- Relax `Condvar` requirements to allow for unboxed mutexes. #76932
- Use unboxed SRW locks on Windows.
  - Make Microsoft promise that SRW locks are indeed movable. MicrosoftDocs/sdk-api#447
    - #76645
  - Refactor `sys_common::Mutex` to have a separate `MovableMutex` type to allow unboxing on some platforms. #77147
  - Remove the `Box`es in `Mutex` and `Condvar` on Windows and Wasm. #77380
  - Remove the `Box` from `RwLock` on Windows and Wasm. #84687
  - (Start using `WaitOnAddress` and `NtWaitForKeyedEvent` in the standard library.) #77618
    - This unblocks usage of these APIs in `parking_lot` if we want to use that on Windows too. But if we just use SRW locks, this was not necessary to unblock the lock improvements.
- Use futex-based locks on Linux.
  - Start using the futex syscall to get some experience with it in the standard library. #76919
  - Implement the `futex()` syscall in Miri to allow Miri to run standard library tests. rust-lang/miri#1568
    - Also implement the bitset futex operations in Miri. rust-lang/miri#2054
  - Implement `Mutex` and `Condvar` using `futex()`.
    - Design these.
      - Experimentation: https://github.com/m-ou-se/futex-lock-experiment/
      - Investigate other implementations (glibc, musl, Windows, etc.)
        - musl: #93740 (comment)
        - glibc
          - Mutexes: #93740 (comment)
          - Condition variables: #93740 (comment)
          - Reader-writer locks: #93740 (comment)
        - Boost: #93740 (comment)
        - Windows' SRW locks: #93740 (comment)
        - Wine's SRW locks: #93740 (comment)
        - Apple Darwin libpthread: #93740 (comment)
        - FreeBSD's libpthread: #93740 (comment)
      - Make some design decisions: #93740 (comment)
    - Implementation: #95035
  - Implement `ReentrantMutex` using `futex()`: #95727
  - Implement `RwLock` using `futex()`: #95801
- Use the same `ReentrantMutex` on all platforms: #96042
- Use futex-based locks on *BSD.
  - Add NetBSD's FUTEX_* constants to the libc crate. rust-lang/libc#2762
  - Add OpenBSD's futex() to the libc crate. rust-lang/libc#2761
  - Add FreeBSD umtx functions and constants to the libc crate. rust-lang/libc#2770
  - Add DragonFlyBSD's umtx_* functions to the libc crate. rust-lang/libc#2763
  - Switch *BSD over to the same futex locks as on Linux. #96510
- Use futex-based locks on all other platforms that support a futex-like API.
  - Use futex-based locks on Emscripten. #96205
  - Use `atomic.wait`/`atomic.notify` based locks on Wasm. #96206
  - Use zx_futex_wait based locks on Fuchsia.
    - (Tier 2 platform.)
  - Use syscall::call::futex based locks on Redox.
    - (Tier 3 platform.)
  - Use futexes on Hermit. #101475
- Lazily allocate on platforms where we still use pthread: #97647
  - macOS
  - NetBSD, other unix
- Make the `new` functions `const`. #97791
- [future / nice to have] Use 'WordLock' (aka SRW lock, aka 'user space linked-list queue lock') on other platforms.
- Mark this issue as fixed: #85434
- Write some blog post about all this. (#93740 (comment))
- Celebrate. 🎉
Did you investigate using `os_unfair_lock` (Apple Developer Docs) on macOS? It can't be moved once it is in use either, but it is essentially just a struct with a u32 in it, initialized to 0 before use. But maybe I am missing some requirements here. If it seems possible, investigation of it could be added as a subtask.
It is available since macOS 10.12 on x86-64, but could be used unconditionally on aarch64.
Never mind, it does not have timeouts for `os_unfair_lock_lock`/`os_unfair_lock_trylock`, so it is obviously out.
I don't think we need to use hash tables like parking lot does for targets without futex support. We could just store the queues inside the Mutex. I think the only downside is size, but that'll still be better than the current boxed pthread mutex.
Nevermind it does not have timeouts for os_unfair_lock_lock/os_unfair_lock_trylock, so it is obviously out.
Also, it doesn't support `Condvar`, which is required for Rust locks.
Oh. This is a subject extremely near-and-dear to my heart -- I even started on a pre-RFC for atomic wait/wake ops, like the ones that C++ got recently. That said, much of my motivation was hoping it would find a way to unstick us here, and this is clearly a higher priority.
For futex APIs, I wrote a blog post at one point about some of the options for system APIs, which might be helpful as a reference: https://shift.click/blog/futex-like-apis
That said, I got asked about Darwin's `__ulock_*` API in a Zulip DM, since I wrote that post as well as the `ulock-sys` crate.
For Darwin, I think it's quite unlikely we could use `__ulock_{wait,wake}` for a futex API directly without risking... many problems. The fact that it may break in the future is actually one of the less concerning issues, compared to the fact that use of private APIs can cause rejection from app stores, which would be a problem for Rust code on the iOS/Mac App Stores (for clarity: I'm just going on a hunch[^1] that this could happen for `__ulock_*`, I do not know for certain, but I think libstd needs to favor caution here).
That said, there's... a pretty cursed option for a workaround on these targets. Too cursed to use? Not sure... possibly.
That is: `libc++`'s implementation of `std::atomic<u32>::wait`, `notify_one`, and `notify_all` will ultimately bottom out to calls to `__ulock_wait`/`__ulock_wake`, which... I guess they get to use because libc++ gets shipped and updated with the OS on Darwin, fair enough. Anyway, in theory we could just call that somehow. And there's a sense in which this would be pretty nice, since it's a system-provided function that implements (something similar to) the API we want... But it's certainly not ideal for a few reasons:
- It's unclear to me how much we promise about what we link to on various platforms, but it's possible we cannot do this, and likely that we do not want to... I don't know if we make promises about what we need to link against for libstd, but suddenly dynamically linking against the C++ standard library on some platforms... feels like it would be a change.
- Not to mention -- even if we are allowed to do this, it's still a pain in the butt: The version of libc++ that provides this is much newer than our platform guarantees, so we'd need to weak-import[^2] the dubiously-public[^3] symbol that backs this stuff under the hood (although I suppose this is better than requiring a C++ compiler to build Rust code for these platforms).
So, maybe? Given that the OS provides `libc++`, it seems like it should technically be fine (unless we specifically say we won't). At least, so long as we manage to do it in a way that doesn't break on the older OS versions we still care about. But maybe I've missed something (or I haven't, but it's still several bridges too far).
(If not, oh well, an API shaped mostly like that is bound to end up as part of libc someday, after it's copied to C from C++ in C9Z.)
(EDIT: Thinking about it more, it's fairly likely we'd need some justification for this approach, since looking at the libc++ code, enough happens between what our call would be and the underlying wait that it could defeat the point. And the effort spent here to make it work would be better spent on improving our fallback.)
All that said, there are a few options on various platforms that could probably help implement the fallback thread parker, in case there are regressions (which plausibly there wouldn't be if we use `parking_lot_core` as-is, but maybe there would be if we need to abandon the growth behavior). In particular:
- A few of the futexless unices (NetBSD, the Solaris-likes, perhaps others?) have `lwp_park` and `lwp_unpark` functions, which are basically just an OS primitive for thread parking and unparking.
- On Darwin (yes, yes, more Darwin), I think we could probably use libdispatch's dispatch_semaphore instead, as it's been the target of a lot of optimization -- probably more than the currently-active versions of `pthread_mutex_t`/`pthread_cond_t`. Not to mention, you'd expect that a one-piece semaphore can be implemented more efficiently than a condvar/mutex pair, at least... in theory[^4]. At least, unless the unified semaphore has to uphold extra guarantees (as it does for POSIX's `sem_t`, I believe).
- Hermit (and perhaps only Hermit, but perhaps others) only has a semaphore, and we implement the other primitives on top of it. I don't have a vested interest in Hermit, but I would really like any excuse to remove https://github.com/rust-lang/rust/blob/master/library/std/src/sys/hermit/condvar.rs which looks like it's implementing the same algorithm it cites, and seems like it could be buggy[^5].
That said, a lot of this should probably be in a tail of follow-ups, and only if it's worthwhile.
Footnotes
[^1]: I mean, if nothing else they start with `__` and don't appear in a single public header (they only appear in the headers inside Apple source repos). So, even though these APIs don't seem to be in the blacklists that prevent you from even attempting app store submission (which is not a guarantee), it seems completely out of the question to use them in libstd (even if I do think they're cool, and I do).
[^2]: Calls to `dlopen`/`dlsym` tend to be disallowed in the same situations, since tooling can't tell if you're using them to use private APIs. (We already do this all over libstd though, so maybe it's not the issue it once was.)
[^3]: Which isn't API-compatible, but presumably has to stay around for ABI reasons. I mean, it must, right? C++ standard libraries are famously resilient to ABI issues, after all...
[^4]: That is, at worst the single-piece semaphore could just be implemented in terms of a private condvar and mutex pair, resulting in the same performance. That said, relying too heavily on this style of logic is a good way to get unpleasant surprises, so who knows.
[^5]: For example, the load-then-sub in `notify_one` seems like it has a race that could lead to a deadlock, but there may be a reason this is impossible -- I've never spent that long thinking about it, it just seems dubious, and like the kind of code that'd be real nice to replace with something that (if nothing else) will get more attention.
On both Linux and some BSDs there's a native futex() syscall, which provides functionality similar (but more primitive) than a parking lot. Nearly equivalent functionality is available in Wasm. While not trivial, it's possible to implement our synchronization primitives directly based on such a futex. This makes all the primitives small (32-bit), efficient, and const constructible.
Please make sure you implement a fair or eventually fair (this is what parking lot does and needs the hashmap for) lock to prevent starvation.
I don't think pthread mutexes are fair so the new implementation shouldn't be required to be fair either.
SRWLOCK is also unfair, and `pthread_mutex_t` depends (POSIX doesn't require it, Apple has it, I believe glibc doesn't)... So this would mean we'd need a parking lot everywhere, even if the OS has otherwise good locks (like Windows' SRWLOCK).
More generally: There are a bunch of locking features that would be nice to have in an ideal world, but I think they start becoming contradictory, or only achievable easily on one platform or another -- for example, I believe the parking-lot style of lock won't respect OS priorities (unless there's a clever trick I don't know about or something). Trying to solve all of this probably has no solution, and I'm not sure that we're in the right position to decide which things we want to promise yet[^1]... assuming we even want to make these sorts of promises (and I'm not sure we do).
Footnotes
[^1]: I feel like the place for that is after we ship it, if/when bugs come in from users and such. Or at least when we're further along in the implementation.
Apple switched the default from PTHREAD_MUTEX_POLICY_FAIRSHARE_NP to the newly introduced PTHREAD_MUTEX_POLICY_FIRSTFIT_NP in macOS 10.14 (iOS and tvOS 12.0, watchOS 5.0), so they are not fair for recent macOS either (by default).
Wouldn't implementing lock algorithms on top of futex et al. effectively mean that both the `parking_lot`-based implementations and these other algorithms receive less test/use coverage overall? Or is there some plan to share the gnarly implementation details between the parking_lot-based implementations and the futex/etc. ones?
@nagisa Yes. That's an issue with all platform-specific code, I suppose. But we do plan to also have the CI test the generic implementation (parking lot) on Linux as well, so it gets test coverage.
I'm now reading through various popular lock implementations to see how they are implemented.
Starting with musl.
Mutexes
Musl's `pthread_mutex_t` is 40 bytes (on 64-bit), although for regular/normal mutexes, only two 32-bit integers are relevant/used: the `lock` futex and the `waiters` counter. (Pthread mutexes support various attributes and different types which are all runtime settings, and thus stored inside the mutex itself. Depending on the type, other fields in the mutex are used. In Rust, however, different types are tracked in the type system, not dynamically inside the mutex.)
Its implementation is relatively simple. The `lock` integer contains just two bits: one to indicate whether the mutex is locked, and one to indicate whether it's contended. The contended bit can only be set when the lock bit is set as well. Locking tries to compare-and-swap 0 to 'locked' (spinning 100 times if necessary), and otherwise blocks on a 'futex wait' on the lock. Before waiting on the futex, the high bit of the lock is set and the waiter count is incremented. After waking (whether spurious or not) the waiter count is decremented.
When unlocking, a 0 is unconditionally written to the lock, and when the high bit was set or the waiter count was nonzero, one thread is woken through the futex.
It's not entirely clear to me how useful the waiter count is. This mechanism would also work correctly with only the 'contended' bit. The waiter count is only used by the spin loop while locking: when there are other waiters, it skips spinning and sleeps directly, to avoid stealing the lock in front of them. This makes things a little bit more 'fair'. See commit f5fb20b.
Condition variables
Musl's `pthread_cond_t` is 48 bytes (on 64-bit), although for 'normal' condition variables, only two pointers and a 32-bit lock/futex are used.
The lock protects a doubly linked list pointed at by the two pointers. This doubly linked list is the waiting queue, where each node is stored on the stack of a waiting thread. Each node contains a state (waiting, signalled, or leaving) and a futex that the thread is waiting on. `pthread_cond_signal` wakes up the first thread on the list that wasn't signalled yet. `pthread_cond_broadcast` marks all nodes as 'signalled', but only wakes up one thread. That thread, after being woken up (and locking the mutex), will requeue the next signalled thread onto the mutex. That way, there's no need to store a pointer to the mutex in the condition variable, as `pthread_cond_broadcast` doesn't have to know about the mutex. It only requeues one thread. When that thread is woken up, it will requeue the next one, and so on.
Reader-writer locks
Musl's `pthread_rwlock_t` is 56 bytes (on 64-bit), although it only uses two 32-bit integers in the normal case: one `lock` futex, and one `waiters` count. Its implementation is very similar to `pthread_mutex_t`, except 31 bits of the lock futex are used to count the number of active reader locks. The maximum value is used to indicate a writer lock. The other bit is used as the 'contended' bit, and is used for both readers and writers. It does not prioritize writers, allowing more readers to lock the lock even if there are writers waiting. The reason for that is explained in commit 50304f2: Posix requires reader locks to be recursive. Prioritizing writers could make a second reader lock on the same thread block.
Read-unlocking a contended lock will wake up one thread. Write-unlocking a contended lock will wake up all threads. The latter could cause another writer thread and many reader threads to wake up at the same time, potentially resulting in the writer thread winning the race and all the reader threads going back to sleep again.
The `waiters` counter has the same purpose as in `pthread_mutex_t` above.
In general it's interesting to see how many implementation details are relevant to Posix, but not to us here in Rust. For example, commit c68de0b mentions how Posix allows destroying and unmapping a mutex even if another thread hasn't returned from `pthread_mutex_unlock` yet. In Rust, this is impossible due to borrow rules. And on top of things like that, Posix supports more types of locks (recursive, process-shared, priority-inheriting, ..) all within the same type, which we do not need to support (within the same type).
Tl;dr: Musl uses trivial rwlocks and mutexes, except it separately counts waiters to avoid spinning when there are other waiters. Rwlocks don't prioritize writers because Posix wants reentrant reader locks. Condition variables are not trivial; they keep a waiter queue as a linked list in user space, and requeue threads one by one.
Next: glibc's mutexes. (Glibc's condition variables and rwlocks are a bit more complicated. I'll come back to those later.)
Mutexes
Glibc's mutexes are 40 bytes and support a lot of different properties. The 'normal' kind (`PTHREAD_MUTEX_TIMED_NP`), however, is very simple, and only uses three 32-bit fields: the lock futex, the owner thread id, and the number of users. The last two fields are useful for debugging and deadlock/misuse detection, but not used for the basic functionality of the mutex. (The `n_users` field represents the current number of threads that hold the lock (0 or 1), except it doesn't get decremented when `pthread_cond_wait` temporarily unlocks the mutex, which means that those waiting threads are also included in this number.)
The normal mutex, unlike the many other variants, is very simple. It has three states. 0 for unlocked, 1 for locked without waiters, and 2 for locked with waiters. Locking tries to compare-and-swap 0 to 1, or otherwise swaps in a 2 and futex-wait's for things to change. Unlocking swaps a 0 in place, and futex-wake's one waiting thread if it was set to 2. Unlike musl's implementation, it does not try to spin at any point, and directly switches to sleeping if the compare-and-swap fails.
It's basically identical to the one I wrote in my experiment: https://github.com/m-ou-se/futex-lock-experiment/blob/ab0e9a135aff5f3b2e01218a3281cbc552d76653/src/lib.rs#L30-L109
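For readers following along, here is a minimal sketch of that three-state futex mutex in Rust. This is not the std or glibc code, just an illustration of the algorithm described above; it assumes Linux and the `libc` crate for the `futex(2)` syscall, and it leaves out spinning, timeouts, and error handling.

```rust
use std::sync::atomic::{
    AtomicU32,
    Ordering::{Acquire, Relaxed, Release},
};

pub struct FutexMutex {
    // 0: unlocked, 1: locked (no waiters), 2: locked (possibly contended)
    state: AtomicU32,
}

impl FutexMutex {
    pub const fn new() -> Self {
        Self { state: AtomicU32::new(0) }
    }

    pub fn lock(&self) {
        if self.state.compare_exchange(0, 1, Acquire, Relaxed).is_err() {
            // Contended path: mark the lock as contended and sleep until it
            // is released. swap() returning 0 means we just took the lock.
            while self.state.swap(2, Acquire) != 0 {
                futex_wait(&self.state, 2);
            }
        }
    }

    pub fn unlock(&self) {
        if self.state.swap(0, Release) == 2 {
            // There may be waiters; wake one of them up.
            futex_wake_one(&self.state);
        }
    }
}

fn futex_wait(futex: &AtomicU32, expected: u32) {
    // Returns immediately if *futex != expected; otherwise sleeps until woken
    // (or spuriously). Errors such as EINTR/EAGAIN are ignored in this sketch.
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            libc::FUTEX_WAIT | libc::FUTEX_PRIVATE_FLAG,
            expected,
            std::ptr::null::<libc::timespec>(),
        );
    }
}

fn futex_wake_one(futex: &AtomicU32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            libc::FUTEX_WAKE | libc::FUTEX_PRIVATE_FLAG,
            1,
        );
    }
}
```

Note how the unlock path only does a syscall when the state was 2, which is exactly the "avoid the wake syscall on an uncontended unlock" behaviour discussed later in this thread.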
- Has anyone looked into the condition variable in Darwin's libpthread? It works with a futex-based API (ulock), uses 64 bits of space, doesn't do extra wakes or require requeueing, and is simpler than glibc's condition variable implementation.
- Would `os_unfair_lock` be possible for Darwin systems? It supports priority inheritance, is 32 bits large and movable, but is only available on macOS 10.12+.
- For non-futex/Windows platforms, a thread parker + parking_lot's WordLock could be used for the Mutex. There's also Windows' CONDITION_VARIABLE and SRWLOCK source to look into, and some inspection suggests the latter is very similar to WordLock (which allows it to be a usize in space) but with more complications.
Would os_unfair_lock be possible for darwin systems? It supports priority inheritance, is 32bits large and movable, but is only available on macos 10.12+
Lack of timed wait is a problem, and libstd supports back to 10.7+ (and I'm not sure what's required to reconsider this).
Ah, fair point for the 10.7+ requirement. Does the new `Mutex` impl need timed wait? SRWLock and the current API don't support it.
It's supported via condvar: https://doc.rust-lang.org/std/sync/struct.Condvar.html
Continuing with glibc's condition variables:
Condition variables
Glibc's condition variables are 48 bytes, and all of them are used in normal operation. Unlike musl's implementation, this one does not keep a queue in user space, as the same implementation is used for process-shared condition variables, which cannot refer to any memory addresses that might be different in different processes.
It keeps a 64-bit sequence number that's incremented by every new waiter, and a second 64-bit number that indicates which of those have been signalled. Within the structure, there are futexes and some other fields for exactly two 'groups' of waiters: G1 and G2. Waiters add themselves to G2, and signals wake up threads in G1. When G1 is empty, signalling first swaps the two groups. This is a workaround for bug 13165, which is about time-traveling signals that end up waking future waiters rather than current waiters.
Two bits of the condition variable represent its internal lock (a trivial three state mutex), which protects the state of the condition variable.
There is some additional complexity about destroying a condition variable while there are still waiters. However, this is irrelevant for our Rust implementation, since we don't allow dropping things that are still in use.
This implementation raises the following questions for a Rust implementation:
- Do we need a 64-bit counter? A futex (on Linux and BSD, today) is 32-bit, so having 64-bit counters makes things more complicated. A 32-bit counter could potentially overflow all the way until it reaches the same value while unlocking the mutex in `condvar.wait()`, although that seems extremely unlikely.
- Does a trivial (single futex) condition variable implementation suffer from the same 'time traveling signal' bug? If so, do we consider that a bug as well in Rust?
More on glibc's condition variables:
Their old design before the bugfix for the time-traveling signals is documented here in pseudocode: https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/DESIGN-condvar.txt;h=4845251c7586c9667cd33c38af3701910d65eaa8;hb=c0ff3befa9861171498dd29666d32899e9d8145b
It keeps a (64-bit) counter for the number of signals that have been sent, and another for the number of threads that have been woken up. The difference between those two is effectively the number of signals ('tokens') that can still be consumed. If a waiting thread wakes up from the futex-wait operation and there are no tokens left (i.e. the counters are equal), it goes back to sleep. That's how a future thread can consume a token and prevent another waiting thread from waking up from `pthread_cond_wait`.
llvm libc for comparison:
https://github.com/llvm/llvm-project/blob/main/libc/src/threads/linux/CndVar.h
Finishing up glibc:
Reader-writer locks
Glibc's `rwlock` is documented here. It is 32 bytes, of which 22 are used during normal operation. It contains two futexes, a readers counter, a writers counter, the thread ID of the current writer, the selected mode, and a flag indicating whether it is process-shared or not. It supports three different modes: a) prefer readers (default), b) prefer writers with recursive reader locks, c) prefer writers without recursive reader locks. Only in the last mode do waiting writers block additional reader locks.
Part of its complexity is due to supporting these different modes within the same structure. Another part of the complexity is due to posix allowing destruction in cases where the Rust borrow checker would not allow it. Both of these are not relevant to our implementation in Rust.
The reader counter also contains three state bits. Using it directly as a futex is avoided, in part because that number can rapidly change as readers come and go. Therefore, one of the state bits (whether the lock is in reader or writer mode) is duplicated in a separate futex.
It also includes a mechanism through which writers can hand over a writer lock to each other without unlocking, which is only used if the lock is set to a mode that prefers writers.
It has nine states, each of which is either part of the 'read phase' (1 through 4a) or the 'write phase' (5 through 8). Even the idle state is split into both a 'read phase' (1) and a 'write phase' (5) idle state. It's entirely symmetrical, except for one more state (4a) in the read phase which is used in the non-recursive writer-preferred mode to block new readers even when the lock is already read-locked. The transition from one of the idle states to the locked state in the other phase (2 or 7) goes through an intermediary state (3 or 6), during which a thread trying to lock the lock in the 'current phase' (read/write) can prevent the phase change and steal the lock. (This intermediary state is not used for the 'try lock' functions.) The locked states (2 and 7) are split into two states where one indicates there are waiters of the opposite phase (4/4a and 8).
Below is a state transition diagram of this lock. The dashed arrows are only used if writers are preferred. The thin arrows are only used by the 'try lock' functions.
The thread ID of the current writer that's stored inside the lock is only used for debugging and deadlock detection when the holder of a write lock tries to acquire a read lock.
Next up: Wine
We don't have the source code to Microsoft's SRW locks and condition variables, but we do have the source code to Wine, which implements these as well. However, while they have to be the same size as Microsoft's implementation, they are implemented very differently and you'll see very different bit patterns in them if you inspect them while they are in use.
Reader-writer locks (and Mutexes)
On Windows we use SRW locks for both `RwLock` and `Mutex`, so there's no separate mutex implementation we're looking at. They are the size of a pointer, but Wine's implementation only uses 32 bits, regardless of pointer size. It is split into two 16-bit counters: the number of active reader locks ('owners'), and the number of waiting writers. The reader/owner counter is set to `u16::MAX` when it is write-locked.
The lock and unlock functions are very simple. Write-locking increments the waiter counter (using a 16-bit operation), and then uses a 32-bit CAS operation on both counters at once to acquire the lock if the owner counter is zero, by setting it to `u16::MAX` and decrementing the waiter counter again. If the owner counter wasn't zero, it'll do a futex[^1] wait to wait until it is.
Read-locking uses a CAS operation to increment the number of owners only if there's no waiting writers and the lock isn't write-locked. So, waiting writers block new readers. If the conditions aren't met, it futex-wait's until they are.
Interestingly, two different addresses are used within the SRW lock as two different futexes. Readers wait on the address of the SRW lock itself and use the full 32-bit value, while writers wait only on the second half which contains the 16-bit owner counter. Unlocking the last reader lock will wake a single thread from the writer queue. Unlocking a writer lock will wake a single writer if there was any waiting (indicated by the waiting counter), or wake all readers if there are no waiting writers.
Mixing 16-bit and 32-bit atomic operations on overlapping memory is not common, and the Intel manual (vol. 3A) seems to suggest this is not a great idea:
Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths.
Also, surprisingly, no overflow checks are done. If either of the counters overflows, unexpected things happen.
Condition variables
Wine's condition variables are trivial. Just like SRW locks, they are the size of a pointer, but Wine's implementation only uses 32 bits. It is simply a 32-bit counter that gets incremented on every notification. It does not keep track of which lock it is used together with, nor does it make use of requeueing. It does not keep track of any information to prevent a futex-wake operation when there's no thread waiting, etc. It is identical to `Condvar1` in my experiment.
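For comparison, here is a rough sketch of such a counter-based condition variable in Rust (in the spirit of `Condvar1` from the experiment linked above, not Wine's actual code). It is paired with std's own `Mutex` just to keep the sketch short, reuses the `futex_wait`/`futex_wake_one` helpers from the mutex sketch earlier in this thread, and leaves out timeouts and error handling.

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::sync::{Mutex, MutexGuard};

pub struct TrivialCondvar {
    // Incremented on every notification; waiters sleep on this futex.
    counter: AtomicU32,
}

impl TrivialCondvar {
    pub const fn new() -> Self {
        Self { counter: AtomicU32::new(0) }
    }

    // `guard` must belong to `mutex`; as with any condvar, callers should
    // re-check their predicate in a loop, since wakeups can be spurious.
    pub fn wait<'a, T>(
        &self,
        mutex: &'a Mutex<T>,
        guard: MutexGuard<'a, T>,
    ) -> MutexGuard<'a, T> {
        // Snapshot the counter while still holding the mutex, so that a
        // notification sent after we unlock is never missed: it either bumps
        // the counter (and futex_wait returns immediately) or wakes us.
        let observed = self.counter.load(Relaxed);
        drop(guard);
        futex_wait(&self.counter, observed);
        mutex.lock().unwrap()
    }

    pub fn notify_one(&self) {
        self.counter.fetch_add(1, Relaxed);
        futex_wake_one(&self.counter);
    }

    pub fn notify_all(&self) {
        self.counter.fetch_add(1, Relaxed);
        futex_wake_all(&self.counter);
    }
}

// Like futex_wake_one from the mutex sketch, but waking every waiter.
fn futex_wake_all(futex: &AtomicU32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            libc::FUTEX_WAKE | libc::FUTEX_PRIVATE_FLAG,
            i32::MAX,
        );
    }
}
```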
Footnotes
[^1]: They are not called 'futexes' on Windows, but the operations are identical: `WaitOnAddress`, `WakeByAddressSingle`, and `WakeByAddressAll`.
Next up: Windows
Now this one is a bit tricky because the source code isn't available, but I made some guesses and have good reason to believe they are accurate.
Reader-writer locks (and Mutexes)
On Windows we use SRW locks for both `RwLock` and `Mutex`, so there's no separate mutex implementation we're looking at. SRW locks are one pointer in size, of which the last four bits encode four flags. The other bits either represent a (60-bit or 28-bit) counter, or they represent a (64-bit or 32-bit) pointer with the last four bits missing. That pointer points to something 16-byte aligned, to make sure the last bits are always 0.
When there's no contention, the number of active read locks is stored in the counter. For a write lock, the counter is 0. The least significant bit is used to indicate whether the lock is locked (regardless of read or write mode). An unlocked lock is entirely zero.
When there are threads waiting, that is indicated by one of the four flag bits, and the counter is no longer used and instead used as a pointer to a doubly linked list of waiting threads. The nodes of that list are stored on the stack of the waiting threads. Each node also contains a pointer to the last node in the queue, so you can quickly get there from the first pointer. (The lock only stores a pointer to the first node.)
The last node contains the number of active read locks, since the pointer took the place of that counter within the lock itself.
Each node contains the thread id of the waiting thread, and an additional flag that indicates whether the thread is waiting to acquire a read or write lock.
The thread id is used with the undocumented `NtWaitForAlertByThreadId` and `NtAlertThreadByThreadId` APIs. These seem to be effectively identical to Rust's `thread::park()` and `Thread::unpark()` functions. So, SRW locks are not futex-based, but thread parker based.
I'm assuming some of the flag bits in the lock are used as a mutex to protect modifications to the queue.
A waiting writer on a read-locked lock prevents new readers from acquiring the lock, to avoid writer starvation, but it doesn't exactly prefer writers in general: it ~roughly attempts to handle the requests in order. SRW locks are not fair.
Like most lock implementations, it first spins for a while before going to sleep, to avoid expensive syscalls when the lock is only locked for a very short time. Spinning seems configurable through some global. It also seems to support an alternative to spinning based on the `monitorx` and `mwaitx` instructions.
Condition variables
Condition variables on Windows are also a single pointer. They work very similarly to SRW locks, storing a pointer with flag bits in the least significant bits, with the pointer pointing to a linked list of waiting threads. The nodes even have the same structure as the nodes used by SRW locks.
`WakeAllConditionVariable` iterates over the entire list and wakes all threads. It does not requeue waiters to the relevant lock.
Here's a diagram showing an SRW lock in different states:
A condition variable looks nearly identical, except for the locked states and the four flag bits.
tl;dr: SRW locks and condition variables on Windows are thread parker based and keep their waiting queue as a linked list in user space through nodes on the stack of the waiting threads. Thread parking is natively supported on Windows through an undocumented API. They prefer writers to avoid writer starvation. They spin before sleeping, and can use the special `monitorx` and `mwaitx` instructions as an optimization.
And then we have: boost (on Posix)
Mutexes
Boost's mutexes (`boost::mutex`) on Posix platforms are simply a wrapper around `pthread_mutex_t`. (So, using glibc, these are 40 bytes.) (Implementation here.)
Condition variables
Boost's condition variables (`boost::condition`) on Posix platforms are a wrapper around `pthread_cond_t`, but with an additional `pthread_mutex_t`. (So, using glibc, these are 88 bytes.) The mutex is used to allow for interruptions/cancellation. (Implementation here.)
Waiting on a condition variable is an interruption point. However, if waiting would result in a `pthread_cond_wait` on the mutex that the user wants to wait with, boost would not be able to interrupt that wait while that mutex is still locked, since `pthread_cond_wait` will only return after locking that mutex. So, instead, boost's condition variable waits not on the mutex the user provided, but on its own internal mutex. To avoid race conditions, that mutex is locked on every operation, including when sending notifications.
Reader-writer locks
Boost's reader-writer locks (`boost::shared_mutex`) store an 8-byte state protected by a mutex, together with three condition variables. (So, using glibc, these are absolutely huge: 312 bytes, spanning five cache lines.) (Implementation here.)
The state contains a 32-bit reader lock counter, and three boolean flags: writer locked, writer waiting, and upgrade. (The last one is only used for upgrading locks (read lock → writer lock).)
One condition variable is used for notifying writers, one for notifying readers, and one is used for upgrading locks.
Every operation locks the internal mutex before it inspects or changes the state. If the state doesn't allow for the thread to continue yet, it waits for it to change using the right condition variable.
Unlocking a writer lock or unlocking the last reader lock results in notifying both the reader and writer condition variable. One writer and all readers are woken up.
This implementation isn't terribly efficient, but it's easy to prove correct as it's all protected by a single mutex.
These locks prefer writers. A waiting writer on a read-locked lock prevents new readers from acquiring the lock, to avoid writer starvation.
(There is also a 'v2' implementation which is similar but slightly smaller. (224 bytes on glibc.) It does not support upgrading and only uses two condition variables, and it only notifies one condition variable (either the readers or the writers, not both) on unlock.)
tl;dr: Boost on Posix platforms uses only standard pthread locks, and does not directly use any platform-specific things like futexes. The mutex is just whatever the underlying libc provides (see glibc and musl above). The condition variable is mostly a wrapper around libc's version, except for an additional mutex used by a trick to provide thread cancellation. But its reader-writer locks are not just `pthread_rwlock_t`, and are instead implemented on top of a mutex and three condition variables, making them very big. They prefer writers to avoid writer starvation.
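For illustration, the mutex-plus-condition-variables approach boils down to something like the following sketch (not Boost's actual code): raw lock/unlock operations instead of RAII guards, no upgrade support and no cancellation, but the same single-mutex-protected state and writer preference.

```rust
use std::sync::{Condvar, Mutex};

#[derive(Default)]
struct State {
    readers: u32,         // number of active read locks
    writer: bool,         // is a write lock held?
    writers_waiting: u32, // writers blocked in write()
}

#[derive(Default)]
pub struct SharedMutex {
    state: Mutex<State>,
    readers_cv: Condvar, // readers wait here
    writers_cv: Condvar, // writers wait here
}

impl SharedMutex {
    pub fn read(&self) {
        let mut s = self.state.lock().unwrap();
        // Writer preference: block new readers while any writer is waiting.
        while s.writer || s.writers_waiting > 0 {
            s = self.readers_cv.wait(s).unwrap();
        }
        s.readers += 1;
    }

    pub fn read_unlock(&self) {
        let mut s = self.state.lock().unwrap();
        s.readers -= 1;
        if s.readers == 0 {
            self.writers_cv.notify_one();
        }
    }

    pub fn write(&self) {
        let mut s = self.state.lock().unwrap();
        s.writers_waiting += 1;
        while s.writer || s.readers > 0 {
            s = self.writers_cv.wait(s).unwrap();
        }
        s.writers_waiting -= 1;
        s.writer = true;
    }

    pub fn write_unlock(&self) {
        let mut s = self.state.lock().unwrap();
        s.writer = false;
        // Wake one writer and all readers, and let the state checks above
        // decide who actually gets to proceed.
        self.writers_cv.notify_one();
        self.readers_cv.notify_all();
    }
}
```

Everything is guarded by the one internal mutex, which is why this style is easy to reason about but not particularly efficient or compact.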
Futexes on Linux seems like a good plan. Remember to read "Futexes Are Tricky" if we're going down that road: https://dept-info.labri.fr/~denis/Enseignement/2008-IR/Articles/01-futex.pdf This is basically an explanation of glibc's mutex implementation, as I understand it.
I would highly recommend just copying an existing mature C/C++ implementation of concurrency-primitives-built-on-futexes instead of trying to invent something new.
Do we need a 64-bit counter? A futex (on Linux and BSD, today) is 32-bit, so having 64-bit counters makes things more complicated. A 32-bit counter could potentially overflow all the way until it reaches the same value while unlocking the mutex in condvar.wait(), although that seems extremely unlikely.
It's not so much accidental overflow that is the concern here, but rather attacker-controlled overflow causing memory unsafety. This stuff can be surprisingly attacker-controllable if exploited in the right way.
Futexes on Linux seems like a good plan. Remember to read "Futexes Are Tricky" if we're going down that road: https://dept-info.labri.fr/~denis/Enseignement/2008-IR/Articles/01-futex.pdf This is basically an explanation of glibc's mutex implementation, as I understand it.
That's an old version of that paper. A newer version is available here.
It explains the design of basically the most simple futex-based mutex by taking a bit of a detour through an obviously incorrect mutex first. (And a second implementation based on the assumption that `Atomic*::swap` does not exist.) The final simple mutex implementation it describes is indeed what glibc's mutex comes down to in practice. As mentioned above, it is identical to the mutex I implemented in my experiment before I started investigating the various existing implementations: https://github.com/m-ou-se/futex-lock-experiment/blob/ab0e9a135aff5f3b2e01218a3281cbc552d76653/src/lib.rs#L30-L109
The paper does not describe condition variables or reader-writer locks, which are far more interesting.
I would highly recommend just copying an existing mature C/C++ implementation of concurrency-primitives-built-on-futexes instead of trying to invent something new.
We haven't invented anything new here. My 'newly invented' experimental mutex implementation turns out to be identical to the one in glibc and the one in the paper you mentioned. My 'newly invented' experimental condition variable turns out to be identical to the one in GLib and Wine. (Not that these will necessarily be the ones we continue with.)
Of course we should be and will be very careful with going with anything that differs from what we already see in other implementations. The goal is not to 'invent something new', we are exploring what existing implementations do and learning which of their quirks and complexity is and isn't relevant for us, to be able to make an informed decision on what to copy and what not to copy.
My conclusion from all the C and C++ implementations I've investigated so far, is that a lot of their complexity comes from requirements that we do not have in Rust, such as allowing dropping while still in use, detecting misuse that Rust already statically detects through the borrow checker, runtime configuration such as whether to allow re-entrance or whether to prefer readers or writers, allowing thread interruption/cancellation, etc. etc. Just copying an existing implementation could mean we end up with a very complex `Condvar` that will be extremely hard to review and maintain, or an absolutely huge `RwLock` of 312 bytes that could've been 4 bytes.
It's not so much accidental overflow that is the concern here, but rather attacker-controlled overflow causing memory unsafety.
You cannot cause memory unsafety with such an overflow. The `Condvar` implementation in my experiment doesn't even use any unsafe code at all. Missing exactly 4294967296 notifications while unlocking the mutex in `Condvar::wait` would only cause the thread to go to sleep when it should not. While this may in some situations result in denial of service, that's not really an issue in a situation where an attacker has already gained enough control to completely block the thread at the exact right moment for long enough to make it miss over four billion notifications.
GLib has a quite simple futex mutex/cond implementation without any bells and whistles, if you want to take a look at another one. It's probably not very different from the one you have though, there's only so many things you can possibly do differently.
GLib has a quite simple futex mutex/cond implementation without any bells and whistles, if you want to take a look at another one. It's probably not very different from the one you have though, there's only so many things you can possibly do differently.
Thanks. It's indeed basically identical, except it uses an atomic add rather than a compare-exchange when locking the mutex, although it does overwrite it with a 2 right afterwards if it was incremented higher than 1. This means that the state of the mutex can temporarily get higher than 2, with a slightly higher chance of futex wait calls unnecessarily returning `EAGAIN`. But (at least on x86) it results in slightly simpler machine code. It also means it's theoretically possible to overflow the mutex, causing memory unsafety.
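In other words, the locking fast path described here looks roughly like this sketch (based on the description above, not GLib's actual code), using the same hypothetical `futex_wait` helper as in the earlier mutex sketch:

```rust
use std::sync::atomic::{AtomicU32, Ordering::Acquire};

fn glib_style_lock(state: &AtomicU32) {
    // Fast path: 0 -> 1 means we took an uncontended lock. The unconditional
    // add is cheaper than a compare-exchange on some targets, but the value
    // can briefly exceed 2 when several threads race here (and could, in
    // theory, eventually overflow).
    if state.fetch_add(1, Acquire) != 0 {
        // Slow path: force the state back to 2 ("locked, contended") and
        // sleep until it becomes 0. swap() returning 0 means we own the lock.
        // `futex_wait` is the helper from the mutex sketch earlier on.
        while state.swap(2, Acquire) != 0 {
            futex_wait(state, 2);
        }
    }
}
```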
I would also like to recommend semaphore-based synchronization primitives, if they haven't been considered already, and if simplicity (but not necessarily throughput) is a top priority.
Related, but for a performant, futex-based, RwLock there's also this algorithm.
Here is a quick analysis of Apple Darwin's pthread implementation:
I'm not going into as much detail as for the other implementations. MacOS has native (undocumented) syscalls for things like mutexes and rwlocks (psynch), but also supports a not-fully-documented futex-like syscall (ulock_wait). By default, the pthread locking types use their psynch-based implementation, but there is also an alternative ulock 'futex based' implementation.
All of `pthread_mutex_t`, `pthread_cond_t` and `pthread_rwlock_t` in this implementation contain fields which need a higher alignment than the alignment of the struct itself. This issue is worked around by adding padding after the fields, and dynamically calculating the offset of the field. This means that this now seems to be the first implementation I've seen where the types are truly not movable. Moving one of these to a differently aligned address would shift around those 'overaligned' fields.
`pthread_mutex_t` is 64 bytes, `pthread_cond_t` is 48 bytes, and `pthread_rwlock_t` is 200 bytes.
Mutex
The mutex locking functions do not seem to spin in user space before calling into the underlying layer to wait for the mutex. They do atomic compare-exchange operations to avoid these calls on an uncontended lock, but do not spin for some time to wait for the lock to be released.
Similar to all other implementations, it also avoids the wake/unlock syscall on unlock when there is no contention.
Condition variables
Neither of the implementations of condition variables seems to requeue waiters onto the queue of the mutex; instead they wake up all threads at once. I can't see what `__psynch_cvbroad` ends up doing in the kernel, but it does not get a pointer to the mutex, so requeueing seems out of the question.
The struct does store a pointer to the mutex used by the wait operation, but it is only used for checking correct usage, and cleaning up when a thread gets cancelled.
Futex-based Condition variable
The most important part of the futex-based condition variable is the 64-bit state, containing a 32-bit sequence/notification counter, a 16-bit waiter counter, and a 16-bit counter for unconsumed signals. From a brief inspection of the code, it seems that it suffers from the same issue as the trivial condition variable: if it misses exactly 4294967296 notifications while unlocking the mutex, it can go to sleep when it shouldn't. That's nearly impossible in practice though.
The two 16-bit counters make it possible for `pthread_cond_signal` and `pthread_cond_broadcast` to skip the wakeup syscall in case no thread is waiting, and also in case all waiting threads have already been signalled but are still waking up.
Reader-writer lock
The default reader-writer lock prefers writers in the sense that new readers on a read-locked lock are blocked if there are any writers waiting. However, those new readers do seem to get priority over any writers that come later. That is, all readers and writers seem to be part of the same queue.
And here are some details of FreeBSD's pthread implementation:
All of the pthread locking types are 'boxed' on FreeBSD. They are just a single pointer, pointing to the heap-allocated implementation. (This means that today, `std::sync::Mutex` on FreeBSD is effectively a `Box<Box<ActualMutex>>`. :( )
The kernel directly supports native syscalls for mutexes, condition variables, and read writer locks. These types are not futex-based, so of less interest to us.
Locking optionally spins with a spin loop hint, and then optionally spins with `thread::yield()`, before calling the locking syscall. Neither is enabled for the default mutex type.
Mutexes contain a bunch of pointers to make them part of several intrusive (doubly) linked lists, which are used to keep track of all the currently locked (and condition-variable-waited-for) mutexes used by a process, which is used for handling terminated threads/processes, and to patch up ownership when forking.
Broadcast notifications on condition variables do not result in the kernel requeueing the waiters to the mutex, but simply wake up all threads.
Based on everything I've seen in the other implementations, I think these are some of the relevant design questions for futex-based locks:
1. Prevent unnecessary wake syscall when unlocking without contention?
Universally the answer in all implementations is yes. Without this, even an uncontended lock would require syscalls.
However, there is still a trade-off to be made about how accurately this should be done. Most implementations only keep a single 'contended' bit, which means the last unlock in a series of contended unlocks cannot know it was the last one, and will do an unnecessary syscall. This is true for at least: Glibc, Glib, Musl, and Wine.
Some implementations track more information so they know exactly whether there are any waiting threads, removing an additional unnecessary syscall in the contended case. How much this is worth is debatable. This is done by at least: Microsoft, and parking_lot.
(Answers: No (none) / Yes, but not fully accurate (most) / Yes, fully accurate (some))
2. Do we keep the wait queue(s) in user-space?
Microsoft's SRW locks and condition variables, musl's condition variable, and parking_lot all keep their own waiting queue in user space, giving them full control over it and making it possible to avoid wake syscalls if there are no threads waiting.
Glibc, Wine, Glib, Apple, FreeBSD and musl (except for musl's condition variables) all do not construct any queues in user space, and leave it to the kernel to handle that. (Though Apple and FreeBSD might not be very relevant when designing for Linux, as the kernel most likely behaves differently.)
(Answers: Queues in user-space (some) / Leave it to the kernel (most))
3. Spin before the syscall when locking?
Musl, Microsoft, and parking_lot spin for a while when attempting to lock an already locked mutex before resorting to a syscall.
GLibc (by default), Glib, Wine, Apple, FreeBSD (by default) do not do this.
(Answers: Yes (some) / No (most))
4. Prefer writers over readers to avoid writer starvation?
Microsoft, Wine, Boost, parking_lot, and Apple all block new readers when there's a writer waiting. Glibc (by default) and Musl do not do this.
(Answers: Yes (most) / No (some))
5. Requeue on notify-all/broadcast?
Musl and parking_lot only wake up one waiter on `notify_all` (aka `broadcast`), and requeue the other waiters to the wait queue of the mutex, only to be woken up when the first thread unlocks the mutex.
Glibc, Glib, Wine, Microsoft, Apple, and FreeBSD simply wake up all threads and let them race for the lock.
(Answers: Yes, requeue (some) / No, wake up all (most))
6. Suppress spurious wake-ups?
Glibc and parking_lot avoid spurious wake-ups when waiting on a condition variable. If the wait syscall spuriously wakes up, they will repeat the call without returning.
Glib, Musl, and Wine do not do this. (I'm not sure if and when spurious wake-ups happen here on MacOs or FreeBSD, but they do not seem to avoid them in user space.)
(Answers: Yes (some) / No (most))
7. Prevent unnecessary wake syscall when notifying a condition variable without waiters?
Trivial implementations like Wine and Glib do not do this. `notify_one` and `notify_all` always do a syscall.
Microsoft, Musl, and parking_lot avoid it through their user-space queue. This has the downside that `notify_all` does one syscall per waiting thread rather than one syscall in total.
Glibc avoids it while still using only one syscall, at the cost of higher complexity. (Leading to the time traveling signal bug in the first implementation, as mentioned above.)
(Answers: Always do a syscall on notify / `notify_all` does one syscall per waiting thread (so, zero if there's none) / `notify_all` does one syscall in total if there's any waiting thread (so, zero if there's none))
I would really like to see benchmarks for most of these choices (1, 2, 3, 5, 7), as they trade off simplicity for performance, so they'll be easier to make when you know how much performance you lose.
- I think it is very much in the spirit of Rust's focus on correct concurrency to prevent writers from starving. We should try to avoid the other kind of UB (unexpected behaviour) wherever possible.
I think it is very much in the spirit of Rust's focus on correct concurrency to prevent writers from starving. We should try to avoid the other kind of UB (unexpected behaviour) wherever possible.
As mentioned before, this is not clear-cut: that approach would mean that if a thread that holds a read lock acquires another read lock (entirely harmless by Rust standards, just two shared references in the same thread), this can lead to a dead-lock if there is a writer waiting. This will only happen rarely though, making it a pain to debug.
In other words, there is unexpected behavior either way. Avoiding unexpected behavior might be possible by making it so that when a writer is waiting, no new threads will be granted read locks, but threads already holding a read lock will be granted further read locks -- but implementing that will probably cost too much performance, I would assume.
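As a concrete (hypothetical) illustration of that hazard: whether the following program finishes or deadlocks depends entirely on whether the `RwLock` implementation blocks new readers while a writer is waiting.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(0));

    let r1 = lock.read().unwrap(); // first read lock on the main thread

    let lock2 = Arc::clone(&lock);
    thread::spawn(move || {
        // This writer blocks behind r1 and is now *waiting*.
        let _w = lock2.write().unwrap();
    });
    thread::sleep(Duration::from_millis(100)); // let the writer start waiting

    // If waiting writers block new readers, this second read lock on the
    // same thread never succeeds: the writer waits for r1 to be released,
    // and this thread waits for the writer. Deadlock.
    let _r2 = lock.read().unwrap();

    drop(r1);
    println!("no deadlock on this platform");
}
```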
I would really like to see benchmarks for most of these choices (1, 2, 3, 5, 7), as they trade off simplicity for performance, so they'll be easier to make when you know how much performance you lose.
It's very tricky to make realistic benchmarks, or even decide what 'realistic' really is. There are many different usage patterns of locks, and optimizing a lock implementation for a specific benchmark is easy but not that useful.
Many of the choices cannot be made independently from each other, and have consequences for the complexity (and size) of the lock. There is also value in keeping them simple and small, next to keeping them performant.
I think it is very much in the spirit of Rust's focus on correct concurrency to prevent writers from starving.
Yes, pretty much every implementation does this, except the ones that strictly follow the Posix requirements. The main downside is that recursive read locking might result in a deadlock, as Ralf explained.
We discussed the design of several possible implementations in the library team meeting yesterday. Here's a summary of our most important conclusions:
- A 'trivial' futex-based implementation of these locks, like in my experiment, seems like a good way to go. It is simple and easy to implement and verify, as it leaves most of the complexity to the kernel. None of the more complex ways to implement these locks seem to have big advantages for Rust.
- The only thing unclear is what to do with spinning when locking. It's simple to add it, but some of the most used implementations don't do it. It works out well in benchmarks, but how it works out in realistic programs is questionable.
- Not requeueing, but waking up all threads on Condvar::notify_all, is fine. In some cases that might be more performant, and in general waking up a lot of threads to wait on the same mutex is a bit of an uncommon use case that will by its very nature never be very performant.
- A theoretical deadlock on exactly 4294967296 missed notifications when unlocking a mutex in Condvar::wait is fine. It's virtually impossible, and it does not result in UB. If an attacker can block the program in exactly that spot for this long, they already have enough control over the program to not need to exploit this to block the program.
- Avoiding spurious wake ups brings more complexity than it is worth.
- Avoiding writer starvation in RwLock is important.
- Windows' SRW lock's user-space queue is very similar to parking_lot's internal WordLock, and an interesting alternative for parking_lot as it works on all possible platforms. It doesn't need a futex; it can use anything that can be used as a thread parker. The downside is that the locks become a pointer in size rather than just a byte (or two bits), but the advantage is that it avoids the complexity of a global parking lot hashmap.
Started the Linux implementation for Mutex and Condvar here: #95035
This (draft) implementation does not (yet) spin before locking.
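For readers following along, the 'trivial' futex-based mutex being discussed is essentially the classic three-state design. Below is a minimal sketch of that idea (not the code from #95035), assuming Linux and the libc crate:

```rust
// A minimal sketch: 0 = unlocked, 1 = locked, 2 = locked with (possible)
// waiters. The whole lock is a single, movable, const-constructible AtomicU32.
use std::sync::atomic::{
    AtomicU32,
    Ordering::{Acquire, Relaxed, Release},
};

pub struct FutexMutex {
    state: AtomicU32,
}

impl FutexMutex {
    pub const fn new() -> Self {
        Self { state: AtomicU32::new(0) }
    }

    pub fn lock(&self) {
        // Fast path: 0 -> 1 with a single compare-exchange.
        if self.state.compare_exchange(0, 1, Acquire, Relaxed).is_err() {
            self.lock_contended();
        }
    }

    fn lock_contended(&self) {
        // Mark the lock as contended and sleep until it might be free again.
        while self.state.swap(2, Acquire) != 0 {
            futex_wait(&self.state, 2);
        }
    }

    pub fn unlock(&self) {
        // Only issue a wake syscall if someone might be waiting.
        if self.state.swap(0, Release) == 2 {
            futex_wake_one(&self.state);
        }
    }
}

fn futex_wait(atomic: &AtomicU32, expected: u32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            atomic as *const AtomicU32,
            libc::FUTEX_WAIT | libc::FUTEX_PRIVATE_FLAG,
            expected,
            std::ptr::null::<libc::timespec>(),
        );
    }
}

fn futex_wake_one(atomic: &AtomicU32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            atomic as *const AtomicU32,
            libc::FUTEX_WAKE | libc::FUTEX_PRIVATE_FLAG,
            1,
        );
    }
}
```

The uncontended path is a single atomic operation in each direction; unlocking only makes a syscall when the state was 2.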
Windows' SRW lock's user-space queue is very similar to parking_lot's internal WordLock, and an interesting alternative for parking_lot as it works on all possible platforms. It doesn't need a futex; it can use anything that can be used as a thread parker. The downside is that the locks become a pointer in size rather than just a byte (or two bits), but the advantage is that avoids the complexity of a global parking lot hashmap.
I should note that it is also a very reasonable option to use this kind of implementation on all platforms, including those where futexes are available.
I have a preference for using the simpler futex-based locks on platforms where that is possible, but I believe @Amanieu has a slight preference for using the SRW/WordLock-style user-space queue based locks on all platforms.
We should consider the trade offs between these before merging any changes.
For macOS, I've heard some concerns about not using the system's locks directly. The standard locks interact well with the QoS system and handle things like priority inversion. If we implement our own locks on macOS, we'll lose that part, which might be problematic.
We might be able to use os_unfair_lock rather than pthread_mutex_t, though.
It does not have timeouts for os_unfair_lock_lock/os_unfair_lock_trylock, so it is obviously out.
We only support timeouts for waiting on condition variables, not for locking mutexes, so that could be fine.
Also it doesn't support Condvar, which is required for Rust locks.
The Condvar implementation could be independent from the Mutex implementation, such as the trivial one in my experiment. (It'll be a bit more complicated if we can't use futex wait/wake, though.)
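For reference, such a trivial futex-based Condvar amounts to little more than a notification counter. Here is a minimal sketch of that shape (not the actual experiment code), reusing the FutexMutex and the futex_wait/futex_wake_one helpers from the mutex sketch above, plus a wake-all variant; the wrap-around of this 32-bit counter after exactly 2^32 missed notifications is the theoretical deadlock accepted in the meeting notes above.

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

pub struct FutexCondvar {
    // Bumped on every notification. Wrapping around after exactly 2^32 missed
    // notifications is the (accepted) theoretical deadlock mentioned earlier.
    counter: AtomicU32,
}

impl FutexCondvar {
    pub const fn new() -> Self {
        Self { counter: AtomicU32::new(0) }
    }

    pub fn wait(&self, mutex: &FutexMutex) {
        // Snapshot the counter while still holding the mutex, then release it.
        let seen = self.counter.load(Relaxed);
        mutex.unlock();
        // Sleep unless a notification arrived after the snapshot. Spurious
        // wake-ups are allowed, so callers re-check their condition in a loop.
        futex_wait(&self.counter, seen);
        mutex.lock();
    }

    pub fn notify_one(&self) {
        self.counter.fetch_add(1, Relaxed);
        futex_wake_one(&self.counter);
    }

    pub fn notify_all(&self) {
        self.counter.fetch_add(1, Relaxed);
        // Wake everyone and let them contend for the mutex; no requeueing.
        futex_wake_all(&self.counter);
    }
}

fn futex_wake_all(atomic: &AtomicU32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            atomic as *const AtomicU32,
            libc::FUTEX_WAKE | libc::FUTEX_PRIVATE_FLAG,
            i32::MAX,
        );
    }
}
```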
Another argument in favor of futexes is that futexes allow priority inheritance using variants of the wait and wake commands. They also allow the kernel to favor higher priority threads I think.
@bjorn3 Yes. But the priority inheritance version of those operations also expects us to store the thread id in the AtomicI32, which is not something we'd need otherwise. Getting the thread ID means accessing a thread local on every lock operation (even when uncontended), which can be significant in some situations. Maybe it can be an optional feature, or a separate type.
There is a very simple, const implementation of ReentrantMutex in lock_api which just uses a regular mutex along with two extra usize fields for storing the thread id and a counter, so std can eliminate the dependence on the pthread locks entirely. Also, it might be nice to make that mutex public.
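For illustration, the idea looks roughly like this. This is a minimal sketch, not lock_api's actual code (the real type is generic over the underlying raw mutex and the way thread ids are obtained); the inner lock here reuses the FutexMutex sketched earlier, and any raw lock()/unlock() pair would do.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};

pub struct ReentrantMutex {
    mutex: FutexMutex,   // the regular, non-reentrant lock
    owner: AtomicUsize,  // id of the owning thread, 0 when unlocked
    count: Cell<usize>,  // re-entrancy count, only touched by the owner
}

// `count` is only ever accessed by the thread that currently owns `mutex`.
unsafe impl Sync for ReentrantMutex {}

// One cheap way to get a unique, non-zero id per live thread: the address of
// a thread-local.
fn current_thread_id() -> usize {
    thread_local!(static ID: u8 = 0);
    ID.with(|id| id as *const u8 as usize)
}

impl ReentrantMutex {
    pub const fn new() -> Self {
        Self {
            mutex: FutexMutex::new(),
            owner: AtomicUsize::new(0),
            count: Cell::new(0),
        }
    }

    pub fn lock(&self) {
        let id = current_thread_id();
        if self.owner.load(Relaxed) == id {
            // Already held by this thread: just bump the counter.
            self.count.set(self.count.get() + 1);
        } else {
            self.mutex.lock();
            self.owner.store(id, Relaxed);
            self.count.set(1);
        }
    }

    pub fn unlock(&self) {
        let remaining = self.count.get() - 1;
        self.count.set(remaining);
        if remaining == 0 {
            self.owner.store(0, Relaxed);
            self.mutex.unlock();
        }
    }
}
```

The relaxed orderings are enough because `owner` can only equal the current thread's id if that thread stored it itself; the inner mutex provides all cross-thread synchronization.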
@m-ou-se For the last bullet point, the global map could be an intrusive tree to avoid dynamic heap allocation. Regardless, wouldn't this approach need to be considered at least for Condvar, for simulating a futex on systems that don't have OS-provided futexes and aren't Windows (e.g. NetBSD if not relying on the latest version, Darwin if not willing to use __ulock)?
[..] works on all possible platforms. It doesn't need a futex; it can use anything that can be used as a thread parker. The downside is that the locks become a pointer in size rather than just a byte (or two bits), but the advantage is that it avoids the complexity of a global parking lot hashmap.
I want to point out that byte-sized mutexes are not ideal performance-wise on all platforms. For example, MIPS and RISC-V only have 32/64-bit atomic instructions, so 8-bit/16-bit atomics need to be emulated with CAS. On these platforms the "downside" of a pointer compared with a u32 is less significant.
Also, given that we currently use a boxed pthread mutex, changing it to a pointer is a gain over the status quo anyway.
Related, but for a performant, futex-based, RwLock there's also this algorithm.
@kprotty Thanks! Will take a look at it.
Great work on https://github.com/kprotty/usync by the way. :)
Worth also looking at IMHO: Google's absl::Mutex type (design notes, code, guide). Some interesting things about it:
- it has great instrumentation that I'd personally love to see replicated in the Rust world some day. I assume that's not in scope of this project now. I mention it just in case there's some decision you're considering that might prevent/complicate doing so later.
- contention profiling, always on in production. On unlock, if there's contention, mutexes take a stack trace and add it to a data structure to be picked up by a profiler. The profile lets you see aggregate contention on the lock over time but not the latency caused by individual events.
- trace profiling, typically enabled for short intervals on production builds. This does let you see those individual latency events but is more expensive.
- runtime deadlock detector, I think typically used in debug builds.
- it not only has userspace queueing but also hooks for the Google-internal fibers library's scheduler. Basically this library is a fancy thread pool that uses real kernel-level threads but is similar to green threads in that it uses mostly user-level scheduling (having only O(cpus) threads available to the kernel scheduler) to improve efficiency and latency. One thread chooses the next to run via a "switch to" or "futex swap" syscall. This library requires proprietary Linux kernel extensions, but recently they've been trying to get a variant of this called UMCG into the kernel.
- it's not just a mutex; it's an rwlock. (And the rarely-used SpinLock type is an adaptive mutex rather than just a spin lock. Google's not great at naming things.)
- it supports conditional awaits checked on unlock, often used instead of a separate condvar type. (more about it in the guide)
- a cautionary tale: iirc there was at least one long-standing rare synchronization bug in this type eventually found through TLA+ modeling or some such. Sadly I don't see an open record of details.
Basically this library is a fancy thread pool that uses real kernel-level threads but is similar to green threads in that it uses mostly user-level scheduling (having only O(cpus) threads available to the kernel scheduler) to improve efficiency and latency.
How many of those efficiency gains exist only for relatively homogeneous workloads on big servers with lots of RAM and fast I/O, versus an average desktop system with rapidly changing workloads, relatively little RAM, and slower I/O? The kernel scheduler should be able to adapt better to the latter use case due to having knowledge about the whole system, right? While Google's fiber library only knows about the local (or cooperating) processes, albeit in more detail.
AFAIK, the only scenario in which Google's fiber schedulers have been used is one where a particular process is handed a more-or-less fixed slice of a machine's CPU (some fractional number of cores/hyperthreads) via Linux container groups. The efficiency advantages only come up when there are lots of busy threads within that process. The goal is to allow most of the efficiency of async with an easier programming model. I'm happy to discuss more, but maybe somewhere else? I don't want to derail this issue.
Worth also looking at IMHO: Google's absl::Mutex type
It's definitely an interesting mutex. I worked at Google ~8 years ago, so I'm familiar with at least some parts of its design, since absl's version is similar in usage to the mutex that was around internally back then. Tracking the wait conditions inside of it is definitely very ergonomic in many situations, but it makes it a bit of a different beast than our std::sync::Mutex. So that's something for a community crate rather than the standard library, at least for now.
I worked at Google ~8 years ago, so I'm familiar with at least some parts of its design, since absl's version is similar in usage to the mutex that was around internally back then.
Ahh, yeah, it's just a new name. absl::Mutex is written everywhere in the internal codebase now, and I think the open source version matches it almost exactly. (There might be a few code blocks marked Google-internal that get skipped on export.)
Tracking the wait conditions inside of it is [...] something for a community crate rather than the standard library, at least for now.
Makes sense.
The wait conditions are convenient, but the biggest reason I brought up absl::Mutex is that I hope to have something like its instrumentation some day.
For *BSD:
OpenBSD has futex(), supporting FUTEX_WAKE, FUTEX_WAIT, and FUTEX_REQUEUE.
NetBSD has SYS___futex, supporting FUTEX_WAKE, FUTEX_WAKE_BITSET, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE, FUTEX_WAIT_BITSET, and FUTEX_WAKE_OP.
FreeBSD supports Linux's futex() in its Linux emulation layer, but natively only provides similar functionality through _umtx_op(UMTX_OP_WAKE) and _umtx_op(UMTX_OP_WAIT_UINT). Those are almost identical, except that the wake operation does not return the number of threads it woke up, which is something our current futex-based RwLock uses.
Similarly, DragonFlyBSD has umtx_sleep() and umtx_wakeup(), where the wake function does not return the number of awoken threads.
the wake operation does not return the number of threads it woke up, which is something our current futex-based RwLock uses.
This doesn't break the implementation, it just means that when there's both readers and writers waiting, it'll wake up both one writer and all readers when unlocking, instead of only waking up a writer. I believe Windows' SRW lock does the same.
NetBSD has SYS___futex, supporting FUTEX_WAKE, FUTEX_WAKE_BITSET, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE, FUTEX_WAIT_BITSET, and FUTEX_WAKE_OP.
My understanding is that on NetBSD (like most BSDs), the syscall interface is not considered stable, is this not the case?
@Mouse If it helps: at some point I started to play with the futex implementations on various platforms. To be honest, I have forgotten the details. But maybe you can find something in it that is useful? https://github.com/pitdicker/valet_parking
@pitdicker Thanks! That's a great overview of what's available on all the different platforms!
So, have there been benchmarks of parking_lot's WordLock vs futexes vs SRW locks?
(I have already seen the surprisingly good results microbenchmarking the 2-bit lock vs the futex lock, although I feel like more diverse usage patterns would be great.)
An interesting data point is that parking_lot's non-word lock can be twice as fast when tested with <= 6 threads (Amanieu/parking_lot#33), which is the case for most multithreaded apps.
Another unknown would be the performance of the stdlib's Windows SRW lock vs this SRW-style lock crate: https://github.com/kprotty/usync
Also, note that futex V2 is a thing since Linux 5.16:
https://www.phoronix.com/scan.php?page=search&q=FUTEX2
I wonder if you could leverage it where supported for improved performance.
Also, I think there is at least one optimization in parking_lot worth backporting: hardware transactional memory.
https://en.m.wikipedia.org/wiki/Transactional_Synchronization_Extensions
It allows a 40% performance gain on memory access by using mutexes only as a fallback, and it offers a backward-compatible API, so one doesn't have to write two code paths (one for modern CPUs and one for legacy).
parking_lot uses TSX hardware lock elision for RwLock here:
https://github.com/Amanieu/parking_lot/blob/master/src/elision.rs
As you can see, the code is very concise.
TSX has been plagued with issues since its inception. There have been timing bugs (which can crash your system), and it has been used in side-channel attacks to e.g. break KASLR. Microcode updates have been provided which disable it, and newer CPUs are entirely missing hardware lock elision support, I believe. Because of this I don't really think there are advantages to using it anymore. It also has the disadvantage of making it impossible to step into a lock operation with a debugger, and it may break rr (not sure about the latter, though).
Also, note that futex V2 is a thing since Linux 5.16
Kernel 5.16 only brings futex_waitv, which might not be so useful here. Variable-sized (u8/u16/u32/u64) futexes (and NUMA-awareness) have not been implemented yet.
So, have there been benchmarks of parking_lot's WordLock vs futexes vs SRW locks?
usync includes throughput benchmarks from parking_lot's repo with the listed locks for comparison (usync vs stdlib vs parking_lot vs SRWLOCK-or-pthread).
An interesting data point is that parking_lot's non-word lock can be twice as fast when tested with <= 6 threads
The PR that was linked seems to be unrelated to the claim. Was the issue number off?
parking_lot uses TSX hardware lock elision for RwLock here:
HLE in parking_lot is disabled by default and hidden behind a feature flag (hardware-lock-elision). I also tried implementing it in usync at one point, but it didn't seem to affect throughput in the RwLock benchmarks on the x86 machines HLE was benchmarked on, so it was removed.
There is a very simple, const implementation of ReentrantMutex in lock_api which just uses a regular mutex along with two extra usize fields for storing the thread id and a counter, so std can eliminate the dependence on the pthread locks entirely.
@joboet I forgot to respond, but this happened in #96042
Also, it might be nice to make that mutex public.
That's certainly a possibility, but that requires a (small) RFC that includes the motivation and full API. One interesting question would be what to do with poisoning for such a ReentrantMutex: std's other locks have poisoning, but I'm not sure if any new locks should have that, although consistency is also important. (But that's not a discussion we should have here on this issue! That's for the RFC or a separate PR/issue/Zulip/internals thread.)
To unblock const Mutex::new, RwLock::new, and Condvar::new, #97647 changes these functions to no longer allocate on any platform, by instead using lazy allocation on the platforms that still use pthread.
I've noted that the description has changed from "use a thread parker" to "use lazy allocation" (which is just merged). I agree that lazy allocation is the best short-term option, but it should be noted that #85434 will not be fixed with lazy allocation; to fix it we still need a thread-parker based implementation (although this is no longer a blocker for const).
That would make it impossible to use priority inheritance, however, as we could not reliably reset the priority of an inheriting thread: its priority could have been changed while the lock was held.
For macOS, all problems would be solved if we could just use ulock: it is fast, lightweight, supports priority inheritance and even fancy things like spinning while the lock owner is active on another core. Does anyone know any Apple employees to contact about stabilizing this API? As the symbols are defined from 10.12 onwards, we would only need a fallback on older macOS versions, which seems justifiable to me (as long as Rust works OK on these systems, I think we can sacrifice a bit of performance in order to boost newer versions).
If that is not possible, we can use os_unfair_lock for Mutex (benefitting from all the nice things), falling back to a lazily-allocated pthread_mutex on older versions (if the symbols are not present). We can then fit our implementations of condition variables around this. For RwLock, priority inheritance is not relevant anyway, so the userspace implementation can be used unconditionally. os_unfair_lock is unfortunately not strictly speaking movable when unlocked, but its API implies it is (there are no destroy functions and it is initialized with a constant).
I agree that lazy allocation is the best short-term option, but it should be noted that #85434 will not be fixed with lazy allocation; to fix it we still need a thread-parker based implementation (although this is no longer a blocker for const).
We could also still do what you suggested and try_lock the lock before dropping or leaking it.
Does anyone know any Apple employees to contact about stabilizing this API?
I've tried through various routes, but it seems like we will not be able to use those APIs any time soon, unfortunately.
If that is not possible, we can use os_unfair_lock for Mutex […]
Yes, using os_unfair_lock together with a queue- or thread-parker-based condition variable would be a good improvement. I suppose technically the os_unfair_lock documentation doesn't guarantee it's movable, but it's just a u32 with no constructor or destructor functions, so it's nearly impossible for it to not be movable when not locked. However, we can only guarantee it's not moved while not borrowed, which doesn't necessarily mean that it's not locked, as the MutexGuard might have been leaked. It might be nice to have some confirmation that it's okay for the memory of a permanently locked os_unfair_lock to be re-used.
That would make it impossible to use priority inheritance, however, as we could not reliably reset the priority of an inheriting thread: its priority could have been changed while the lock was held.
We don't have priority inheritance for our futex implementation either. The doc deliberately says that locking behaviour is unspecified so we can be flexible about implementation.
For most users, the benefits of a small and lightweight mutex outweigh the lack of priority inheritance. (There's no way to set thread priorities with just std anyway, so I'd argue that users shouldn't expect anything about priority inheritance.)
How important priority inheritance is depends on the platform. Both macOS+iOS and Fuchsia folks have told me it's very important on their platforms.
Is the main goal of priority inheritance to avoid priority inversion? If so, priority inversion feels like an inherent anti-pattern, since it requires threads of different priorities to contend on the same lock.
If priority inheritance is implemented, the pattern can be quite useful. For example, in an audio player, the decoding of the next song could be run in a background thread while holding a mutex. If the song needs to be played but it has not finished decoding, the high-priority player thread can lock the mutex, thereby raising the priority of the background thread until it finishes. This pattern is really hard to implement with channels or other synchronization primitives, but PI-mutexes make it easy.
Of course, these mutexes could be implemented in a crate and most users don't actually use priorities. However, as far as I understand (having not worked on any macOS specific projects), priorities are much more prevalent in idiomatic macOS programming, for example with the Dispatch framework, so respecting the platform default is definitely a good choice.
One could argue that you shouldn't be using mutexes for such scenarios; instead, boost the background thread's priority directly and then wait for it to complete, or have the audio thread take over the decoding work itself via work stealing. PI-mutexes in general (excluding os_unfair_lock) have too high a throughput cost for general use, so a crate, as you've mentioned, is a good alternative.
Darwin's dispatch is a fairly inefficient platform to program against (concurrent queues are poorly scheduled, dispatch_async requires dynamic heap allocation, dispatch_io_read/write defers to poorly scaling fs thread pools, etc.). The recommended approach for dispatch usage is to avoid blocking locks (avoiding situations with priority inversions) and instead have a small, fixed number of serial queues / "actors" at the necessary priorities, treating multi-priority locking as an anti-pattern here as well.
I feel like priority inheritance is being dismissed too aggressively.
Instead, boosting the background thread's priority directly then waiting for it to complete or having the audio thread take over the decoding work itself via work stealing.
In a simple example like this, you could manage scheduling manually if the OS supports it. But can you explain why that is better than having a mutex do it for you?
Work stealing implies adopting a very different program structure to avoid a scenario that could easily be solved by your mutex implementation.
PI-mutexes in general (excluding os_unfair_lock) have too high a throughput cost for general use
I do not know about the general landscape, but as you note, there are exceptions. Fuchsia builds priority inheritance directly into its futexes so it's not very expensive at all.
Darwin's dispatch is fairly inefficient of a platform to program with (concurrent queues are poorly scheduled, dispatch_async requires dynamic heap allocation, dispatch_io_read/write defer to poorly scalable fs thread pools, etc.).
Sure, but Dispatch doesn't exist to provide high-throughput concurrency primitives for general compute. Its goal is to enable concurrency/parallelism, without subtle bugs, in apps that prioritize user responsiveness and efficient use of system resources.
The recommended approach for dispatch usage is to avoid blocking locks (avoids situations for priority inversions) and instead have a small, fixed amount of serial queues / "actors" on necessary priorities - treating multi-priority locking as an anti-pattern here as well.
Unless I'm misunderstanding something in this thread, locks on darwin don't cause priority inversions because they inherit priority. The recommended practice of using queues over locks is generally useful because it can simplify code, reduce bugs, and reduce the need for context switching, but I strongly suspect there are many use cases where you do need a lock.
The example that @joboet gave is one. The situation is that you have set things up to make efficient use of system resources by not prioritizing urgent things, until they actually do become urgent. Then you want the system to be able to respond effectively. I don't know of a way to change the priority/QoS of a queue that affects already running tasks. That makes priority inheritance important for responsiveness.
You can say that the application developer needs to rewrite their code, but that doesn't help the user whose iPhone catches on fire every time they hit play. So the platform prioritizes fixing these situations just in time whenever possible.
I think the broader point I want to make is two-fold:
- The kernel scheduler needs to make global scheduling decisions. Knowing who's waiting on who can help it do its job more effectively. Hiding this information and attempting to make those decisions locally within a process means every programmer is implementing their own scheduler using limited information, and this can easily lead to bad outcomes.
- Removing priority inheritance on platforms that use it privileges one class of use cases over another. Microbenchmarks are great if what you care about is high-throughput compute. If you care about user responsiveness under 99th-percentile system load, they aren't very helpful. Furthermore, I suspect that priority inheritance exists on these systems for good reason.
I do not know about the general landscape, but as you note, there are exceptions. Fuchsia builds priority inheritance directly into its futexes so it's not very expensive at all.
For platforms that do have PI-futexes, it might not be so expensive. But macOS doesn't provide PI-futexes.
The example that @joboet gave is one. The situation is that you have set things up to make efficient use of system resources by not prioritizing urgent things, until they actually do become urgent. Then you want the system to be able to respond effectively. I don't know of a way to change the priority/QoS of a queue that affects already running tasks. That makes priority inheritance important for responsiveness.
That's a reasonable use case for a PI mutex. But does it have to be the mutex provided by std? We currently don't have any guarantee that std mutex will do PI, and will certainly not guarantee it because not all platforms have PI. We could guarantee PI for a specific platform, but that'll make programs depending on PI less portable.
- Removing priority inheritance on platforms that use it privileges one class of use cases over another. Microbenchmarks are great if what you care about is high-throughput compute. If you care about user responsiveness under 99th-percentile system load, they aren't very helpful. Furthermore, I suspect that priority inheritance exists on these systems for good reason.
It can also be said that requiring PI privileges one class of use cases over another.
I would also point out that we certainly don't use PTHREAD_PRIO_INHERIT currently, so it is incorrect to say that we are "removing priority inheritance" from a platform because we don't have one in the first place.
My apologies if pthread mutexes on macOS enable priority inheritance by default.
I don't have a problem with people using PI; I just think that it shouldn't be the default, especially if doing so would disadvantage users not requiring PI. It could be a third-party crate or a platform-specific extension.
I think there are two independent questions here:
- Should we support priority inheritance? Yes, I think we should.
- Should we enable priority inheritance by default? Only if it has zero overhead.
I would also point out that we certainly don't use PTHREAD_PRIO_INHERIT currently, so it is incorrect to say that we are "removing priority inheritance" from a platform because we don't have one in the first place.
My apologies if pthread mutexes on macOS enable priority inheritance by default.
Ah, I probably overgeneralized! pthread mutexes do enable PI by default on Fuchsia, there was a PR removing it, and I projected that onto the macOS discussion in this thread. Sorry.
- Should we enable priority inheritance by default? Only if it has zero overhead.
Generally speaking I think we should follow the default or preferred method for each platform. If using non-PI locks is considered a hazard for the platform's use case and scheduler, then we should use PI by default on that platform. If it isn't clear what the preference is, then going with the least overhead approach makes sense.
My apologies if pthread mutexes on macOS enable priority inheritance by default.
If the people working on macOS and Fuchsia say that priority inheritance in locks on their platforms is important, I see no reason to dismiss that. As of #97647, changes to Rust's locks on those platforms are no longer blockers for const fn new, so there's no need to push for changes beyond what the maintainers of those platforms are comfortable with.
But can you explain why that is better than having a mutex do it for you?
@tmandry A priority-inheritance mutex, for platforms that don't natively support one, requires building a custom mutex around the thread id and atomically viewing/updating its priority for waiters. This involves external mutual exclusion and degrades the throughput of lock waiting under contention. Platforms which do support PI mutexes (FreeBSD umtx_op(mutex_lock) and Linux futex(FUTEX_LOCK_PI)), excluding things like Darwin's os_unfair_lock and Fuchsia's futex_owner, have even worse base throughput than their non-PI, futex-based equivalents. It penalizes the common case of not needing to be PI-aware.
Work stealing implies adopting a very different program structure to avoid a scenario that could easily be solved by your mutex implementation.
You could easily solve the situation by boosting all threads in the equation and using normal mutexes as well. For the case where ease of use is a priority, this may even be better. The argument for work stealing is that it's better latency-wise for a thread already running at high priority to take over the work it needs done than to boost another thread's priority, then wait for it to finish and wake it up. This is through the lens of scheduling efficiency, which I assume was the reason for using high-priority threads in the first place.
Its goal is to enable concurrency/parallelism, without subtle bugs, in apps that prioritize user responsiveness and efficient use of system resources.
libdispatch, while I can't claim any better on "subtle bugs" (excluding its infamous thread explosion from global queues and dispatch_sync) or "user responsiveness", does not use system resources efficiently compared to other/custom scheduling policies, as shown by the needless heap allocation and poor concurrent-queue scaling. I may also be using a different baseline for "efficient".
That makes priority inheritance important for responsiveness.
Priority inheritance, in the general sense, is important but it doesn't have to be achieved via the kernel. It can be achieved in userspace (e.g. work stealing, manual boosting) since the "task" is of scheduling importance not the threads themselves. Paradoxically, being offered by the OS may or may not be better than userspace solutions.
Knowing who's waiting on who can help it do its job more effectively
While it makes sense theoretically, it appears true in practice for some (e.g. Darwin, maybe FreeBSD or Fuchsia) and less so for others (e.g. Windows, Linux, NetBSD), which is unfortunate.
Hiding this information and attempting to make those decisions locally within a process means every programmer is implementing their own scheduler using limited information, and this can easily lead to bad outcomes.
They effectively have to for achieving high throughput or low latency. OS-wide scheduling decisions (UMS, sleep for yielding, priority based scheduling to some degree) don't scale for individual tasks. What makes programs fast or responsive are local decisions like cache access, userspace task scheduling, manual IO/compute batching, etc. as they avoid going through the OS for optimization decisions. Even more extreme methods of program optimization involve assuming that the program owns the entire machine / bypasses the OS (cache prefetching, thread pinning, io_uring SQ thread, etc.)
If you care about user responsiveness under 99th-percentile system load, they aren't very helpful
Benchmarks can record things like lock acquisition time statistics with latency percentiles to measure responsiveness. This is helpful for assessing properties like lock fairness and wait queue overhead under configurable low and high contention.
Regarding the Mutex provided by Rust stdlib, I don't mean to advocate for not using PI-aware blocking/locking when it's advised by the platform or practically free in execution cost. However I've hopefully explained why it shouldn't be a requirement as an implementation detail for Mutex on all or any specific platform. I do advocate for higher throughput solutions even in the presence of PI-aware alternatives, although that doesn't seem to be a priority here (no pun intended).
I have a question about the high-level plan here: if I understood correctly, we now have (or will have) a generic Mutex/RwLock implementation on top of thread parking:
It also means that porting the standard library to a new platform will be easier, and maintenance of the std implementation for the less common platforms will become easier. These platforms will now only have to implement a thread parker and can rely on a performant and correct implementation of the standard locks on top of that.
Is that still the plan? If it is, I assume the "generic" thread parker implementation will disappear, because instead we will have "generic" Mutex/RwLock implementations? (I didn't see such a work item in the amazingly detailed tracking at the top of this issue, hence the question.)
I've now submitted PRs replacing it on all currently supported platforms (#97140, #98391, #98534), so the generic parker can be removed, although it doesn't need to be.
Thanks!
So where can I find those generic parker-based Mutex/RwLock implementations? The one in sys_common seems to just wrap some platform-specific implementation.
Ah, that explains my confusion. :) Thanks!
There is a slight issue with os_unfair_lock: if it is locked recursively, it raises SIGTRAP. This is allowed by the API and does not result in unsoundness, but it is really unidiomatic and hard to debug. While it is possible to guard against (e.g. by installing a deadlocking signal handler and resetting it if the lock was acquired), that could negate the performance gain over the current mutex.
Edit: actually, it calls __builtin_trap (like core::intrinsics::abort), which does not have any specified behaviour. However, mach_thread_self() can be used to obtain a non-null thread id, making it possible to guard against re-entrant locking with only two extra (relaxed) atomic operations and 4 extra bytes of storage. Incidentally, the same id is used in the lock itself, but Apple's documentation explicitly forbids relying on the implementation of the lock.
For Darwin, I think it's quite unlikely we could use __ulock_{wait,wake} for a futex API directly without risking... many problems. [..] That said, there's... a pretty cursed option for a workaround on these targets. Too cursed to use? Not sure... possibly.
That is: libc++'s implementation of std::atomic<u32>::wait, notify_one, and notify_all will ultimately bottom out to calls to __ulock_wait/__ulock_wake, which... I guess they get to use because libc++ gets shipped and updated with the OS on Darwin, fair enough. Anyway, in theory we could just call that somehow. And there's a sense in which this would be pretty nice, since it's a system-provided function that implements (something similar to) the API we want... But it's certainly not ideal for a few reasons: [..]
@thomcc I implemented that approach here: https://github.com/m-ou-se/atomic-wait/blob/main/src/macos.rs
Unfortunately, it doesn't have any timeout functionality. :(
Could we implement a futex for Darwin using a global hash table + pthread primitives, like parking_lot does?
Unfortunately, it doesn't have any timeout functionality. :(
Ah, yeah, that's unfortunate... FWIW, I did convince @complexspaces to ask Apple engineers about stabilizing __ulock* when they met with them in a Lab at the most recent WWDC. They were apparently positive on the idea, but no movement has happened since then, AFAIK.
Could we implement a futex for Darwin using a global hash table + pthread primitives, like parking_lot does?
Might be fine on Darwin given that WebKit does it, but it may need different tuning than the existing impls, since spinning tends to do worse on their machines.
A global hash table doesn't scale well to many threads.
A global hash table doesn't scale well to many threads.
@bjorn3 This is how futexes are implemented on most OSes. Scalability is addressed by having enough buckets that collisions are rare, growing the buckets for the same reason, or using trees to make operations O(log n) in the number of threads in the process (which won't be many).
Linux uses roundup_pow_of_two(256 * num_possible_cpus()) to calculate the hash table size. num_possible_cpus() is not something we can get in userspace, due to CPU hotplugging. In addition, Linux has a single table of that size for the entire system, while doing this in user space would duplicate such a table for each process. Say you have a 64-core system; then you would get a 256 * 64 * 40 / 1024 = 640 KiB table for every single process (every pthread_mutex_t is 40 bytes on Linux).
growing the buckets for the same reason
That requires blocking all threads that currently have a mutex locked, and having all contended mutexes block on the next lock until the hash table has been grown. Additionally, it makes allocation less predictable and may even deadlock if the allocator needs to acquire a lock.
FWIW, I don't think anybody is suggesting we switch to this on Linux. It would only be used as the fallback on platforms where there's no stable futex-like API. I think Darwin is the main case here, although there are probably a few tier2 platforms that would get it as well.
Linux uses roundup_pow_of_two(256 * num_possible_cpus()) to calculate the hash table size. num_possible_cpus() is not something we can get in userspace due to CPU hotplugging
I don't think Darwin supports CPU hotplugging to increase the number of CPUs beyond what you boot with, so I believe this is sysctlbyname("hw.ncpus", ...) there, at least (note that while hw.ncpus is technically deprecated, it's actually the one we want, since it's the boot-time value of hw.logicalcpu_max).
I agree about most of the other downsides -- it's not free. I do think it's worth considering though (although likely with different tuning than the Linux impl): even on Darwin, the benefits led WebKit (which is quite careful to avoid things that cause performance issues on Darwin platforms) to use it, despite the per-process overhead and invisibility to the OS scheduler.
Admittedly, they don't use it for 100% of their locks.
Additionally it makes allocation less predictable and may even deadlock if the allocator needs to acquire a lock.
This is already the case on the platforms where this would be used :(.
num_possible_cpus() is not something we can get in userspace due to CPU hotplugging
@bjorn3 A userspace implementation doesn't have to copy Linux verbatim. The equivalent would be std::thread::available_parallelism(), but Linux's approach in general is still sub-par for userspace, as it requires dynamic heap allocation (the allocator may require the Mutex, so it's a circular dependency, as you note).
while doing this in user space would duplicate such a table for each process
Yes, the table doesn't have to be large, and it's fine for it to be per-process. This is what Zig, Windows, parking_lot, Go, and OpenBSD do under the hood.
Additionally it makes allocation less predictable and may even deadlock if the allocator needs to acquire a lock.
Yes. I believe this was one of the main/current blockers for getting parking_lot into the stdlib.
What I suggest as a futex implementation isn't one that dynamically heap-allocates, but something more like Zig, NetBSD, or Go, where there's a static, fixed-size array of buckets, each bucket holding a self-balancing search tree of intrusive nodes containing a list of all threads waiting on the same address.
This avoids heap allocation (the nodes for each thread can live on their stack during the futex call) and handles scalability by making waiting/waking take at worst O(log n) lookup/deletion (where n = concurrent addresses being waited on that all hash-collide to the same bucket) and O(1) insertion/removal. Here's an example userspace implementation.
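To make the shape concrete, here is a heavily simplified sketch with hypothetical names. It keeps the static, fixed-size bucket array, but uses a Vec of reference-counted waiters guarded by std's (on such platforms, pthread-based) Mutex and Condvar instead of an intrusive balanced tree, purely to keep the example short; the real proposal keeps the nodes on the waiting threads' stacks and allocates nothing.

```rust
use std::sync::{
    atomic::{AtomicU32, Ordering::Relaxed},
    Arc, Condvar, Mutex,
};

struct Waiter {
    addr: usize,                        // address being waited on
    state: Arc<(Mutex<bool>, Condvar)>, // flag becomes true once woken
}

type Bucket = Mutex<Vec<Waiter>>;
const EMPTY: Bucket = Mutex::new(Vec::new());
static BUCKETS: [Bucket; 64] = [EMPTY; 64];

fn bucket_for(addr: usize) -> &'static Bucket {
    // Trivial hash; a real implementation would use something better.
    &BUCKETS[(addr >> 4) % BUCKETS.len()]
}

/// Block until woken, unless `*atomic != expected` (checked while holding the
/// bucket lock, so a concurrent wake cannot be missed).
pub fn futex_wait(atomic: &AtomicU32, expected: u32) {
    let addr = atomic as *const AtomicU32 as usize;
    let mut waiters = bucket_for(addr).lock().unwrap();
    if atomic.load(Relaxed) != expected {
        return;
    }
    let state = Arc::new((Mutex::new(false), Condvar::new()));
    waiters.push(Waiter { addr, state: Arc::clone(&state) });
    drop(waiters); // release the bucket before sleeping

    let (lock, condvar) = &*state;
    let mut woken = lock.lock().unwrap();
    while !*woken {
        woken = condvar.wait(woken).unwrap();
    }
}

/// Wake up to `n` threads waiting on `atomic`; returns how many were woken.
pub fn futex_wake(atomic: &AtomicU32, n: usize) -> usize {
    let addr = atomic as *const AtomicU32 as usize;
    let mut waiters = bucket_for(addr).lock().unwrap();
    let mut woken = 0;
    let mut i = 0;
    while i < waiters.len() && woken < n {
        if waiters[i].addr == addr {
            let waiter = waiters.swap_remove(i);
            let (lock, condvar) = &*waiter.state;
            *lock.lock().unwrap() = true;
            condvar.notify_one();
            woken += 1;
        } else {
            i += 1;
        }
    }
    woken
}
```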
Is it still possible to voice concerns, or is the implementation a done deal? I would like to see RwLock's read portion be recursive so that it is symmetric to Rust's references. Otherwise, more abstracted code might have to hand-roll its own checks to avoid deadlock, or change its design in an awkward manner.
This was an intentional design decision. Recursive read locks effectively require that readers always acquire the lock without blocking if there is already an existing reader. The problem is that this could lead to writer starvation, where readers are always overlapping and a writer never gets a chance to acquire a write lock. This is important because many RwLock workloads are read-heavy with relatively rare writes.
Oh, that is quite surprising. It seems pretty hard to ensure in a codebase that there is no reader already active on the current thread -- that kind of reasoning is not usually required in Rust, and is hard to do in a compositional/modular way without compiler support. (Everybody knows one needs to be aware of that when writers are involved, but since writers are usually rare, that is a lot easier to do.) It is quite common to explain RwLock as basically a thread-safe RefCell, so subtle extra panic conditions like not allowing recursive readers are footguns waiting to go off.
I worry this might lead to hard to diagnose random panics for RwLock users. This is even more the case if recursive reading only panics on some but not all platforms, or only panics when there is a writer waiting.
Isn't it possible to let in new readers from threads that already have a read lock, but block readers from other threads?
AFAIK it doesn't panic on recursive reads, it just deadlocks if there happens to be another writer waiting for the lock.
Isn't it possible to let in new readers from threads that already have a read lock, but block readers from other threads?
That makes the lock much more complicated, since it would need to track the thread IDs of all threads that own a read lock (of which there can be arbitrarily many).
AFAIK it doesn't panic on recursive reads, it just deadlocks if there happens to be another writer waiting for the lock.
That seems worse than a panic.^^ And the "if there is a writer waiting" makes it near impossible to test for this scenario.
From what I could see, the documentation only mentions the possibility of panics on recursive reads, not deadlocks.
That makes the lock much more complicated, since it would need to track the thread IDs of all threads that own a read lock (of which there can be arbitrarily many).
Yeah I know. But the current situation makes using the lock in large codebases much more complicated.
I disagree that this kind of reasoning isn't usually required -- you already need to keep an eye on what locks are currently held on your thread in order to avoid deadlocks, otherwise it seems likely that you'll hit lock ordering issues.
In parking_lot I added a read_recursive method which explicitly allows recursive read locks, but then you lose the no-writer-starvation guarantee.
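A small illustration of the difference, assuming the parking_lot crate (the sleep is only there to make the writer reliably start waiting before the second read; the names and numbers are arbitrary):

```rust
use parking_lot::RwLock;
use std::{sync::Arc, thread, time::Duration};

fn main() {
    let lock = Arc::new(RwLock::new(0u32));

    let first_read = lock.read(); // this thread now holds a read lock

    // Another thread queues up for a write lock and blocks behind the reader.
    let for_writer = Arc::clone(&lock);
    let writer = thread::spawn(move || {
        *for_writer.write() += 1;
    });
    thread::sleep(Duration::from_millis(100));

    // `lock.read()` here may deadlock: to avoid writer starvation, new read
    // locks are not handed out while a writer is waiting. `read_recursive`
    // skips that check, at the cost of potentially starving the writer.
    let second_read = lock.read_recursive();

    drop((first_read, second_read));
    writer.join().unwrap();
}
```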
I disagree that this kind of reasoning isn't usually required -- you already need to keep an eye on what locks are currently held on your thread in order to avoid deadlocks, otherwise it seems likely that you'll hit lock ordering issues.
You need to look out for conflicts between writers and other lock users, sure. Same as with RefCell.
But not usually for mutual conflicts between all lock users. (For RwLocks.) This is a departure from RefCell.
This is a departure from RefCell.
IMO, RwLock shouldn't be treated as a drop-in for RefCell. The latter only tracks state (readers/writers), while the former synchronizes state (by blocking the caller until the desired state is achievable). An atomic RefCell more closely matches the use of the try_read()/try_write() methods on an RwLock instead.
You need to look out for conflicts between writers and other lock users, sure. Same as with RefCell.
No, you need to look out for every other lock currently held, in addition to all other uses of the same lock. It's a potential deadlock if you acquire A then B in one place, and B then A somewhere else. This can be true even for read acquires. It doesn't feel that different to me, but it is wholly unlike anything you must consider for RefCell.