golang/go

sync: RWMutex scales poorly with CPU count

bcmills opened this issue · 52 comments

On a machine with many cores, the performance of sync.RWMutex.R{Lock,Unlock} degrades dramatically as GOMAXPROCS increases.

This test program:

package benchmarks_test

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkRWMutex(b *testing.B) {
	for ng := 1; ng <= 256; ng <<= 2 {
		b.Run(fmt.Sprint(ng), func(b *testing.B) {
			var mu sync.RWMutex
			mu.Lock()

			var wg sync.WaitGroup
			wg.Add(ng)

			n := b.N
			quota := n / ng

			for g := ng; g > 0; g-- {
				if g == 1 {
					quota = n
				}

				go func(quota int) {
					for i := 0; i < quota; i++ {
						mu.RLock()
						mu.RUnlock()
					}
					wg.Done()
				}(quota)

				n -= quota
			}

			if n != 0 {
				b.Fatalf("Incorrect quota assignments: %v remaining", n)
			}

			b.ResetTimer() // discard setup time; the timer was running while the goroutines were spawned
			mu.Unlock()
			wg.Wait()
			b.StopTimer()
		})
	}
}

degrades by a factor of 8x as it saturates threads and cores, presumably due to cache contention on &rw.readerCount:

# ./benchmarks.test -test.bench . -test.cpu 1,4,16,64
testing: warning: no tests to run
BenchmarkRWMutex/1      20000000                72.6 ns/op
BenchmarkRWMutex/1-4    20000000                72.4 ns/op
BenchmarkRWMutex/1-16   20000000                72.8 ns/op
BenchmarkRWMutex/1-64   20000000                72.5 ns/op
BenchmarkRWMutex/4      20000000                72.6 ns/op
BenchmarkRWMutex/4-4    20000000               105 ns/op
BenchmarkRWMutex/4-16   10000000               130 ns/op
BenchmarkRWMutex/4-64   20000000               160 ns/op
BenchmarkRWMutex/16     20000000                72.4 ns/op
BenchmarkRWMutex/16-4   10000000               125 ns/op
BenchmarkRWMutex/16-16  10000000               263 ns/op
BenchmarkRWMutex/16-64   5000000               287 ns/op
BenchmarkRWMutex/64     20000000                72.6 ns/op
BenchmarkRWMutex/64-4   10000000               137 ns/op
BenchmarkRWMutex/64-16   5000000               306 ns/op
BenchmarkRWMutex/64-64   3000000               517 ns/op
BenchmarkRWMutex/256    20000000                72.4 ns/op
BenchmarkRWMutex/256-4  20000000               137 ns/op
BenchmarkRWMutex/256-16  5000000               280 ns/op
BenchmarkRWMutex/256-64  3000000               602 ns/op
PASS

A "control" test, calling a no-op function instead of RWMutex methods, displays no such degradation: the problem does not appear to be due to runtime scheduling overhead.

It may be difficult to apply the algorithm described in that paper to our existing sync.RWMutex type. The algorithm requires an association between the read-lock operation and the read-unlock operation. It can be implemented by having read-lock/read-unlock always occur on the same thread or goroutine, or by having the read-lock operation return a pointer that is passed to the read-unlock operation. Basically the algorithm builds a tree to avoid contention, and requires each read-lock/read-unlock pair to operate on the same node of the tree.

It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions, sync.Mutex, and sync.Cond. That is, it doesn't seem to require any special relationship with the runtime package.

What happens when a goroutine moves to a different P between read-lock and read-unlock?

RLock must return a proxy object to unlock; that object must hold the index of the P that was locked.

The existing RWMutex API allows the RLock call to occur on a different goroutine from the RUnlock call. We can certainly assume that most RLock / RUnlock pairs occur on the same goroutine (and optimize for that case), but I think there needs to be a slow-path fallback for the general case.

(For example, you could envision an algorithm that attempts to unlock the slot for the current P, then falls back to a linear scan if the current P's slot wasn't already locked.)

At any rate: general application code can work around the problem (in part) by using per-goroutine or per-goroutine-pool caches rather than global caches shared throughout the process.

The bigger issue is that sync.RWMutex is used fairly extensively within the standard library for package-level locks (the various caches in reflect, http.statusMu, json.encoderCache, mime.mimeLock, etc.), so it's easy for programs to fall into contention traps and hard to apply workarounds without avoiding large portions of the standard library. For those use-cases, it might actually be feasible to switch to something with a different API (such as having RLock return an unlocker).

For these cases in the standard library, atomic.Value is a much better fit. It is already used in json, gob, and http. atomic.Value is perfectly scalable and has virtually zero overhead for readers.

I agree in general, but it's not obvious to me how one could use atomic.Value to guard lookups in a map acting as a cache. (It's perfect for maps which do not change, but how would you add new entries to the caches with that approach?)

What kind of cache do you mean?

E.g., the one in reflect.ptrTo.

See e.g. encoding/json/encode.go:cachedTypeFields

Hmm... that trades a higher allocation rate (and O(N) insertion cost) in exchange for getting the lock out of the reader path. (It basically pushes the "read lock" out to the garbage collector.)

And since most of these maps are insert-only (never deleted from), you can at least suspect that the O(N) insert won't be a huge drag: if there were many inserts, the maps would end up enormously large.

It would be interesting to see whether the latency tradeoff favors the RWMutex overhead or the O(N) insert overhead for more of the standard library.
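The trade-off being discussed can be made concrete with a copy-on-write cache (illustrative names, not the actual standard-library code): the read path is a single atomic.Value load with no lock, and each insert copies the whole map under a plain Mutex, which is the O(N) cost mentioned above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cowCache is a copy-on-write map: readers never lock; writers replace the
// whole map, and the GC reclaims the old one once readers drop it.
type cowCache struct {
	mu sync.Mutex   // serializes writers only
	v  atomic.Value // holds an immutable map[string]int
}

func newCowCache() *cowCache {
	c := &cowCache{}
	c.v.Store(map[string]int{})
	return c
}

func (c *cowCache) get(k string) (int, bool) {
	m := c.v.Load().(map[string]int) // never mutated after Store
	val, ok := m[k]
	return val, ok
}

func (c *cowCache) put(k string, val int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := c.v.Load().(map[string]int)
	next := make(map[string]int, len(old)+1) // O(N) copy on every insert
	for k2, v2 := range old {
		next[k2] = v2
	}
	next[k] = val
	c.v.Store(next)
}

func main() {
	c := newCowCache()
	c.put("answer", 42)
	v, ok := c.get("answer")
	fmt.Println(v, ok) // prints 42 true
}
```

For insert-mostly-never caches like the ones listed above, the copy cost is paid rarely while the lock-free read is paid on every lookup.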

@dvyukov Thanks. For the reflect.ptrTo case I wrote it up as https://golang.org/cl/33411. It needs some realistic benchmarks--microbenchmarks won't prove anything one way or another.

It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions, sync.Mutex, and sync.Cond.

Indeed. Seems like it might be worth an experiment. If nothing else, might end up being useful at go4.org.

For these cases in std lib atomic.Value is much better fit.

Not always. atomic.Value can end up being a lot more code and complication. See CL 2641 for a worked example. For low-level, performance-critical things like reflect, I'm all for atomic.Value, but for much of the rest of the standard library it'd be nice to fix the scalability of RWMutex (or to have a comparably easy-to-use alternative).

Note that in most of these cases insertions happen very, very infrequently: only during server warmup, when the process receives a first request of a new type or something similar. Reads, meanwhile, happen all the time. Also, no matter how scalable RWMutex is, it still blocks all readers during updates, increasing latency and causing large blocking/unblocking overheads.

For the reflect.ptrTo case I wrote it up as https://golang.org/cl/33411. It needs some realistic benchmarks--microbenchmarks won't prove anything one way or another.

Just benchmarked it on realistic benchmarks in my head. It is good :)

See CL 2641 for a worked example.

I would not say that it is radically more code and complication. Provided that one does it right the first time, rather than do it non-scalable first and then refactor everything.

CL https://golang.org/cl/33411 mentions this issue.

https://go-review.googlesource.com/#/c/33852/ has a draft for a more general API for maps of the sort used in the standard library; should I send that for review? (I'd put it in the x/sync repo for now so we can do some practical experiments.)

@jonhoo built a more scalable RWMutex here: https://github.com/jonhoo/drwmutex/

@minux That would be easy enough to fix by using runtime.NumCPU() instead.

@minux I did at some point have a benchmark running on a modified version of Go that used P instead of CPUID. Unfortunately, I can't find that code any more, but from memory it got strictly worse performance than the CPUID-based solution. The situation could of course be different now though.

I attempted to build a scalable version of the "Left-Right lock" from http://concurrencyfreaks.blogspot.com/2013/12/left-right-classical-algorithm.html to address @dvyukov's "Also, no matter how scalable RWMutex is, it still blocks all readers during updates increasing latency and causing large overheads for blocking/unblocking" comment above and ran into the same issue. Specifically, there appears to be insufficient support from the runtime for such algorithms.

See https://github.com/balasanjay/lrlock/blob/master/runtime.go for the exact set of functions I'd need from the runtime to make it work. (It may seem like runtime.GOMAXPROCS(-1) suffices for the first one, but it unexpectedly takes a global lock internally, even when it's not called for a mutation.)

It seems to me that such functions are general-purpose enough to support a few different use-cases. Is there any interest in a proposal adding support for some similar functions for Go 1.9?

I should note that sync.Pool's implementation has an internal hook it uses to get access to a similar "eventually consistent" proc id:

pid := runtime_procPin()

It'd be nice to expose a primitive form of this hook (without exposing procPin/procUnpin), preferably as a function backed by an SSA intrinsic.

@balasanjay The algorithm you linked seems to be missing something.

Reader:

  1. Read the current value of versionIndex and if it is 0 then increment readersVersion0. If it is 1 then increment readersVersion1;

Writer:

5. Toggle versionIndex from 0 to 1 or from 1 to 0;
6. If versionIndex is currently zero then wait until readersVersion1 is at zero, otherwise, wait until readersVersion0 reaches zero;

Doesn't that introduce a race between the writer updating versionIndex and the reader incrementing the corresponding readersVersion?

@balasanjay The LR algorithm seems orthogonal to making RWMutex more scalable: if you already have a scalable RWMutex, then you can implement the LR algorithm in terms of it (using RLock for the "increment readersVersionN" step and Lock for the "wait until readersVersionN reaches 0" step).

If you don't already have a scalable RWMutex, then you need some scalable mechanism for incrementing readersVersionN and checking whether it is nonzero. But if you have that, then it's fairly straightforward to implement a scalable RWMutex in terms of it.
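That construction can be sketched concretely (hypothetical names; a simplification of Left-Right in which the two readersVersion counters are modeled by two sync.RWMutexes exactly as suggested: RLock stands in for the increment, and Lock/Unlock for the wait-until-zero):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// leftRight keeps two copies of the data; readers always see a quiescent
// copy, and the writer never mutates a copy a reader might be using.
type leftRight struct {
	wmu     sync.Mutex   // serializes writers
	instIdx atomic.Int32 // which copy readers should read (0 or 1)
	verIdx  atomic.Int32 // which version mutex readers should RLock
	ver     [2]sync.RWMutex
	inst    [2]map[string]int
}

func newLeftRight() *leftRight {
	return &leftRight{inst: [2]map[string]int{{}, {}}}
}

func (lr *leftRight) read(k string) (int, bool) {
	v := lr.verIdx.Load()
	lr.ver[v].RLock() // "increment readersVersionN"
	defer lr.ver[v].RUnlock()
	val, ok := lr.inst[lr.instIdx.Load()][k]
	return val, ok
}

func (lr *leftRight) write(k string, val int) {
	lr.wmu.Lock()
	defer lr.wmu.Unlock()
	idx := lr.instIdx.Load()
	lr.inst[1-idx][k] = val   // 1. update the copy no reader is using
	lr.instIdx.Store(1 - idx) // 2. route new readers to it
	v := lr.verIdx.Load()
	lr.verIdx.Store(1 - v) // 3. toggle the version...
	lr.ver[v].Lock()       // ...and wait for old-version readers to drain
	lr.ver[v].Unlock()
	lr.inst[idx][k] = val // 4. now safe to update the other copy
}

func main() {
	lr := newLeftRight()
	lr.write("a", 1)
	v, ok := lr.read("a")
	fmt.Println(v, ok) // prints 1 true
}
```

One deviation from the paper: a late reader that loaded the old verIdx just before the toggle will briefly block on the draining Lock (Go's RWMutex holds back new readers once a writer is pending), so readers here are only wait-free in the common case. A scalable RWMutex would restore the scalability of the increment itself.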

@bcmills I think it is a semantic race, but I also think we both slightly misunderstood the purpose of the readersVersion counters. Specifically, I think we had both assumed that they were effectively reference counters for the left and right data structures, but now that I read the paper more closely, they are really a handshake mechanism between the reader and the writer.

Specifically, whether the reader sees the old or the new versionIndex, we know that it has not yet read the leftRight variable (because the reader reads that second). The writer has at this point already updated leftRight, so no matter which versionIndex the reader sees, the reader will always read the new data. It is possible that the current writer will still spuriously wait for this reader to complete its read, but that doesn't really hurt anything. The part of the paper that made this make sense to me is "2.2.1 Versioning Mechanism".

This made me realize that my implementation definitely has a critical bug, since it was built on that faulty ref-counting assumption. Specifically, it'll try to resize the array of counters to match GOMAXPROCS after waiting on it, and that'll allow an actual data race.

Regardless, I think you're right. If we had a scalable RWMutex, it seems straightforward to me to implement the LR algorithm using it as a primitive. The construction would be wait-free in practice for readers.

Both @dvyukov's prototype scalable RWMutex above and the non-busted version of the LR algorithm will require an API to store per-proc data that gracefully handles GOMAXPROCS being resized. I do think at some point that will become a necessity, to avoid depressing things like https://github.com/prometheus/client_golang/blob/606b8f85e51a39f527623efe2599804c28f5bce0/prometheus/value.go#L100, which is used by k8s/etcd/cockroachdb.

We already have per-P data; see e.g. sync.Pool. But we decided not to expose it to all packages, because (1) it is too brittle, and (2) designing a stable API for it is hard. And it won't save you from the depressing things: even if you have per-P data, you still need to use atomic operations, because goroutines can be rescheduled while working with per-P data.

CL https://golang.org/cl/36831 mentions this issue.

CL https://golang.org/cl/36617 mentions this issue.

CL https://golang.org/cl/41871 mentions this issue.

CL https://golang.org/cl/41930 mentions this issue.

CL https://golang.org/cl/41931 mentions this issue.

CL https://golang.org/cl/41990 mentions this issue.

CL https://golang.org/cl/41991 mentions this issue.

CL https://golang.org/cl/42110 mentions this issue.

CL https://golang.org/cl/42113 mentions this issue.

					for i := 0; i < quota; i++ {
						mu.RLock()
						mu.RUnlock()
					}

It is pointless to use RWMutex when you do nothing under the lock.

For RLock to be useful and performant/scalable, you really need to have some work between RLock and RUnlock.
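For reference, a read-side benchmark with actual work under the lock (an illustrative map lookup; names are hypothetical) might look like the following, runnable via testing.Benchmark:

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// BenchmarkRWMutexWork measures RLock/RUnlock around an actual shared read,
// so the lock protects something instead of being measured back-to-back
// with itself.
func BenchmarkRWMutexWork(b *testing.B) {
	var mu sync.RWMutex
	m := make(map[int]int, 64)
	for i := 0; i < 64; i++ {
		m[i] = i
	}
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			mu.RLock()
			_ = m[i&63] // the work the read lock actually protects
			mu.RUnlock()
			i++
		}
	})
}

func main() {
	fmt.Println(testing.Benchmark(BenchmarkRWMutexWork))
}
```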

Maybe fixed here #33747?

I spent some time hacking on this over the holidays, I'll try and have some CLs ready sometime this week.

(@networkimprov seems unlikely, this issue is about a more fundamental problem with RWMutexes)

The good news is that the new implementation significantly helps the performance of this benchmark in the multi-threaded case.

name                      old time/op  new time/op   delta
RWMutexIssue17973/1-48    12.5ns ± 8%   24.2ns ± 8%    +92.82%  (p=0.008 n=5+5)
RWMutexIssue17973/4-48    34.3ns ± 6%   10.3ns ± 5%    -69.98%  (p=0.008 n=5+5)
RWMutexIssue17973/16-48   36.9ns ± 4%    2.8ns ± 1%    -92.55%  (p=0.016 n=5+4)
RWMutexIssue17973/64-48   35.9ns ± 3%    1.7ns ± 3%    -95.38%  (p=0.008 n=5+5)
RWMutexIssue17973/256-48  36.3ns ± 3%    1.7ns ± 3%    -95.41%  (p=0.008 n=5+5)

The bad (somewhat expected) news is that it penalizes write-heavy workloads significantly. An example is the current benchmark "RWMutexUncontended":

name                      old time/op  new time/op   delta
RWMutexUncontended-48     2.29ns ± 0%  74.92ns ± 2%  +3171.62%  (p=0.016 n=4+5)
RWMutexWrite100-48         145ns ± 7%     96ns ± 7%    -33.70%  (p=0.008 n=5+5)
RWMutexWrite10-48          101ns ± 2%    153ns ± 5%    +51.74%  (p=0.008 n=5+5)
RWMutexWorkWrite100-48     355ns ± 2%    211ns ±20%    -40.65%  (p=0.008 n=5+5)
RWMutexWorkWrite10-48      535ns ± 8%    463ns ±13%    -13.47%  (p=0.016 n=5+5)

Coincidentally, RWMutexUncontended is perfectly designed as a torture test for this implementation. It takes 2 concurrent read-locks (which puts the lock into contention mode), and then takes a write lock, which has to take the lock back out of contention mode by doing an O(n) operation (where n here is GOMAXPROCS).

I need to write some more torture mode benchmarks that tickle the worst cases for the new implementation, and then I'll try to figure out how to upload these CLs to Gerrit.

Also, the new implementation is somewhat more lax about detecting mutex misuse. E.g. if you RUnlock() an unlocked mutex, the current implementation will consistently crash your program; the new implementation might not immediately do so when in contention mode. It might take several more RUnlocks before it is certain of the misuse and throws (though calling Lock() on the corrupted mutex will immediately detect the issue and throw).

Change https://golang.org/cl/215360 mentions this issue: sync: refactor RWMutex slightly to prepare for future changes.

Change https://golang.org/cl/215364 mentions this issue: sync: implement the RWMutex's lockTable type.

Change https://golang.org/cl/215361 mentions this issue: sync: Implement a version of RWMutex that can avoid cache contention.

Change https://golang.org/cl/215365 mentions this issue: sync: write a benchmark for RLocks with a full table.

Change https://golang.org/cl/215359 mentions this issue: sync: add benchmark for issue 17973.

Change https://golang.org/cl/215362 mentions this issue: sync: Implement a procLocal abstraction.

Update: the two problems with RWMutex are that it has poor multi-core scalability (this issue) and that it's fairly big (it takes up 40% of a cache line on its own). I originally decided to look at multi-core scalability first, but upon reflection it makes more sense to tackle these problems in the other order. I intend to return to this issue once #37142 is resolved.

There is one more possible approach to making RWMutex more scalable for readers - BRAVO (Biased Locking for Reader-Writer Locks) algorithm:
https://github.com/puzpuzpuz/xsync#rbmutex

It may be seen as a variation of D.Vyukov's DistributedRWMutex. Yet, the implementation is different since it wraps a single RWMutex instance and uses an array of reader slots to distribute the RLock attempts. It also returns a proxy object to the readers and internally uses a sync.Pool to piggyback on its thread-local behavior (obviously, that's a user-land workaround, not something mandatory).

As you'd expect, read-lock acquisitions scale better at the cost of more expensive write locks.

I'm posting this for the sake of listing all possible approaches.

ntsd commented

I made a benchmark, but I'm not sure if the code is correct. Hope this helps.

For 10 goroutines at 100k iterations each, RWMutex is fine for writes and slightly better for reads.
For 100 goroutines at 10k iterations each, RWMutex is slower for writes and impressive for reads.
What surprises me is that sync.Map did better at higher concurrency.

https://github.com/ntsd/go-mutex-comparison?tab=readme-ov-file#test-scenarios