golang/go

proposal: runtime: expose current thread id or processor id

funny-falcon opened this issue · 68 comments

Often there is a need to reduce lock contention in algorithms.
The usual way to accomplish this is sharding, i.e. several separate locks protect non-intersecting parts of a larger state.
But then there is a need for a fast (and preferably non-locking) way to determine the shard id.
And it would be great if such a shard id also helped to further reduce lock contention.

  • The best way to do it is exposing the current CPU number. On Intel CPUs this could be done with the CPUID instruction.
    CPUID returns an APIC-ID, and those are not consecutive, so the runtime should probably analyze the CPU topology to convert the APIC-ID into a consecutive CPU number. On the other hand, given that some architectures allow hot CPU swapping, it may be better to return the APIC-ID as is.
  • The simple way is to expose the thread id getg().m.id as runtime.Mid(). Because a program can contain a lot of threads (due to system calls or cgo calls), there is also a need for an additional fast random value, so I propose to additionally expose runtime.fastrand() as runtime.Fastrand() (mixing it with a secret random key to hide the original fastrand value).

"Best way" could be implemented in some "golang/x/" package. "Simple way" needs to implement in "runtime" at least wrapper around "getg().m.id", to link with this wrapper from external library (if there is a way to call private "runtime" function from external package).

Historically we have always rejected goroutine-local information.

Your proposal is incomplete: you need to define exactly when the value is permitted to change, and you need to do so without overly constraining the runtime. Right now a g is permitted to move to a different m or p at any function call, which includes several implicit function calls inserted by the compiler for things like map operations. In order to address #10958 it is likely that in a future release a g will be permitted to move to a different m or p in any loop. It's not enough to define runtime.Mid without also providing a clear explanation for when it will return a different value.

The discussion in #17973 may be of interest.

The result of runtime.Mid() could change at any moment.
The value returned from Mid() doesn't eliminate the need for locking: since the number of threads can be much larger than the preallocated number of shards, locks are inevitable.
And since several threads can map to the same shard number, there is a need for runtime.Fastrand(): shards are bucketized, the bucket is chosen by Mid, and the shard within the bucket by Fastrand.
Mid() combined with Fastrand() is just a hint for reducing lock contention; it doesn't eliminate locks completely.

And since several threads can map to the same shard number, there is a need for runtime.Fastrand(): shards are bucketized, the bucket is chosen by Mid, and the shard within the bucket by Fastrand.

If runtime.Mid changes (e.g., because the goroutine is now being run on a new OS thread), how do you make sure a goroutine continues to use the correct bucket?

In case the answer is that each goroutine only calls runtime.Mid once, then why not just use math/rand.Int?

make sure a goroutine continues to use the correct bucket?

It is just a hint to select a bucket. The per-shard mutex will still be acquired, so nothing "fails" if runtime.Mid changes after the call. But most of the time it will not have changed, and this will help to reduce mutex contention.

The problem with math/rand.Int is the single global mutex around this call. See my previous closed proposal: #18514

BenchmarkFastRand-4             500000000                3.49 ns/op
BenchmarkFastRand-4             500000000                3.49 ns/op
BenchmarkMathRand-4             10000000               162 ns/op
BenchmarkMathRand-4             10000000               159 ns/op

Does syscall.Gettid not give you a thread ID that roughly meets this criteria already? Why does the runtime need to provide it?

I bet syscall.Gettid is expensive. No?

Probably, syscall.Gettid() is not so expensive on Linux, but it is much more expensive on other platforms (and probably doesn't exist on Windows).

@kostya-sh Here is an example of usage of runtime.Mid() and runtime.Fastrand() based on your benchmark: https://gist.github.com/funny-falcon/d8adc39e777b79e56274a65abca29e78

The results are:

BenchmarkRandInt63_Global-4                     10000000               154 ns/op
BenchmarkRandInt63_Channel-4                    10000000               220 ns/op
BenchmarkRandInt63_Pool-4                       50000000                28.8 ns/op
BenchmarkSourceInt63_Pool-4                     50000000                34.9 ns/op
BenchmarkRandInt63_Source-4                     300000000                5.71 ns/op
BenchmarkRandInt63_ShardsGettid-4               50000000                29.2 ns/op
BenchmarkRandInt63_ShardsMid-4                  100000000               14.2 ns/op
BenchmarkRandInt63_ShardsFastrand-4             30000000                49.0 ns/op
BenchmarkRandInt63_ShardsMidFastrand-4          100000000               17.1 ns/op

It shows that using Mid() (and its combination with Fastrand()) can be almost twice as fast as a single Pool.
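
(For reference, a rough sketch of what a "ShardsMid"-style random source could look like, assuming the hypothetical runtime.Mid(); the names and shard count are illustrative, not the code from the gist.)

package shardedrand

import (
	"math/rand"
	"runtime"
	"sync"
	"time"
)

const numShards = 64 // illustrative

// lockedSource is one independently locked rand.Source; the padding
// reduces false sharing between shards.
type lockedSource struct {
	mu  sync.Mutex
	src rand.Source
	_   [40]byte
}

var shards [numShards]lockedSource

func init() {
	for i := range shards {
		shards[i].src = rand.NewSource(time.Now().UnixNano() + int64(i))
	}
}

// Int63 takes only the mutex of the shard hinted at by the hypothetical
// runtime.Mid(), instead of the single global lock behind math/rand's
// top-level functions.
func Int63() int64 {
	s := &shards[runtime.Mid()%numShards]
	s.mu.Lock()
	v := s.src.Int63()
	s.mu.Unlock()
	return v
}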

Looks like exposing getg().m.p.ptr().id gives even better results.

It looks like sync.Pool already uses the internally exposed getg().m.p.ptr().id (through runtime_procPin). So the effectiveness of sharding by P.id is known to the core team.

So why not expose it to regular Go users?

(Of course, runtime_procPin does more than simply return the current P.id.
But just returning P.id from a runtime.Pid() would already be a great help.)

So why not expose it to regular Go users?

Because Go considers simplicity a virtue. When there are trade-offs between simplicity and performance, Go chooses simplicity. You seem to have a different value trade-off, and it's obvious you prefer performance above all else. That's probably why many of your proposals to Go end up getting rejected. Please keep that in mind for future proposals.

We understand that things can always be faster, and we want things to be fast, but only if things are also simple and have clean APIs. We do not want to document and support and export all possible innards.

But such a thing as P.id is a very simple thing. Does anyone expect the core scheme of the Go runtime scheduler to change? Will P ever be removed?

That's probably why many of your proposals to Go end up getting rejected.

"just keep pushing" :-)

Pushing the same ideas over and over is actually more likely to annoy people. Please try to understand the space of trade-offs that Go strives for.

But such a thing as P.id is a very simple thing.

That's not the point. We don't expose every "simple thing" just because it's simple. We also have to then document it, support it, test it, and keep it available forever. And we have to consider whether it's redundant with other API, or will be misused, etc.

Does anyone expect the core scheme of the Go runtime scheduler to change?

It has in the past.

@funny-falcon My main impression from this thread is it's not obvious that the design space has been exhausted. E.g., you could already locklessly assign a unique ID to each worker yourself. Why have you ruled that out?

@mdempsky the single lockless approach available in Go is atomic.AddUintxx, but it also doesn't scale with a lot of CPU cores because of cache-line contention.

Any "lock" or "lockless" technique will fail to scale without hint about current scheduling item. That is why it were used for sync.Pool, yes?

@jonhoo uses the CPUID instruction and parses '/proc/cpuinfo', but it works only on Linux x86: https://github.com/jonhoo/drwmutex

The Go runtime is happy to use per-P or per-M data in support of specific, simple, well-defined functionality that can, if necessary, be provided in other ways. If you look back at the history of sync.Pool, you will see a lot of discussion about exactly these issues: is sync.Pool simple enough, well-defined enough, for people to use successfully.

To me your proposal seems very loosely defined. It's nearly indistinguishable from "give me a random number in the range 0 to GOMAXPROCS, and try to give me the same random number every time, but don't worry if you don't." That's not a well-defined API that people can use to write good code, because the expectations are not clear. Since the expectations are not clear, people will rely on the implementation, and that means that if we change the implementation the performance characteristics of people's code will change. That's very problematic for a feature that exists only for its effect on performance.

Pushing the same ideas over and over is actually more likely to annoy people.

I pushed different ideas. Yes, I am a bit annoying. But someone still should push strange ideas. At the very least, there will be a documented rejection of this one.

We don't expose every "simple thing" just because it's simple.

That is why there is a discussion of this proposal: to decide whether it is worthwhile or not.

It has in the past.

Given that the two most successful current green-thread runtimes (Go and Erlang) have a similar structure (at least, they both have per-core schedulers - P), a next big change seems quite unlikely. I could be mistaken.

@davecheney My original use-case is still RPC. I just realized that runtime.Pid() is simply better.
The database for which I'm maintaining a connector can easily handle 1Mrps, and at this rate every possibility for improvement is worthwhile.

I saw "per-cpu local info" requests at least several times during discussion of this and previous proposal (walking by links), so it doesn't look like it is just my crazy need. And sharding by runtime.Pid() looks like very close approximation, even if Mutex is still involved (ie no pinning like for sync.Pool).

@ianlancetaylor

It's nearly indistinguishable from "give me a random number in the range 0 to GOMAXPROCS, and try to give me the same random number every time, but don't worry if you don't."

Api for "Pid()" is "give me a number, that approximates scheduling internals: cpu number or scheduler (P), - so that two goroutines that calls this method simultaneously (ie in parallel) will receive different numbers with very high probability, so if I use it for choosing mutex or other synchronization primitive, it will likely to be non-contended. Preferable it should be bound with understandable small value (num-of-cpu cores or GOMAXPROCS)."

Currently I choose the shard to put a request into by atomic increment. But that means that, even if a goroutine is not preempted, it will spread its requests across many shards. (A shard is a data structure that aggregates requests before the network writer writes them into the connection.)

With Pid(), cache locality would be much improved and lock contention decreased.

Another obvious use-case for Pid is a statistics collector: without such a hint even atomic operations will fail to scale on a multi-core system. With such a hint, statistics collection will scale smoothly.

Another obvious use-case for Pid is a statistics collector: without such a hint even atomic operations will fail to scale on a multi-core system. With such a hint, statistics collection will scale smoothly.

Please, can we talk about your specific problem, not other problems that other people may be able to solve with this feature.

Let it be called not Pid but ProcHint():

// ProcHint gives a hint about the current scheduler item
// (for example, the CPU number, or the id of an internal scheduler).
// ProcHint is bounded by 0 <= ProcHint < ProcHintMax.
// You may use it to reduce lock contention if you allocate
// ProcHintMax independent mutexes or channels and
// then choose between them using ProcHint.
func ProcHint() int

// ProcHintMax gives the upper bound of ProcHint.
// Note: it can change at run time, for example,
// if you change GOMAXPROCS or if you hot-swap a CPU.
func ProcHintMax() int
minux commented

but hard to formally design and get accepted.

Hard doesn't mean impossible. It means more work has to be put in.

// ProcMaxHint returns the approximate number of scheduler items
// (for example, CPU cores or runtime scheduler Ps).
// It tends to be dense, i.e. schedulers are likely to have contiguous ids
// in the range 0 <= id < ProcMaxHint.
// Note: it can change at run time, for example, if you change
// GOMAXPROCS, if you hot-swap a CPU, or if you call runtime.LockOSThread().
func ProcMaxHint() int {}

// ProcHint returns an approximate hint about the current scheduler id
// (for example, the CPU core number, or the id of an internal scheduler).
// ProcHint is likely to be in the range 0 <= id < ProcMaxHint if ProcMaxHint has not changed
// (but note that ProcMaxHint can change at run time).
// You may use it to reduce lock contention if you allocate
// ProcMaxHint independent mutexes or channels and then choose between them using ProcHint.
//
//     type counterMap struct {
//         mu   sync.Mutex
//         c    map[string]int
//         _pad [64]byte
//     }
//     type Stat []counterMap
//     func NewStat() Stat {
//         s := make(Stat, runtime.ProcMaxHint())
//         for i := range s {
//             s[i].c = make(map[string]int)
//         }
//         return s
//     }
//     func (s Stat) Increment(key string, val int) {
//         // since ProcMaxHint can change,
//         // ProcHint can be larger than len(s)
//         i := runtime.ProcHint() % len(s)
//         s[i].mu.Lock()
//         s[i].c[key] += val
//         s[i].mu.Unlock()
//     }
func ProcHint() int {}
minux commented
  1. It is not really important in practice to have a hard guarantee that the hint is less than the max. It is mentioned in the functions' docs and clearly shown in the example that user code simply has to be ready for the hint to be greater than a previously gathered max hint.
    (In the same way, the code in sync.Pool has to be ready for P.id to be greater than a previously gathered max.)
  2. There is no need for a definition of "how dense". A user will use this API when lock contention is much more important to them than exact memory usage. It is enough for the user to know that it is "adequately dense".
    And by the way, there is a definition: "It tends to be dense, i.e. schedulers are likely to have contiguous ids in the range 0 <= id < ProcMaxHint." "Tends... to have contiguous ids in range" means the density is usually quite close to 1, no? It could probably be formulated in a clearer way; I just don't know English that well.
  3. There is no need for this guarantee. It is explicitly called a "Hint" and an "approximate value", and the documentation explicitly says that it doesn't eliminate the need for synchronization. It just helps reduce lock contention, but doesn't guarantee to eliminate it.

Some of us on the Go team at Google have been kicking around a different idea for a year or so that I think would solve this without exposing implementation details. I don't think there's been any online discussion of this idea, so I'll sketch out a proposal and link it from here. (However, I'm writing a talk for Friday, so don't expect it right away. :)

Just for kicks I tried to see just how far I could push sync.Pool. Turns out, the answer is "pretty far": a per-thread library using sync.Pool and runtime.SetFinalizer turns out to be hacky and a bit space-inefficient, but the API is pretty usable and the performance is better than I expected.

For the curious:
https://go-review.googlesource.com/#/c/35676/

@bcmills I thought about an "allocator style" API. I even started to write a comment with a definition of such an API. Unfortunately I realized it would be tricky to use:

  • sometimes I need to lock all allocated "pieces"
  • to do that I have to call Range() and lock each item one by one. But I also have to be sure no new items will be allocated at that moment (it is ok to return already allocated items).
  • that leads to a single mutex in or around the Allocator, which returns us to lock contention.

So I decided not to send that comment (although I spent an hour writing it). I've come to understand that the application/library knows more about its allocation and locking pattern than could be guessed in advance. So a "*Hint" API would be more convenient.

Probably I missed something. Maybe there is a super-lightweight synchronization trick around the number of allocated items.

For example, the Allocator may force preallocation for all current Ps, and not allocate new items without a call to "Reallocate()". (And probably there should be a method "WantsToReallocate()".)

@funny-falcon

  • to do that I have to call Range() and lock each item one by one. But I also have to be sure no new items will be allocated at that moment (it is ok to return already allocated items).
  • that leads to a single mutex in or around the Allocator, which returns us to lock contention.

It does lead to a mutex in the Allocator, and that's fine in practice. You don't need to eliminate all contention: you only need to eliminate contention in the steady-state.

In my experience, the calls that require Range tend to be infrequent enough that their impact on steady-state contention is minimal.

That said, you could use an approach similar to the one in change 33912 to avoid locking on the Range path entirely in the steady state, so that you only take the mutex when actually allocating and perhaps for a small number of calls thereafter (to reduce the overhead of reallocating the lock-free data structures).
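
(To illustrate that last point: this is only a sketch of the "lock only when allocating" idea, not the API from change 33912. The Allocator type and its methods here are hypothetical; the trick is publishing a copy-on-write slice through an atomic.Value so Range never needs the mutex.)

package allocator

import (
	"sync"
	"sync/atomic"
)

// Allocator hands out values and lets callers iterate over everything
// allocated so far.
type Allocator struct {
	mu    sync.Mutex   // taken only while allocating
	items atomic.Value // holds a []interface{}, replaced copy-on-write
	newFn func() interface{}
}

func New(newFn func() interface{}) *Allocator {
	a := &Allocator{newFn: newFn}
	a.items.Store([]interface{}(nil))
	return a
}

// Get allocates a new value under the mutex and publishes a fresh copy
// of the slice, so concurrent Range calls never take the lock.
func (a *Allocator) Get() interface{} {
	a.mu.Lock()
	defer a.mu.Unlock()
	v := a.newFn()
	old := a.items.Load().([]interface{})
	cur := make([]interface{}, len(old)+1)
	copy(cur, old)
	cur[len(old)] = v
	a.items.Store(cur)
	return v
}

// Range visits every value that was allocated before the call, lock-free.
func (a *Allocator) Range(f func(interface{})) {
	for _, v := range a.items.Load().([]interface{}) {
		f(v)
	}
}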

See also #8281

@bcmills, in my use case, I use the "shard number" as the low bits of the request number. So when the answer arrives, I can find the "shard" by these low bits.
But to use low bits, I have to know the maximum.
With my API, I would call ProcMaxHint() only once at startup, round it to the closest power of 2, and then mask ProcHint() to keep only the low bits:

   s.max = closestPower2(ProcMaxHint())
   s.shards = allocShards(s.max)
   // ... later, for each request:
   shardN := ProcHint() & (s.max - 1)
   shard := s.shards[shardN]
   reqId := shard.nextReq()

This would be difficult to do with the Allocator API.

Edit: fixed s/ProcMax/ProcHint/

@funny-falcon What is the underlying problem you're trying to solve with that sharding? There may be a clearer way to address it with an API other than the one you're proposing.

Note: I've fixed a misspelling in the previous example: s/ProcMax/ProcHint/

Fast RPC.
Sharding reduces lock contention while serving 1Mrps (through a single connection using 4 cores).
A shard contains a byte buffer for serialized requests and a (custom) hash map of requests waiting for a response.
Currently I use AtomicAdd(1) for choosing the shard.
With ProcHint, lock contention would be lower and CPU-cache usage better. It would probably help to scale past 4 cores and over 1Mrps.

A fixed number of shards is needed for the response mapping.
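
(For concreteness, the two shard-selection strategies might look roughly like this; ProcHint is the hypothetical function proposed above and the surrounding types are illustrative, not the connector's real code.)

package rpcconn

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// shard aggregates serialized requests and tracks the ones awaiting a response.
type shard struct {
	mu      sync.Mutex
	buf     []byte                 // batch buffer of serialized requests
	pending map[uint32]chan []byte // requests waiting for their response
}

type conn struct {
	counter uint32
	shards  []shard
}

// shardRoundRobin is the current approach: an atomic counter spreads
// consecutive requests across shards regardless of which P issued them.
func (c *conn) shardRoundRobin() *shard {
	n := atomic.AddUint32(&c.counter, 1)
	return &c.shards[n%uint32(len(c.shards))]
}

// shardByHint uses the hypothetical runtime.ProcHint(): requests issued
// from the same P tend to land in the same shard, so that shard's buffer
// and mutex stay warm in one core's cache.
func (c *conn) shardByHint() *shard {
	return &c.shards[runtime.ProcHint()%len(c.shards)]
}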

Another possible usage is collecting statistics. But that could be solved with the Allocator API.

Why is my API bad? It is low-level enough for efficient usage, but doesn't expose "internals". And it is useful enough for those who know what they want.

Allocator could satisfy me if there were a way to construct it prefilled with a fixed size, and then get elements by id:

func NewFixedAllocator(alloc func(id int) interface{}) *FixedAllocator {}
func (a *FixedAllocator) Get() interface{} {}
// Size never changes
func (a *FixedAllocator) Size() int {}
func (a *FixedAllocator) GetById(id int) interface{} {}

But still, a lower-level API could be much simpler and more useful.

I've posted a rough "sharded value" proposal as #18802. The API operates along similar lines to @bcmills' Allocator (not surprising, since the sharded value proposal came out of discussions with him), but the terminology is different and I've posed some trade-offs in the API design space.

@funny-falcon, perhaps I'm not understanding your use case, but why isn't a plain map the right answer for looking up requests awaiting response by shard?

@aclements, a map is much, much slower at this modification rate. It both locates and modifies values more slowly than the custom hash, and a single map has to be protected with a single mutex, which quickly becomes a bottleneck at a rate of 1M rps.

@funny-falcon

Sharding reduces lock contention while serving 1Mrps (through a single connection using 4 cores).
A shard contains a byte buffer for serialized requests and a (custom) hash map of requests waiting for a response.
Currently I use AtomicAdd(1) for choosing the shard.

You're still describing an implementation, not a use-case. You're telling me that you want to calculate a shard, but not what you're using the sharding to actually do. (This looks like an XY problem.)

I'm assuming you're describing a server-side optimization. A simple server serving immutable data should generally read a request, compute the response, and send the response all on the same goroutine / thread, so how does the sharding come into play? I'm sure I'm missing something about your use-case, but I have no idea what.

No, I'm talking about a connector to an external database capable of serving > 1Mrps. And users actually want to send 1Mrps.

Each of the 1M requests per second is different. Each request has to be sent over the TCP connection and then filled with its response.

I realized the fixed allocator API could be simpler:

// NewFixSharded preallocates all values by calling the alloc function, and returns a new FixSharded.
// FixSharded never changes its size, i.e. it never allocates a new value after construction.
func NewFixSharded(alloc func() interface{}) *FixSharded {}
// NewFixShardedN preallocates exactly n values by calling the alloc function, and returns a new FixSharded.
func NewFixShardedN(n int, alloc func() interface{}) *FixSharded {}
func (a *FixSharded) Get() interface{} {}

Edit: FixSharded.Get is non-locking, so application code should contain synchronization around value usage.
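
(A usage sketch for the proposed API, assuming the hypothetical FixSharded/NewFixShardedN above plus the "runtime" and "sync" imports; since Get is non-locking, the caller still uses the returned value's own mutex.)

// counter is one shard's worth of state.
type counter struct {
	mu sync.Mutex
	n  int64
}

// counters is preallocated once; Get later returns one of these values
// based on the caller's scheduling hint, without taking any lock itself.
var counters = NewFixShardedN(runtime.GOMAXPROCS(0), func() interface{} {
	return &counter{}
})

func increment() {
	c := counters.Get().(*counter)
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}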

@funny-falcon

No, I'm talking about a connector to an external database capable of serving > 1Mrps.

I still don't know what that means. How about some pseudocode, or (better still) a link to some real code?

This is the current code:

It is capable of sending 1M requests per second and waiting for 1M responses per second through one connection using 4 cores. The database ( https://tarantool.org ) does not even use 100% of a single core at this request rate, so there is room for improvement.

  • fetching "request" from hash map
  • reader fetches reads response

Don't those steps introduce cross-thread contention anyway? (How are you keeping those calls colocated to the same thread as the putFuture call?)

At any rate: that's very complicated, heavily asynchronous code. I would argue that it's not idiomatic for a Go program, and we should be optimizing for making idiomatic code more efficient rather than speeding up complicated code.

A much simpler example (e.g. a self-contained benchmark with a similar request flow) would be helpful.

"Simple should be simple, and complex should be possible".

If you do not add possibilities for the "complex cases", they will still be written in C++.

  • fetching "request" from hash map
  • reader fetches reads response

Don't those steps introduce cross-thread contention anyway?

Yes. But they are sharded by the incoming response id, and that path is much shorter than sending a request, because it doesn't include deserialization of the response (while sending includes serialization of the request).

Because of serialization into a reusable batch buffer, CPU locality matters more when sending a request than when filling a request with the response bytes.

cznic commented

If you do not add possibilities for the "complex cases", they will still be written in C++.

A language trying too hard to be loved by everyone ends up loved by few.

@cznic two simple low-level methods cannot destroy love for Golang. The impossibility of doing something efficiently can.

The number of those who will directly write code using the proposed methods will be small. And these people actually know what they want. For such people, low-level methods are usually better.

But the number of indirect users is much larger, because many more users will use the libraries written by the former group.

cznic commented

The number of those who will directly write code using the proposed methods will be small.

I rest my case 😉

@funny-falcon

The impossibility of doing something efficiently can [destroy love for Go].

The example you've provided is so complex that it's very difficult to see whether your use-case is truly impossible to address efficiently (both with the standard library as it stands today and with the alternative proposals).

Which is to say: one of the great strengths of Go is that it is reasonably efficient for programs implemented in a direct, simple style. In many cases, attempts to "optimize" Go programs using idioms from other languages (e.g. futures and callbacks) end up fighting against optimizations in the runtime (escape analysis, goroutine scheduling, the transaction-oriented collector, etc.).

If you want to argue that something is necessary on efficiency grounds (and, let's be clear, the Go compiler and runtime have lots of room for improvement!), the first step is to describe the problem you're trying to solve as idiomatic code and show where the bottlenecks occur (e.g. using pprof profiles or similar).

About idiomatic Go code:

  • unsafe is not idiomatic, but many libraries have to use it for performance
  • sync and sync/atomic are not idiomatic, but many libraries have to use them for performance.
  • more and more often people use fasthttp instead of the idiomatic net/http. Why? Because it is fast.

People choose Go because it is simpler and faster. But it is faster not only because it is "compiled and statically typed", but because it is possible to write fast libraries in it... and the use of sync, sync/atomic, and unsafe is a huge part of that possibility.

More possibilities to write faster libraries => more possibilities to write fast applications => more new users who want or need performance (hint: most Go users).

@bcmills show me an RPC implementation written in idiomatic Go and capable of performing 1M requests per second. Then I will never write any new silly proposal. I will just use/copy the code that you show me.

@funny-falcon There's no need to get hostile.

The point of asking for idiomatic code is not that it does, today, what you want it to do. The point is that it becomes much easier to see where and why it does not do what you want it to do, and the solution that enables your use-case is more likely to benefit the Go community as a whole rather than just one niche application.

And I disagree with what you wrote about unsafe, sync, and atomic. I would consider those (and even reflect) to be perfectly idiomatic in many situations. When I'm talking about "idiomatic Go code", I mean code with straightforward control flow, mostly synchronous function calls, mostly functional-style APIs, types with relatively few methods, methods with relatively few parameters, and so on.

The use of mostly synchronous functions is particularly important in that regard: synchronicity tends to produce good cache locality for server-style programs (with lots of hot "per-request" data, relatively little "background" data, and relatively few different kinds of tasks being performed), makes the behavior of the program clearer to the goroutine scheduler, and makes it a bit easier to thread parameters down through the call tree without introducing aliasing bugs (leading to, for example, the ability to allocate the "per thread" data at the top of a goroutine and pass it down through that goroutine rather than scattering map lookups throughout the code).

@bcmills I'm not even-tempered, but I'm not hostile.

I'm closing this now. Maybe I'm mistaken somewhere.

I'd just expose a single function in runtime package:

// PID returns an id currently associated with the CPU core.
//
// The function may return distinct values for the same CPU core in the long run.
// There are no guarantees on the maximum and minimum id values.
func PID() uint32 {
  return uint32(getg().m.p.ptr().id)
}

This functionality is the core of timers scalability CL - https://go-review.googlesource.com/#/c/34784/ .

@valyala, you did it! thank you for timers!

I'd just expose a single function in runtime package:

That is what I named ProcHint, because the core team doesn't want it to be named like a hook into the internals.

I agree, a single function could already be enough.

@bradfitz @bcmills the technique used by @valyala to improve timers in https://go-review.googlesource.com/#/c/34784/ shows exactly why runtime.ProcHint() (or Pid()) is useful for high-performance multi-core programming.

I agree with @valyala that ProcMaxHint is most likely not needed. It is enough to have a fixed number of "shards".

rsc commented

The valid uses of this function are far outweighed by the buggy invalid uses. There are of course uses of per-P data inside the runtime. That's not an argument for exposing P numbers outside the runtime. I would rather add a few higher-level APIs that are easier to use correctly, like the ShardedValue proposal, than add one low-level API that is very hard to use correctly.

Hey guys,

syscall.Gettid() works pretty well.

package main

import (
	"log"
	"runtime"
	"syscall"
)

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())

	num := 20
	sem := make(chan int, num)

	for i := 0; i < num; i++ {
		go func() {
			log.Println(syscall.Gettid())

			sem <- 0
		}()
	}

	for i := 0; i < num; i++ {
		<-sem
	}

	return
}

@happilymarrieddad, no it doesn't. There are only GOMAXPROCS possible running schedulers (unless one uses runtime.LockOSThread), but there can be thousands of native threads (if some syscall or C library blocks for a long time).