golang/go

runtime: change golang scheduler implementation to a lock-free stack

alphadose opened this issue · 17 comments

Limiting the number of goroutines and re-using them as much as possible will lead to a significant gain in performance in critical areas like net/http.

itogami is a thread-pool implementation that is the most efficient among all existing thread pools because it uses golang runtime internal functions like getg(), goready(), etc. for scheduling. Benchmarks supporting these claims are present in the repository itself.

The addition of a thread-pool to golang core will unlock lots of opportunities for solving performance-critical problems.

Also, since itogami already uses runtime internal functions, it would take little effort to integrate it with the core lib.

Limiting the number of goroutines and re-using them as much as possible will lead to a significant gain in performance in critical areas like net/http.

That is not obviously true. At least, it's not obvious to me.

And, sorry for being pedantic, do you mean a thread pool or a goroutine pool? After all, in Go threads and goroutines are different things.

CC @bcmills

You've marked this as a proposal but this seems like a pure performance issue, not a change in API. Performance improvements do not need to go through the proposal process.

You've marked this as runtime but I don't understand what would change in the runtime package.

Looking at the code, it's a goroutine pool. I don't see how things like net/http would work with this, by limiting the amount of running goroutines you've set a hard limit on the amount of concurrent work (connections) it can handle.

@ianlancetaylor it's a goroutine pool; I have used the words thread pool and goroutine pool interchangeably, which apparently created some confusion. My bad.

You've marked this as runtime but I don't understand what would change in the runtime package.

Again, my bad. I didn't really choose the words well. What I meant was the addition of this library to the sync package specifically.

Limiting goroutines for concurrency and aggressively re-using them leads to an improvement in performance.
This has been observed in golang HTTP libraries like fasthttp and gnet, which use their own internal goroutine pools and have better performance than net/http.

The problem with creating unlimited goroutines is that a new goroutine is allocated for every request, and the fire-and-forget model also increases the burden on the garbage collector, which has to collect and free the memory used by those goroutines; that is detrimental in the long run (especially for cases like HTTP servers). But if we use a goroutine pool as demonstrated by fasthttp and gnet, we can gain performance improvements by considerably reducing allocations as well as memory usage.
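For illustration, a minimal sketch of the goroutine-pool pattern being described here (not itogami's actual implementation): a fixed set of workers is reused for every submitted task instead of spawning a goroutine per request.

// Pool is a minimal fixed-size goroutine pool (illustrative sketch only).
type Pool struct {
	tasks chan func()
}

// NewPool starts n long-lived workers that are reused for every task.
func NewPool(n int) *Pool {
	p := &Pool{tasks: make(chan func())}
	for i := 0; i < n; i++ {
		go func() {
			for task := range p.tasks {
				task() // run the task on an existing goroutine
			}
		}()
	}
	return p
}

// Submit blocks until an idle worker picks up the task.
func (p *Pool) Submit(task func()) { p.tasks <- task }

// Close stops the workers after all submitted tasks are drained.
func (p *Pool) Close() { close(p.tasks) }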

Also, with infinite goroutines, more time is spent on context switching than on doing actual work, which affects latency.

@seankhliao

Looking at the code, it's a goroutine pool. I don't see how things like net/http would work with this, by limiting the amount of running goroutines you've set a hard limit on the amount of concurrent work (connections) it can handle.

True, a hard limit has been set on concurrent connections, but there are only so many CPU cores, and with infinite goroutines more time goes to context switching than to doing actual work, which affects latency and throughput. Hence, limiting concurrency will actually lead to better performance. It will also reduce the strain on the garbage collector considerably.

Bonus factor: since itogami is optimized for the CPU cache, retrieving goroutine pointers and calling goready() becomes faster and faster over time (consuming fewer CPU cycles), even faster than spawning a new goroutine via go func().

Limiting concurrency may lead to better performance but if we limit the number of possible simultaneous net/http connections that will break real programs. We can't make a change like that.

but if we limit the number of possible simultaneous net/http connections that will break real programs.

relevant issue #6785

^ Not limiting the number of possible simultaneous net/http requests also exhausts sockets and breaks programs.

isn't that why Transport.MaxConnsPerHost was added?
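For reference, Transport.MaxConnsPerHost is a client-side cap on an http.Transport, not a server-side concurrency limit. A minimal usage example (the value 100 is an arbitrary choice):

import "net/http"

// Further requests to a host block once 100 connections (dialing,
// active, or idle) to that host already exist.
var client = &http.Client{
	Transport: &http.Transport{
		MaxConnsPerHost: 100,
	},
}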

imo having no restrictions is good in theory, but hardware has its limits, and sooner or later a choke point will arrive while the application runs

We obviously couldn't enable a default server-side max concurrency limit, in any case. It would break use cases like servers implementing long-polling. I suppose it could be an option, but it would be incredibly rarely used, since it has significant sharp edges (e.g. if you include a library with a long-poll handler, it'll take down your entire server in relatively short order).


The notion of a fixed goroutine pool is dependent on it being expensive to "allocate" new goroutines. It seems much cleaner for the runtime to directly fix that cost (if it exists), rather than adding a new API that everyone will have to use.

As an example: if a goroutine runs off the end of its stack, and assuming that nothing else can have references to a goroutine once it's exited (right? with the possible exception of unsafe code, to which we don't owe any API compatibility promise), it seems relatively straightforward to reuse the memory for that goroutine on the next go statement.

(I suppose this is in some sense having the runtime itself pool and reuse goroutine structures, but with the advantage that it will transparently improve all programs as opposed to a subset)

Regarding long-polling: yes, this pool is ill-suited for that purpose. But it is useful for extremely short-lived connections, the kind we see in high-frequency trading servers as well as multiplayer game servers.

It seems much cleaner for the runtime to directly fix that cost (if it exists), rather than adding a new API that everyone will have to use.

Agreed, this would be the best approach. Recycling goroutines is better than fire-and-forget for high-throughput cases. Doing so would allow net/http to attain performance as good as fasthttp and gnet without breaking any API compatibility. The internal runtime pool should focus more on using pre-existing goroutines (if any) rather than allocating entirely new ones.

I was doubtful of itogami's claim to be really fast.

I then made a generator-based approach where the pool workers have reverse bindings to a closure from the consumer to get more tasks, which is about 1.95x faster in this benchmark:
alphadose/go-threadpool-benchmarks#1
This is also faster than "unlimited goroutines" while keeping the allocs very low.

If we remove the sleep statement from demofunc (because the generator is so fast that it actually spends most of its time sleeping), to purely target the pool overhead, this is 14x faster than unlimited goroutines.
Which is obvious, because at that point it is really just benchmarking atomic.AddUint64.

This is not a new pattern; it is simple and short, and variations of it (lazily generating the tasks when workers are not busy) can be seen in many codebases. I don't think there needs to be a library for this.
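A rough sketch of that generator/pull pattern (names are illustrative, not taken from the linked benchmark): workers call back into a generator closure to fetch their next task, instead of having tasks pushed to them.

import "sync"

// RunPool starts n workers that repeatedly pull work from gen.
// gen must be safe for concurrent use; ok == false stops a worker.
func RunPool(n int, gen func() (task func(), ok bool)) {
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			for {
				task, ok := gen()
				if !ok {
					return // generator exhausted
				}
				task()
			}
		}()
	}
	wg.Wait()
}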


The work-generator pattern has a different API which is slightly less convenient to use, but I don't think there are many apps that can't be rewritten to use it.

For example, since we are talking about net/http, the generator here would just call Accept. (I'm not saying this would be better than what is currently in net/http, just how it would look if you had to make it use a goroutine pool.)


My understanding is that in Go, slow things should be hard to do. That's why we can't cast a slice of structs to a slice of interfaces, or why select doesn't accept channel slices, ...

IMO, adding a goroutine pool to the std is making a slow thing easy to do.

As an example: if a goroutine runs off the end of its stack, and assuming that nothing else can have references to a goroutine once it's exited (right? with the possible exception of unsafe code, to which we don't owe any API compatibility promise), it seems relatively straightforward to reuse the memory for that goroutine on the next go statement.

The runtime already does this. See runtime/proc.go: newproc and goexit1.
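For readers who don't want to dig through runtime/proc.go: exited goroutines are cached on per-P free lists via gfput, and newproc tries gfget before allocating a fresh g. A self-contained illustration of the free-list idea (not the runtime's actual code):

import "sync"

type gDescriptor struct{ stack []byte } // stand-in for the runtime's g

type freeList struct {
	mu sync.Mutex
	gs []*gDescriptor
}

// put caches an exited goroutine's descriptor for later reuse.
func (fl *freeList) put(g *gDescriptor) {
	fl.mu.Lock()
	fl.gs = append(fl.gs, g)
	fl.mu.Unlock()
}

// get reuses a cached descriptor if one exists, else allocates anew.
func (fl *freeList) get() *gDescriptor {
	fl.mu.Lock()
	defer fl.mu.Unlock()
	if n := len(fl.gs); n > 0 {
		g := fl.gs[n-1]
		fl.gs = fl.gs[:n-1]
		return g
	}
	return &gDescriptor{stack: make([]byte, 2048)}
}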

As an aside, x/sync's errgroup just recently introduced a SetLimit method to put a cap on the number of goroutines. I'm not sure what additional benefit there would be from using runtime internal functions for that.
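For reference, the errgroup limit is used like this (process and the limit of 8 are placeholder examples):

import "golang.org/x/sync/errgroup"

func processAll(items []string, process func(string) error) error {
	var g errgroup.Group
	g.SetLimit(8) // at most 8 goroutines run at once; extra Go calls block
	for _, it := range items {
		it := it // capture the loop variable (needed before Go 1.22)
		g.Go(func() error { return process(it) })
	}
	return g.Wait() // returns the first non-nil error, if any
}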

@smasher164 the performance of itogami is better than x/sync.ErrGroup with SetLimit. Here are the benchmarks for errgroup

❯ go test -bench=. -benchmem *.go
goos: darwin
goarch: arm64
BenchmarkUnlimitedGoroutines-8   	       4	 298433114 ns/op	96399720 B/op	 2003818 allocs/op
BenchmarkErrGroup-8              	       2	 546261625 ns/op	120129120 B/op	 3001345 allocs/op
BenchmarkItogamiPool-8           	       4	 324749792 ns/op	26321712 B/op	 1055527 allocs/op

Here is the benchmarking code https://github.com/alphadose/go-threadpool-benchmarks/blob/master/general_test.go

[...] is making a slow thing easy to do.

This sentence/sentiment doesn't seem logically sound, as "slow" and "easy" aren't exactly comparable, especially since you seem to focus on fast; what do you want to express?

what do you want to express?

slow things should be hard to do

With golang 1.19, I made comparisons of the golang scheduler against the itogami scheduler for unlimited goroutines. Here are the benchstat results for 20 cases (code available here):

name                time/op
GolangScheduler-8    327ms ± 2%
ItogamiScheduler-8   334ms ± 3%

name                alloc/op
GolangScheduler-8   96.7MB ± 1%
ItogamiScheduler-8  16.0MB ± 0%

name                allocs/op
GolangScheduler-8    2.01M ± 0%
ItogamiScheduler-8   1.00M ± 0%

From the above results it is evident that the itogami scheduler takes almost the same time as the golang scheduler but consumes much less memory and makes half the number of allocations.

The current golang scheduler works by first acquiring the scheduler lock via lock(&sched.lock), then placing the goroutine pointer in a run queue via runqput()/globrunqput(), and finally releasing the lock via unlock(&sched.lock).

Here is a snippet from the golang codebase displaying this:

func goschedImpl(gp *g) {
	status := readgstatus(gp)
	if status&^_Gscan != _Grunning {
		dumpgstatus(gp)
		throw("bad g status")
	}
	casgstatus(gp, _Grunning, _Grunnable)
	dropg()
	lock(&sched.lock) // acquire scheduler lock
	globrunqput(gp) // place goroutine pointer in queue
	unlock(&sched.lock) // release lock

	schedule()
}

I propose changing the current golang scheduler implementation from lock + queue + unlock to a lock-free stack, the same as that of itogami (golang-scheduler branch); a sketch of the lock-free stack technique follows the list below.

There are 2 major benefits to this:

  1. Using a lock-free structure avoids the lock(&sched.lock)/unlock(&sched.lock) pair on every scheduling operation, keeping scheduling entirely in user space and avoiding the futex system calls (and the context switches into kernel space) that a contended lock incurs.
  2. Using a stack instead of a queue aggressively re-uses already-existing goroutines, leading to lower memory consumption as demonstrated in the benchmarks above. A stack also keeps the CPU caches warm, since the most recently parked goroutine is the first to run again, leading to lower scheduling latencies.
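For concreteness, here is a minimal sketch of the lock-free (Treiber) stack technique the proposal refers to, built on compare-and-swap. This shows the general idea rather than itogami's actual code, and it ignores the memory-reclamation/ABA concerns that a production version inside the runtime would have to handle.

import "sync/atomic"

type node struct {
	value any // e.g. a goroutine pointer in the scheduler analogy
	next  atomic.Pointer[node]
}

type LFStack struct {
	top atomic.Pointer[node]
}

// Push publishes a new top node with a CAS loop; no lock is ever taken.
func (s *LFStack) Push(v any) {
	n := &node{value: v}
	for {
		old := s.top.Load()
		n.next.Store(old)
		if s.top.CompareAndSwap(old, n) {
			return
		}
	}
}

// Pop removes the most recently pushed value (LIFO keeps caches warm).
func (s *LFStack) Pop() (any, bool) {
	for {
		old := s.top.Load()
		if old == nil {
			return nil, false // stack empty
		}
		if s.top.CompareAndSwap(old, old.next.Load()) {
			return old.value, true
		}
	}
}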

To further highlight this issue, I made another micro-benchmark test here, which does a simple recursive computation and sleeps for 1 microsecond in each call.
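For illustration, a hypothetical reconstruction of that kind of workload (the actual code is linked above and may differ):

import "time"

// demoFunc is a made-up example: a simple recursive computation that
// sleeps for 1 microsecond in each call, per the description above.
func demoFunc(n uint64) uint64 {
	time.Sleep(time.Microsecond)
	if n <= 1 {
		return 1
	}
	return n * demoFunc(n-1)
}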

Here are the results for 20 cases:

name                time/op
GolangScheduler-8   4.50ms ± 3%
ItogamiScheduler-8  3.88ms ± 7%

name                alloc/op
GolangScheduler-8   96.1kB ± 0%
ItogamiScheduler-8  16.0kB ± 0%

name                allocs/op
GolangScheduler-8    2.00k ± 0%
ItogamiScheduler-8   1.00k ± 0%

In this case specifically, it is evident that the itogami scheduler beats the current golang scheduler on all 3 metrics, including time.