runtime/cgo: immediately handoff P before returning to C host program
antJack opened this issue · 15 comments
What version of Go are you using (go version)?
$ go version
go version go1.19.3 linux/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/yongjie.yyj/.cache/go-build"
GOENV="/home/yongjie.yyj/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/yongjie.yyj/gopath/pkg/mod"
GONOPROXY="*.alipay-inc.com,*.alibaba-inc.com,*.alipay.com"
GONOSUMDB="*.alipay-inc.com,*.alibaba-inc.com,*.alipay.com"
GOOS="linux"
GOPATH="/home/yongjie.yyj/gopath"
GOPRIVATE="*.alipay-inc.com,*.alibaba-inc.com,*.alipay.com"
GOPROXY="https://goproxy.cn"
GOROOT="/home/yongjie.yyj/go1.19.3"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/yongjie.yyj/go1.19.3/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19.3"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1750707572=/tmp/go-build -gno-record-gcc-switches"
What did you do?
Recently we have been building our Go program as a shared library (.so) and running it inside a C host program via cgo, and we found that there is still room for optimization.
As the following demo shows, when Ps are scarce there is a delay between the cgo call returning and the background goroutine being scheduled.
// foo.go
package main

import "C"

import (
	"fmt"
	"sync/atomic"
	"time"
)

var ch = make(chan int64, 1)
var t, n int64

func init() {
	go func() { // background goroutine
		for {
			start := <-ch
			now := time.Now().UnixNano()
			atomic.AddInt64(&t, now-start) // from cgo returns to background goroutine being scheduled
			atomic.AddInt64(&n, 1)
		}
	}()
}

//export foo
func foo() {
	ch <- time.Now().UnixNano()
	// cgo returns
}

//export report
func report() {
	fmt.Println(atomic.LoadInt64(&t)/atomic.LoadInt64(&n), "ns")
}

func main() {}
// main.cc
#include <unistd.h>
#include "libfoo.h"

int main() {
    for (int i = 0; i < 10000; i++) {
        usleep(1000); // do something...
        foo();        // cgo call
    }
    sleep(1);
    report();
}
Run:
> go build -buildmode=c-shared -o libfoo.so foo.go   # build the Go code as a shared library
> gcc main.cc -lfoo -L./ -lpthread -o main -g   # build the C host program
> GOMAXPROCS=1 LD_LIBRARY_PATH=./ ./main   # run the demo with a single P
The demo above shows a scheduling delay between the cgo call returning and the background goroutine being scheduled. After going through the runtime code, we found that when the cgo call returns, reentersyscall changes the P's status to _Psyscall and leaves it waiting until sysmon retakes it, which leads to suboptimal performance.
If we hand off the P immediately after the cgo call returns, as shown in the related PR, we observe much better cgo performance.
$ GOMAXPROCS=1 LD_LIBRARY_PATH=./ ./main-before
14214 ns
$ GOMAXPROCS=1 LD_LIBRARY_PATH=./ ./main-after
7163 ns
Therefore, this issue and the related PR request that the runtime hand off the P immediately before a cgo call returns to the C host program, for better performance. However, how to determine whether the runtime is returning to the C host program or just making a normal syscall (which should not hand off the P) is still an open question. One possible way is to add a compiler directive such as //go:handoffp on the exported Go function.
What did you expect to see?
The background goroutine should be scheduled as soon as possible.
What did you see instead?
There is a delay between the cgo return and the background goroutine being scheduled, which leads to suboptimal performance.
Change https://go.dev/cl/455418 mentions this issue: runtime/cgo: immediately handoff P before returning to C host program
cc @golang/runtime
When calling cgo, we don't do immediate handoff under the assumption that often the C function will return quickly, making it advantageous to keep the P fast path.
For the inverse case, I can see intuitively how it is less obvious that the C code will call Go again very soon. Unfortunately these are all heuristics, so I don't know what behavior tends to be in practice. It presumably varies widely from program to program.
What if we introduce a new compiler directive that allows programmers to provide hints to the runtime about which cgo functions are likely to be called again very soon or not?
//export foo
//go:handoffp <- foo is *not* likely to be called soon, so the runtime can immediately release the P
func foo() {
	// xxx
}

//export bar
func bar() { // <- no hint, bar is likely to be called soon, just let P waiting on the fast path
	// xxx
}
Maybe I can try to work out a sample version in the next few days.
Or perhaps we can introduce a new env, just like GOMAXPROCS:
GOMAXPROCS=1 GOHANDOFFP=1 ./main
Although this approach does not provide precise control, it has the advantage of simplicity. The implementation can be discussed further, as there are plenty of ways to achieve it. But for now, I think we could give programmers a way to decide whether to hand off the P or not. What do you think?
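To make the proposal concrete, here is a minimal sketch of how such a switch might be interpreted. GOHANDOFFP is only the name suggested in this thread; Go does not read it, and the helper handoffOnReturn is hypothetical. The real runtime parses its environment itself rather than through the os package, so this is an illustration of the semantics only.

```go
package main

import (
	"fmt"
	"os"
)

// handoffOnReturn reports whether the hypothetical GOHANDOFFP setting asks
// for an immediate handoff. Only the value "1" enables it; anything else,
// including an unset variable, keeps the current behavior.
func handoffOnReturn(value string) bool {
	return value == "1"
}

func main() {
	// GOHANDOFFP is an assumption taken from the proposal above.
	fmt.Println("handoff on return:", handoffOnReturn(os.Getenv("GOHANDOFFP")))
}
```

A process-wide switch like this cannot distinguish one exported function from another, which is the loss of precision mentioned above.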
In my opinion, handing off the P immediately is better in most cases and may deserve to be the default behavior, since we cannot assume that C will call into Go again soon after the previous Go function returns to C.
A P is an expensive resource, and leaving it idle while it waits for the next call from C into Go wastes it.
If we think that most C functions return quickly, then it seems to me that handing off the P immediately is not better. Better to let the goroutine continue with its cached context.
Handing off the P immediately is better if the C function takes a long time.
So we have to make a judgement call. We've decided that we think that on average C functions tend to return quickly.
If the program is mainly driven by Go, then I agree that C functions tend to return quickly and we should not hand off the P. But if the program is mainly driven by C, with the Go part running as a library embedded in a C host program (which is why Go provides the build modes c-archive/c-shared), then the question changes from "how quickly will the C function return" to "how quickly will the C host program enter the Go library again". The point is that different build modes may lead to different answers.
Another point is that it may be better to give users a way to tune their program to their actual workload. Just as GOGC does, we could let users decide whether the runtime should hand off the P immediately. Otherwise, they can only passively accept the assumption that C functions tend to return quickly, whatever their program actually does.
Perhaps this could be detected (either for the callsite, or the C function)? Default to not handing off, and switch to handing off immediately if some threshold is reached.
The impact of handing off for short lived calls is relatively large - really don't want to do it if it is unnecessary.
I think this is just an unintentional bug in our implementation. When a C program calls into Go, we have to acquire an M, which acquires a P. When we return to C, the standard entersyscall marks the P as _Psyscall and stores it in m.oldp. Then because we are returning to C entirely, we release the M back to the needm pool. It doesn't make much sense for a released M to still be referencing the old P. Sure, another call into Go could get the M back and then get the P back, but it might not even be the same thread.
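The sequence described above can be followed in a toy model. This is a deliberately simplified sketch, not the runtime's code: the types proc and machine and the functions callFromC and returnToC are hypothetical stand-ins for a P, an M, and the two directions of a C-to-Go call.

```go
package main

import "fmt"

type pStatus int

const (
	pIdle pStatus = iota
	pRunning
	pSyscall
)

func (s pStatus) String() string {
	return [...]string{"_Pidle", "_Prunning", "_Psyscall"}[s]
}

type proc struct{ status pStatus } // stands in for a P

type machine struct { // stands in for an M
	p, oldp *proc
}

// callFromC models C calling into Go: the thread borrows an M, which
// acquires the P.
func callFromC(m *machine, p *proc) {
	p.status = pRunning
	m.p = p
}

// returnToC models the return path: the P is parked in _Psyscall and
// remembered in oldp, then the M goes back to the pool.
func returnToC(m *machine) {
	m.oldp = m.p
	m.p.status = pSyscall
	m.p = nil
	// The M is released at this point, yet oldp still points at the P.
}

func main() {
	p := &proc{status: pIdle}
	m := &machine{}
	callFromC(m, p)
	returnToC(m)
	fmt.Println("P after return to C:", p.status)
	fmt.Println("released M still references P:", m.oldp == p)
}
```

Running it reports the P in _Psyscall and the released M still holding a reference to it, which is the state the demo's background goroutine has to wait out until sysmon steps in.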
My apologies, I think I misunderstood the code earlier. I agree with @prattmic that this is a bug that we should fix.
OK, do we have any idea how to fix it? I can try to work it out in the related CL. Maybe we should take different actions in reentersyscall depending on the build mode?
It's not a question of the buildmode, it's a question of whether a Go function is returning to a C function that was not called by a Go function. I think that the dropm function can check m.oldp. If it is still in _Psyscall state, it can switch it to _Pidle state and call handoffp.
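A rough sketch of that check, written as ordinary Go rather than runtime code, might look as follows. The types proc and machine, the function dropM, and the counting handoff placeholder are all hypothetical; the point is only the compare-and-swap from the syscall state to idle before handing off.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const (
	pIdle uint32 = iota
	pRunning
	pSyscall
)

type proc struct{ status uint32 } // stands in for a P

type machine struct{ oldp *proc } // stands in for an M

// handedOff counts Ps made available to other work. handoff is only a
// placeholder for the runtime's handoffp.
var handedOff int

func handoff(p *proc) { handedOff++ }

// dropM models the suggested check. The compare-and-swap matters because
// the P may already have been retaken, in which case there is nothing to do.
func dropM(m *machine) {
	p := m.oldp
	m.oldp = nil
	if p != nil && atomic.CompareAndSwapUint32(&p.status, pSyscall, pIdle) {
		handoff(p)
	}
}

func main() {
	p := &proc{status: pSyscall}
	dropM(&machine{oldp: p})
	fmt.Println("P is idle:", atomic.LoadUint32(&p.status) == pIdle, "handoffs:", handedOff)

	// A P that was already retaken and now runs other work is left alone.
	q := &proc{status: pRunning}
	dropM(&machine{oldp: q})
	fmt.Println("handoffs:", handedOff)
}
```

In this model the first call hands the P off once, and the second call changes nothing because the P is no longer in the syscall state.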
It's worth noting that https://go.dev/cl/392854 is touching some of this code as well.
Yeah, we'd better not add it to dropm: dropm will be skipped entirely in CL 392854. I think it's better to check m.isextra instead.