golang/go

runtime/cgo: immediately handoff P before returning to C host program

antJack opened this issue · 15 comments

What version of Go are you using (go version)?

$ go version
go version go1.19.3 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/yongjie.yyj/.cache/go-build"
GOENV="/home/yongjie.yyj/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/yongjie.yyj/gopath/pkg/mod"
GONOPROXY="*.alipay-inc.com,*.alibaba-inc.com,*.alipay.com"
GONOSUMDB="*.alipay-inc.com,*.alibaba-inc.com,*.alipay.com"
GOOS="linux"
GOPATH="/home/yongjie.yyj/gopath"
GOPRIVATE="*.alipay-inc.com,*.alibaba-inc.com,*.alipay.com"
GOPROXY="https://goproxy.cn"
GOROOT="/home/yongjie.yyj/go1.19.3"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/yongjie.yyj/go1.19.3/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19.3"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1750707572=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Recently we're building our go program as dynamic linking lib(.so) and run it on a C-embedded program using cgo, and we found that there is still room for optimization.

As shown in the following demo, under the condition of limited P's resource, there's some delay between the cgo returns and the background goroutine being scheduled.

// foo.go
package main

import "C"
import (
	"fmt"
	"sync/atomic"
	"time"
)

var ch = make(chan int64, 1)
var t, n int64

func init() {
	go func() { // background goroutine
		for {
			start := <-ch
			now := time.Now().UnixNano()
			atomic.AddInt64(&t, now-start) // from cgo returns to background goroutine being scheduled
			atomic.AddInt64(&n, 1)
		}
	}()
}

//export foo
func foo() {
	ch <- time.Now().UnixNano() 
        // cgo returns
}

//export report
func report() {
	fmt.Println(atomic.LoadInt64(&t) / atomic.LoadInt64(&n), "ns")
}

func main() {}
// main.cc
#include <unistd.h>

#include "libfoo.h"

int main() {
	for(int i = 0; i < 10000; i++) {
		usleep(1000); // do somethings...
                foo();   // cgo call
	}
	sleep(1);
	report();
}

run

> go build -buildmode=c-shared -o libfoo.so foo.go     // build go as dynamic link lib

> gcc main.cc -lfoo -L./ -lpthread -o main -g        // build C host program

> GOMAXPROCS=1 LD_LIBRARY_PATH=./ ./main         // run demo under limited P's resource

The above demo indicates that there's some schedule delay between the cgo returns and the background goroutine being scheduled. After going through runtime code, we found that when the cgo returns, reentersyscall changes P's status to _Psyscall and left it waiting until sysmon retake, which leading to sub-optimized performance.

If we try to handoff p immediately after cgo returns, as shown in the related pr, we can observe much better cgo performance.

$ GOMAXPROCS=1 LD_LIBRARY_PATH=./ ./main-before
14214 ns

$ GOMAXPROCS=1 LD_LIBRARY_PATH=./ ./main-after
7163 ns

Therefore, this issue and the related pr request changes that the runtime could handoff p immediately before cgo returns to the C host program for better performance. However, how to determine whether it's returning to C host program or it's just a normal syscall (should not handoff p) is still a question. A possible way is to add compiler directive such as //go:handoffp on the exported go function?

What did you expect to see?

the background goroutine should be scheduled as soon as possible

What did you see instead?

there is some delay between cgo return and the background goroutine being scheduled, leading to sub-optimized performance.

Change https://go.dev/cl/455418 mentions this issue: runtime/cgo: immediately handoff P before returning to C host program

cc @golang/runtime

When calling cgo, we don't do immediate handoff under the assumption that often the C function will return quickly, making it advantageous to keep the P fast path.

For the inverse case, I can see intuitively how it is less obvious that the C code will call Go again very soon. Unfortunately these are all heuristics, so I don't know what behavior tends to be in practice. It presumably varies widely from program to program.

What if we introduce a new compiler directive that allows programmers to provide hints to the runtime about which cgo functions are likely to be called again very soon or not?

//export foo
//go:handoffp      <- foo is *not* likely to be called soon, so the runtime can immediately release the P
func foo() {
    // xxx
}

//export bar
func bar() {     //  <- no hint, bar is likely to be called soon, just let P waiting on the fast path
    // xxx
}

Maybe I can try to work out a sample version in the few next days.

Or perhaps we can introduce a new env, just like GOMAXPROCS:

GOMAXPROCS=1 GOHANDOFFP=1 ./main

Although this approach does not provide precise control, it has the advantage of simplicity. The implementation can be further discussed, there is plenty of ways to achieve it. But for now, I think we can provide opportunity that allows programmers to decide whether to handoff P or not. What do you think?

In my opinion, I think handoff P immediately is better in most cases. It may deserve the default behavior.
Since we can not assume there will be another C call Go soon, after the previous Go function returns to C.

P is an expensive resource, we'd better not waste it, it's wasting P while it's waiting for another C call Go.

If we think that most C functions return quickly, then it seems to me that handing off the P immediately is not better. Better to let the goroutine continue with its cached context.

Handing off the P immediately is better if the C function takes a long time.

So we have to make a judgement call. We've decided that we think that on average C functions tend to return quickly.

If our program is mainly driven by Go, then I do agree that C functions tend to return quickly and we should not handoff the P. But if the program is mainly driven by C, the Go part is a library that runs embeddedly on a C host program (that's why Go provides build mode c-archive/c-shared), then the question may change from "how quickly the C function will return" to "how quickly the C host program will enter the Go lib again". The point is that different build modes may lead to different answers.

Another point is that maybe it's better to provide ways for users to tune their program according to their real situation. Just like what GOGC does, we can also let users determine whether the runtime should immediately handoff P or not. Otherwise, they can only passively accept the assumption that C functions tend to return quickly, no matter what they actually do.

mpx commented

Perhaps this could be detected (either for the callsite, or the C function)? Default to not handing off, and switch to handing off immediately if some threshold is reached.

The impact of handing off for short lived calls is relatively large - really don't want to do it if it is unnecessary.

I think this is just an unintentional bug in our implementation. When a C program calls into Go, we have to acquire an M, which acquires a P. When we return to C, the standard entersyscall marks the P as _Psyscall and stores it in m.oldp. Then because we are returning to C entirely, we release the M back to the needm pool. It doesn't make much sense for a released M to still be referencing the old P. Sure, another call into Go could get the M back and then get the P back, but it might not even be the same thread.

My apologies, I think I misunderstood the code earlier. I agree with @prattmic that this is a bug that we should fix.

ok, do we have any idea on how to fix it? I can try to work it out in on the related cl. Maybe we should take different actions according to build-mode in reentersyscall?

It's not a question of the buildmode, it's a question of whether a Go function is returning to a C function that was not called by a Go function. I think that the dropm function can check m.oldp. If it is still in _Psyscall state, it can switch it to _Pidle state and call handoffp.

PTAL, I've moved the handoffp into the dropm in CL455418

It's worth noting that https://go.dev/cl/392854 is touching some of this code as well.

Yeah, we'd better not add it to dropm, dropm will be skipped totally in CL 392854

I think it's better to check m.isextra instead.