runtime: severe performance drop for cgo calls in go1.22.5
zeroshade opened this issue · 11 comments
Go version
go version go1.22.5 linux/amd64
Output of go env in your module/workspace:
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/zeroshade/.cache/go-build'
GOENV='/home/zeroshade/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/zeroshade/go/pkg/mod'
GONOPROXY='github.com/voltrondata/*'
GONOSUMDB='github.com/voltrondata/*'
GOOS='linux'
GOPATH='/home/zeroshade/go'
GOPRIVATE='github.com/voltrondata/*'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go-1.22'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go-1.22/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.5'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/zeroshade/Projects/GitHub/arrow-adbc/go/adbc/go.mod'
GOWORK='/home/zeroshade/Projects/GitHub/go.work'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build279063330=/tmp/go-build -gno-record-gcc-switches'
What did you do?
Building a shared library with Go that is loaded and called from C. I was able to reproduce the problem with the following trivial example:
package main
import "C"
//export trivialFunctionCall
func trivialFunctionCall(v uintptr) C.int {
return C.int(v)
}
func main() {}
Build with go build -o libtrivial.so -buildmode=c-shared main.go
#include "libtrivial.h"
#include <stdio.h>
int main() {
int a = 0;
for (int i = 0; i < 500; ++i) {
a = trivialFunctionCall((size_t)i);
}
printf("Done %d\n", a);
}
Build with gcc -o trivial trivial.c -ltrivial -L. -Wl,-rpath,.
Then we run and record the performance with perf record --call-graph lbr -- ./trivial and look at the report with perf report -g
What did you see happen?
The overwhelming majority of the time in cgocall is spent in __isoc99_sscanf, as seen here:
+ 97.82% 0.00% trivial trivial [.] _start
+ 97.82% 0.00% trivial libc.so.6 [.] __libc_start_main
+ 97.82% 0.00% trivial libc.so.6 [.] 0x0000729379829d8e
+ 97.82% 0.00% trivial trivial [.] main
- 97.82% 0.00% trivial libtrivial.so [.] trivialFunctionCall
trivialFunctionCall
crosscall2
runtime.cgocallback.abi0
runtime.cgocallbackg.abi0
- runtime.cgocallbackg
- 96.56% runtime.callbackUpdateSystemStack
runtime.asmcgocall.abi0
- x_cgo_getstackbound
- 94.01% pthread_getattr_np
+ 76.00% __isoc99_sscanf
+ 9.14% __getdelim
2.96% getline
+ 2.06% fopen
+ 1.97% fclose
+ 1.88% cfree
2.56% pthread_getattr_np@plt
+ 1.25% runtime.exitsyscall
+ 97.82% 0.00% trivial libtrivial.so [.] crosscall2
+ 97.82% 0.00% trivial libtrivial.so [.] runtime.cgocallback.abi0
+ 97.82% 0.00% trivial libtrivial.so [.] runtime.cgocallbackg.abi0
+ 97.82% 0.00% trivial libtrivial.so [.] runtime.cgocallbackg
+ 96.56% 2.56% trivial libc.so.6 [.] pthread_getattr_np
+ 96.56% 0.00% trivial libtrivial.so [.] runtime.callbackUpdateSystemStack
+ 96.56% 0.00% trivial libtrivial.so [.] runtime.asmcgocall.abi0
+ 96.56% 0.00% trivial libtrivial.so [.] x_cgo_getstackbound
+ 76.00% 0.00% trivial libc.so.6 [.] __isoc99_sscanf
+ 42.52% 0.00% trivial libc.so.6 [.] 0x00007293798629e0
+ 16.45% 0.00% trivial libc.so.6 [.] 0x0000729379865265
+ 16.45% 0.00% trivial libc.so.6 [.] __strtoull_internal
+ 10.59% 3.14% trivial libc.so.6 [.] __getdelim
If we instead build using GOTOOLCHAIN=go1.22.4 go build -o libtrivial.so -buildmode=c-shared main.go and re-run, the cost of callbackUpdateSystemStack and pthread_getattr_np doesn't even register in the performance trace:
+ 92.66% 92.66% trivial libtrivial.so [.] runtime.execute
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.mallocgc
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.gcTrigger.test
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.morestack.abi0
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.newstack
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.goschedImpl
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.schedule
+ 7.34% 0.00% trivial libtrivial.so [.] runtime.sigtramp.abi0
+ 7.34% 0.00% trivial libtrivial.so [.] runtime.sigtrampgo
+ 6.78% 6.78% trivial libtrivial.so [.] runtime.sighandler
+ 0.56% 0.56% trivial libtrivial.so [.] runtime.sigfwdgo
0.00% 0.00% trivial libtrivial.so [.] runtime.cgoSigtramp.abi0
The problem was introduced by 3560cf0afb, which forces all cgo callbacks to update the system stack bounds. Ultimately the problem traces down to the glibc implementation of pthread_getattr_np, which is called from go/src/runtime/cgo/gcc_stack_unix.c (line 28 at 792a261). In glibc, pthread_getattr_np ends up directly parsing /proc/self/maps using sscanf, hence the high cost, as seen here: https://github.com/lattera/glibc/blob/master/nptl/pthread_getattr_np.c#L81-L124
It looks like musl's implementation of pthread_getattr_np isn't as expensive as glibc's, though.
For the full, in-context explanation see apache/arrow-adbc#2021. In context, we're seeing a test run go from ~2s to over 5.5s due to this issue, and we're concerned it may end up affecting consumers of the library.
What did you expect to see?
Shifting from go1.22.4 to go1.22.5 shouldn't come with a drop in cgo's FFI performance so significant that, when calling a Go shared-library function from C, the cost is dominated by calls to sscanf whenever the underlying Go code itself isn't expensive enough. While the reproducer is deliberately trivial to exemplify the problem, in the context of the actual library the cgo calls are non-trivial, involving multiple network connection initializations, data processing, and more, yet several of the calls end up completely dominated by the sscanf call in pthread_getattr_np.
Either we need to find more situations where we can avoid fully updating the stack bounds via x_cgo_getstackbound, or there needs to be a less expensive way to get those bounds than pthread_getattr_np when using glibc, given the high cost of its parsing /proc/self/maps.
Related Issues and Documentation
- runtime/cgo: immediately handoff P before returning to C host program #57103
- cmd/go: go build crashes #16180 (closed)
- runtime: simple programs crash on linux/386 with go1.21 when build with -gcflags='all=-N -l' #61975 (closed)
- cmd/link: panic opening c-shared DLL on windows #26084 (closed)
- runtime: with GDB, callstack falls into infinite loop with CGO on GOOS=linux GOARCH=arm64 #37238 (closed)
- runtime: cgo call to symbol from library loaded dynamically will panic with go 1.21.1 and ld >2.38 #63264 (closed)
- cmd/link: -buildmode=pie -linkshared generates incorrect code #25970 (closed)
- cmd/compile: functions are inlined beyond boundary of shared libraries #21510 (closed)
- runtime: panic when executing from multiple c-shared libraries #65050 (closed)
- runtime: unable to backtrace program using Go shared object #52608 (closed)
It also looks like the Rust devs are running into this problem too: https://internals.rust-lang.org/t/who-is-doing-read-proc-self-maps-1024-at-startup/17348
Thanks for the report. From the Rust post it sounds like pthread_getattr_np only does the read when it is called for the main thread. Could you try calling Go from a different thread, a pthread-created one?
@cherrymui I only included the Rust post because it was referring to the same scenario of calling pthread_getattr_np. In hindsight it probably wasn't great to include, as the problem here is frequency (which is not the issue in the Rust post).
In the context of the original issue, the library is being called through Python loading the shared library and then making calls into it. As this is a client library that we provide, we don't have control over whether consumers call from the main thread or other threads. I can try calling from another thread and see whether that mitigates the issue, but that wouldn't be a viable solution IMO, as we can't expect all calls to the shared library to be funneled through a separate thread instead of being able to use the main thread. Right?
Yes, calling from a different thread is just to validate the assumption. Calling on the main thread should not be slow either, so if that's the case we'll look into a targeted solution. Thanks.
Looks like you're right!
I updated the code from the reproducer to the following:
#include "libtrivial.h"
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void* f(void* arg) {
int a = 0;
for (int i = 0; i < 500; ++i) {
a = trivialFunctionCall((size_t)i);
}
printf("Done %d\n", a);
pthread_exit(NULL);
}
int main() {
pthread_t thr;
int rc;
if ((rc = pthread_create(&thr, NULL, &f, NULL)) != 0) {
fprintf(stderr, "error: pthread_create: rc: %d\n", rc);
return EXIT_FAILURE;
}
pthread_join(thr, NULL);
}
After running the result under perf record -g and checking the report, sscanf no longer shows up at all. So this only needs to be handled for the main thread: the slowdown appears only when calling from the main thread, not from other threads.
In triage, we're thinking we're probably stuck with having to work around this by caching the stack bounds one way or another.
@cherrymui @mknyszek any updates here?
Sorry, I haven't had a chance to implement this yet. My plan is that if we can get accurate stack bounds, which is the common case with glibc, we get them once per thread and stick with them. If we cannot get accurate bounds (on systems where pthread_getattr_np is not available), we'll use an estimate every time, but computing the estimate is not costly. I'll probably combine the work for this with #68285.
Thanks for the update! That's about what I was expecting the fix would end up being.