cloudwego/goref

goref panics when attached to a process with about 500 MB memory usage

oilbeater opened this issue · 15 comments

Describe the bug

goref panics when I attach to k3s to inspect its memory usage.

To Reproduce

Steps to reproduce the behavior:

  1. build k3s in DEBUG mode to enable DWARF debug info (rough commands below)
  2. run the k3s server
  3. grf attach ${pid}
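
A rough sketch of these steps (the DEBUG flag is an assumption about the k3s build scripts; any build that keeps DWARF info will do):

# build with DWARF info kept (DEBUG here is an assumed build knob)
DEBUG=true make
# start the server, then attach goref to its pid
./dist/artifacts/k3s server &
grf attach $(pidof k3s)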

It panics with this output:

2024-07-23T05:57:20Z error layer=debugger could not resolve parametric type of s: parametric type without a dictionary
2024-07-23T05:57:20Z error layer=debugger could not resolve parametric type of s: parametric type without a dictionary
panic: runtime error: makeslice: len out of range

goroutine 1 [running]:
github.com/cloudwego/goref/pkg/proc.cacheMemory({0xa7d078, 0xc0001a4e70}, 0x0, 0xf876c00000000)
        /root/workspace/github.com/goref/pkg/proc/mem.go:79 +0x145
github.com/cloudwego/goref/pkg/proc.(*HeapScope).readType(0xc01cd763c0, 0xc00c3469c0, 0xc003e64734, 0xc003d79c88, 0xc003d7a200)
        /root/workspace/github.com/goref/pkg/proc/heap.go:330 +0x173
github.com/cloudwego/goref/pkg/proc.(*HeapScope).copyGCMask(0xc01cd763c0, 0xc00c3469c0, 0xc003d79c80)
        /root/workspace/github.com/goref/pkg/proc/heap.go:302 +0x96
github.com/cloudwego/goref/pkg/proc.(*ObjRefScope).findObject(0xc054c1ba78, 0xc003d79f28, {0xa81d00, 0xc00046e080}, {0xa7cdf8, 0xc01b581600})
        /root/workspace/github.com/goref/pkg/proc/objects.go:69 +0xfd
github.com/cloudwego/goref/pkg/proc.(*ObjRefScope).findRef(0xc054c1ba78, 0xc0280866e0, 0x0)
        /root/workspace/github.com/goref/pkg/proc/objects.go:168 +0xe77
github.com/cloudwego/goref/pkg/proc.ObjectReference(0xc00012e0f0, {0x9c2328, 0x7})
        /root/workspace/github.com/goref/pkg/proc/objects.go:449 +0xc35
github.com/cloudwego/goref/cmd/grf/cmds.execute(0xd2e2, {0x0, 0x0}, {0x0, 0x0}, {0x9c2328, 0x7}, 0xc00018e500)
        /root/workspace/github.com/goref/cmd/grf/cmds/commands.go:139 +0x2ac
github.com/cloudwego/goref/cmd/grf/cmds.attachCmd(0xc0001ba100?, {0xc00003ebc0?, 0x1?, 0x9beeb9?})
        /root/workspace/github.com/goref/cmd/grf/cmds/commands.go:108 +0xf9
github.com/spf13/cobra.(*Command).execute(0xc000027208, {0xc00003eb70, 0x1, 0x1})
        /root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:987 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0xc000026f08)
        /root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0xc0000061c0?)
        /root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039 +0x13
main.main()
        /root/workspace/github.com/goref/cmd/grf/main.go:22 +0x1a

I manually disabled the cache by setting cacheEnabled to false in the source code, and then got this panic:

2024-07-23T06:12:05Z error layer=debugger could not resolve parametric type of s: parametric type without a dictionary
2024-07-23T06:12:05Z error layer=debugger could not resolve parametric type of s: parametric type without a dictionary
panic: runtime error: index out of range [32] with length 32

goroutine 1 [running]:
github.com/cloudwego/goref/pkg/proc.(*HeapScope).readType(0xc0205da1e0, 0xc01c0da9c0, 0xc003e64734, 0xc003d79c88, 0xc003d7a200)
        /root/go/pkg/mod/github.com/cloudwego/goref@v0.0.0-20240722091010-3519d085465e/pkg/proc/heap.go:352 +0x34d
github.com/cloudwego/goref/pkg/proc.(*HeapScope).copyGCMask(0xc0205da1e0, 0xc01c0da9c0, 0xc003d79c80)
        /root/go/pkg/mod/github.com/cloudwego/goref@v0.0.0-20240722091010-3519d085465e/pkg/proc/heap.go:302 +0x96
github.com/cloudwego/goref/pkg/proc.(*ObjRefScope).findObject(0xc05e20fa78, 0xc003d79f28, {0xa81b40, 0xc01fb18340}, {0xa7cc38, 0xc005aee480})
        /root/go/pkg/mod/github.com/cloudwego/goref@v0.0.0-20240722091010-3519d085465e/pkg/proc/reference.go:69 +0xf5
github.com/cloudwego/goref/pkg/proc.(*ObjRefScope).findRef(0xc05e20fa78, 0xc00726e0f0, 0x0)
        /root/go/pkg/mod/github.com/cloudwego/goref@v0.0.0-20240722091010-3519d085465e/pkg/proc/reference.go:168 +0xe77
github.com/cloudwego/goref/pkg/proc.ObjectReference(0xc0000bc0f0, {0x9c21e8, 0x7})
        /root/go/pkg/mod/github.com/cloudwego/goref@v0.0.0-20240722091010-3519d085465e/pkg/proc/reference.go:449 +0xc35
github.com/cloudwego/goref/cmd/grf/cmds.execute(0xd2e2, {0x0, 0x0}, {0x0, 0x0}, {0x9c21e8, 0x7}, 0xc000198500)
        /root/go/pkg/mod/github.com/cloudwego/goref@v0.0.0-20240722091010-3519d085465e/cmd/grf/cmds/commands.go:139 +0x2ac
github.com/cloudwego/goref/cmd/grf/cmds.attachCmd(0xc0001e0100?, {0xc000118b80?, 0x1?, 0x9bed79?})
        /root/go/pkg/mod/github.com/cloudwego/goref@v0.0.0-20240722091010-3519d085465e/cmd/grf/cmds/commands.go:108 +0xf9
github.com/spf13/cobra.(*Command).execute(0xc000126f08, {0xc000118b30, 0x1, 0x1})
        /root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:987 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0xc000126c08)
        /root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0xc0000061c0?)
        /root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039 +0x13
main.main()
        /root/go/pkg/mod/github.com/cloudwego/goref@v0.0.0-20240722091010-3519d085465e/cmd/grf/main.go:22 +0x1a

Expected behavior

goref can generate the flame graph without panicking.


Goref version:

The latest master commit (module version v0.0.0-20240722091010-3519d085465e, per the paths in the second stack trace).

Environment:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.2.linux-amd64'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.2.linux-amd64/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.2'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/root/go/pkg/mod/github.com/cloudwego/goref@v0.0.0-20240722091010-3519d085465e/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1310706020=/tmp/go-build -gno-record-gcc-switches'

Additional context

I am not sure whether this issue is related to the large memory usage (over 500 MB RES). If so, is there any advice on how to scan the memory usage of an application with a lot of memory in use?


500 MB of RES memory is fine; theoretically, goref has no memory usage limit. This problem is likely due to a bug in the support for the Go 1.22 allocation-headers feature. I will analyze this issue. In the meantime, you can build with GOEXPERIMENT=noallocheaders to disable the allocation-headers feature and analyze the process first.
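
For example, assuming the target is built directly with go build (for a make-driven build like k3s, export the variable so it reaches the Go toolchain):

# disable the Go 1.22 allocation-headers feature at compile time
GOEXPERIMENT=noallocheaders go build -o k3s .
# or, for a make-driven build, export it first:
export GOEXPERIMENT=noallocheaders
make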

Thanks for the reply.

"GOEXPERIMENT=noallocheaders"

After enabling this option, goref no longer crashes. However, it now fails to complete within 30 minutes, and the goref process consistently uses about 1.5 CPU cores even though my machine has 8. It appears to be stuck in a loop or some sort of repetitive process.

And this message still appears:

2024-07-23T06:12:05Z error layer=debugger could not resolve parametric type of s: parametric type without a dictionary
2024-07-23T06:12:05Z error layer=debugger could not resolve parametric type of s: parametric type without a dictionary

Could you give me the executable file and a core file generated by the gcore command? I'd like to reproduce it in my environment.
Compress them with tar -zcvf issue12 ./exec ./core.xx before sending.
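
For example (gcore ships with gdb and by default writes core.<pid> into the current directory; ./k3s stands in for your executable):

# dump a core of the running process (it is only paused briefly)
gcore ${pid}                # produces ./core.${pid}
tar -zcvf issue12 ./k3s ./core.${pid}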

It may also be because the cache is disabled. You can re-enable the cache and try again.

Ah, my fault. After re-enabling the cache it finishes in seconds. Thanks!

Could you please provide the executable file and the core file? I can't reproduce the issue on my test service. Also, may I ask whether you have changed the Go source code? I think this is unlikely to occur with the original Go code.

May I ask which version of Go you used to compile k3s? @oilbeater

@jayantxie I pushed a fork of k3s with my edits here: https://github.com/oilbeater/k3s/commit/22433fe8e025501a6ad0ff057009aed6c0f650f5

It uses Go 1.22.4 to compile. You can try the build steps here https://github.com/oilbeater/k3s/blob/main/BUILDING.md with

mkdir -p build/data && make download && make generate
SKIP_VALIDATE=true make

then

./dist/artifacts/k3s server

to start the server.

k3s uses lots of build options and zstd to optimize binary size; maybe some of those options conflict with goref.

Could you please provide the executable file and the core file? That should be the fastest way for me to reproduce it.
[screenshot: build failure output]
I tried to build it in my devbox, but it failed...

I uploaded the binary and core here: https://github.com/oilbeater/k3s/releases/tag/issue

go build -ldflags="-s -w"
Did you build k3s with these build flags? If so, you could remove them, since we can't get debug info from the executable file :(
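
For context, in the Go linker -s omits the symbol table and -w omits the DWARF debug info that goref relies on. A quick way to check whether a binary still carries DWARF (assuming binutils is available):

# look for .debug_* sections; no output means the binary was stripped
readelf -S ./k3s | grep -i debug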

It looks like there are some changes from upstream that I wasn't aware of that affect the build. I tried again with the ldflags removed and GOEXPERIMENT=noallocheaders added: https://github.com/oilbeater/k3s/releases/download/issue/issue12

It should contain the debug_info now:

file k3s
k3s: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=CDXATxDA4Z1gGwhpmDVa/9QqoWhX95GDV5BzCafog/1Lk0SLTATqRJBUM99qa6/o8REHiUI9JW65O03UrvW, with debug_info, not stripped

However, this time goref again fails to run to completion.

[screenshot] A very strange issue: the memory values of the runtime variable `mheap_` are incorrect.

I guess the executable file you sent does not correspond to the core file.
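
One way to check the pair locally (assuming gdb is installed) is to load them together; gdb prints a warning such as "core file may not match specified executable file" on a mismatch:

gdb ./k3s ./core.xx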

Sorry, I was still changing the build options to get goref working when I dumped that core. Here is the new core file: https://github.com/oilbeater/k3s/releases/download/issue/issue13

Fixed, you can try again.