runtime: fatal error: unknown caller pc
karalabe opened this issue · 24 comments
We've received a GC crash report from a Windows 10 user, running our Ethereum client built with Go 1.8.3 (amd64). We've never seen this error ever, so I doubt we can make a repro out of it, but here's the crash logs if you can make heads or tails of it (you can find the full crash dump at ethereum/go-ethereum#14725):
runtime: unexpected return pc for net.(*netFD).writeTo called from 0x80002000720013
fatal error: unknown caller pc
runtime stack:
runtime.throw(0xea8762, 0x11)
C:/go/src/runtime/panic.go:596 +0x9c
runtime.gentraceback(0xffffffffffffffff, 0xc044903698, 0x0, 0xc0440544e0, 0x0, 0x0, 0x7fffffff, 0xf405e8, 0x45c2f978, 0x0, ...)
C:/go/src/runtime/traceback.go:317 +0x13f8
runtime.copystack(0xc0440544e0, 0x2000, 0x0)
C:/go/src/runtime/stack.go:932 +0x377
runtime.shrinkstack(0xc0440544e0)
C:/go/src/runtime/stack.go:1186 +0x18c
runtime.scanstack(0xc0440544e0, 0xc042015228)
C:/go/src/runtime/mgcmark.go:762 +0x386
runtime.scang(0xc0440544e0, 0xc042015228)
C:/go/src/runtime/proc.go:866 +0x210
runtime.markroot.func1()
C:/go/src/runtime/mgcmark.go:264 +0x7b
runtime.systemstack(0x45c2fda0)
C:/go/src/runtime/asm_amd64.s:343 +0xb5
runtime.markroot(0xc042015228, 0xc00000006b)
C:/go/src/runtime/mgcmark.go:269 +0x1e9
runtime.gcDrain(0xc042015228, 0x6)
C:/go/src/runtime/mgcmark.go:1051 +0xcc
runtime.gcBgMarkWorker.func2()
C:/go/src/runtime/mgc.go:1530 +0x101
runtime.systemstack(0xc0429f9700)
C:/go/src/runtime/asm_amd64.s:327 +0x7e
runtime.mstart()
C:/go/src/runtime/proc.go:1132
goroutine 19 [GC worker (idle)]:
runtime.systemstack_switch()
C:/go/src/runtime/asm_amd64.s:281 fp=0xc042439f50 sp=0xc042439f48
runtime.gcBgMarkWorker(0xc042014000)
C:/go/src/runtime/mgc.go:1537 +0x218 fp=0xc042439fd8 sp=0xc042439f50
runtime.goexit()
C:/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc042439fe0 sp=0xc042439fd8
created by runtime.gcBgMarkStartWorkers
C:/go/src/runtime/mgc.go:1412 +0x9f
The only slightly odd thing I see in the full backtrace is
goroutine 84 [runnable, locked to thread]:
syscall.Syscall6(0x7ffcc6d84ab0, 0x4, 0x54, 0xc044539884, 0x1, 0xc044539780, 0x0, 0x0, 0x1, 0x0, ...)
C:/go/src/runtime/syscall_windows.go:174 +0x6b
syscall.(*Proc).Call(0xc0440062a0, 0xc042dd2180, 0x4, 0x4, 0x20, 0xd36820, 0x1483201, 0xc042dd2180)
C:/go/src/syscall/dll_windows.go:153 +0x4dc
syscall.(*LazyProc).Call(0xc0421d1e60, 0xc042dd2180, 0x4, 0x4, 0x1, 0x0, 0x14832e0, 0xc0423560b0)
C:/go/src/syscall/dll_windows.go:303 +0x66
github.com/ethereum/go-ethereum/vendor/github.com/peterh/l iner.(*State).readNext(0xc0421daa80, 0xc044539ae0, 0x2, 0x 20, 0xc044539a60)
C:/gopath/src/github.com/ethereum/go-ethereum/vend or/github.com/peterh/liner/input_windows.go:178 +0x120
Why is a goroutine running in syscall6 marked as runnable?
We should probably force GOTRACEBACK=system for this crash, so that we can see the runtime functions.
In fact, since this problem seems somewhat repeatable, can we see a stack trace with GOTRACEBACK=system set in the environment?
I asked for the GOTRACEBACK=system traceback on the original issue.
For reference, here's the trace of the goroutine with the bad return PC:
goroutine 140524 [runnable (scan)]:
net.runtime_pollWait(0x4c442b8, 0x77, 0x0)
C:/go/src/runtime/netpoll.go:164 +0x60
net.(*pollDesc).wait(0xc0421cefa0, 0x77, 0x0, 0x0)
C:/go/src/net/fd_poll_runtime.go:75 +0x3f
net.(*ioSrv).ExecIO(0xc0420fa698, 0xc0421cef00, 0xe91b9f, 0x9, 0xf40140, 0x4, 0x4, 0x765f)
C:/go/src/net/fd_windows.go:196 +0xfe
net.(*netFD).writeTo(0xc0421cee00, 0xc04697a1a0, 0x8b, 0xc3, 0x1481420, 0xc042bd6b40, 0x0, 0x0, 0x0)
C:/go/src/net/fd_windows.go:561 +0x17f
created by github.com/ethereum/go-ethereum/p2p/discover.(*Table).bondall
C:/gopath/src/github.com/ethereum/go-ethereum/p2p/discover/table.go:432 +0xac
It, of course, stops when it hits the bad return PC from writeTo (since it's just printing the traceback, it moves on instead of panicking).
here's the trace of the goroutine with the bad return PC:
FYI. To muddy the waters even more
548 func (fd *netFD) writeTo(buf []byte, sa syscall.Sockaddr) (int, error) {
549 if len(buf) == 0 {
550 return 0, nil
551 }
552 if err := fd.writeLock(); err != nil {
553 return 0, err
554 }
555 defer fd.writeUnlock()
556 o := &fd.wop
557 o.InitBuf(buf)
558 o.sa = sa
559 n, err := wsrv.ExecIO(o, "WSASendto", func(o *operation) error {
560 return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)
561 })
562 if _, ok := err.(syscall.Errno); ok {
563 err = os.NewSyscallError("wsasendto", err)
564 }
565 return n, err
566 }
Normal windows syscalls use memory passed to it as parameters during the call only. But syscall.WSASendto here just makes record of "&o.buf" and "&o.o" and returns immediately. Then Windows actually uses memory later to send its contents to the network. Windows signals when it finishes via GetQueuedCompletionStatusEx. So the "o.buf" and "o.o" should not be moved until after line 561. So, hopefully, they are allocated on the heap.
Alex
@alexbrainman, good call. However, while we're clearly skating on thin ice, it appears to be okay right now:
net/fd_windows.go:548: leaking param: fd
net/fd_windows.go:548: from fd (passed to call[argument escapes]) at net/fd_windows.go:552
Got an update with GOTRACEBACK=system on the original thread. The full traceback is in ethereum/go-ethereum#14725 (comment).
This time it's a different, though clearly related failure. It looks like netFD.writeTo is actually attempting to return and crashing. Though, oddly, while the crashing PC is a RET, the faulting PC looks to be valid. If something clobbered the saved return PC, I would have expected for the faulting address and PC to be the same.
unexpected fault address 0xffffffffffffffff
fatal error: fault
[signal 0xc0000005 code=0x0 addr=0xffffffffffffffff pc=0x5cb4f6]
goroutine 7117 [running]:
runtime.throw(0xe8c58f, 0x5)
C:/go/src/runtime/panic.go:596 +0x9c fp=0xc042c83860 sp=0xc042c83840
runtime.sigpanic()
C:/go/src/runtime/signal_windows.go:164 +0xff fp=0xc042c83890 sp=0xc042c83860
net.(*netFD).writeTo(0xc042221dc0, 0xc042468340, 0x8b, 0xc3, 0x1481420, 0xc04908e4c0, 0x8b, 0x0, 0x0)
C:/go/src/net/fd_windows.go:565 +0x1e6 fp=0xc042c83898 sp=0xc042c83890
created by github.com/ethereum/go-ethereum/p2p/discover.(*Table).bondall
C:/gopath/src/github.com/ethereum/go-ethereum/p2p/discover/table.go:432 +0xac
Leading up to fd_windows.go:565 +0x1e6:
0x01bf 00447 (.../net/fd_windows.go:565) MOVQ CX, "".~r2+176(FP)
0x01c7 00455 (.../net/fd_windows.go:565) MOVQ AX, "".~r3+184(FP)
0x01cf 00463 (.../net/fd_windows.go:565) MOVQ DX, "".~r3+192(FP)
0x01d7 00471 (.../net/fd_windows.go:565) PCDATA $0, $9
0x01d7 00471 (.../net/fd_windows.go:565) XCHGL AX, AX
0x01d8 00472 (.../net/fd_windows.go:565) CALL runtime.deferreturn(SB)
0x01dd 00477 (.../net/fd_windows.go:565) MOVQ 112(SP), BP
0x01e2 00482 (.../net/fd_windows.go:565) ADDQ $120, SP
0x01e6 00486 (.../net/fd_windows.go:565) RET
Too late for 1.9.
I'm not making much progress here, but a few observations:
-
In the original crash we're shrinking the stack with the
net.(*netFD).writeTocall. We crash when walking the new stack to adjust pointers, so the return IP is already corrupted by this point. However, we don't know if it was corrupted by the stack copy, or already corrupted on the original stack (we've already installed the new stack at this point, so the panic traceback is also walking the new stack). I imaginenet.(*netFD).writeToisn't on the stack for very long, which would suggest that the stack copy itself did the corruption, but the second crash suggests that it's not the stack copy. -
In the original crash (the one that happened while shrinking
net.(*netFD).writeTo's stack), the return PC is 0x80002000720013. That looks an awful lot like an array of small uint16s: [0x13, 0x72, 0x20, 0x80]. Probably not UTF-16 text given the 0x80. Unfortunately, in the second crash it's completely unclear what the corrupted value is, or how we can even crash in that manner on the RET (unless Windows is doing something really fancy like detecting that we just returned to a bad PC and using the branch trace buffer to back us up to a useful PC.. but then failing to tell us the bad PC). -
@alexbrainman, I was looking at the docs for
WSASendToand noticed they say "Use NULL for lpNumberOfBytesSent if the lpOverlapped parameter is not NULL to avoid potentially erroneous results." I'm not sure what the "erroneous results" could be, but it looks like we're not following this advice here:
return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)
@alexbrainman, I was looking at the docs for WSASendTo and noticed they say "Use NULL for lpNumberOfBytesSent if the lpOverlapped parameter is not NULL to avoid potentially erroneous results." I'm not sure what the "erroneous results" could be, but it looks like we're not following this advice here:
return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)
I don't see how it is possible to use WSASendTo with lpNumberOfBytesSent set to NULL. From https://msdn.microsoft.com/en-us/library/windows/desktop/ms741693(v=vs.85).aspx - "... Overlapped Socket I/O
If an overlapped operation completes immediately, WSASendTo returns a value of zero and the lpNumberOfBytesSent parameter is updated with the number of bytes sent. If the overlapped operation is successfully initiated and will complete later, WSASendTo returns SOCKET_ERROR and indicates error code WSA_IO_PENDING. In this case, lpNumberOfBytesSent is not updated. When the overlapped operation completes the amount of data transferred is indicated either through the cbTransferred parameter in the completion routine (if specified), or through the lpcbTransfer parameter in WSAGetOverlappedResult. ..." Also see "Example Code" at the end of the doco - it uses lpNumberOfBytesSent just as we do.
Alex
Timed out in state WaitingForInfo. Closing.
(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
Not sure if this right place to put this and my apologies if not. I'm seeing a similar error on my machine anytime I attempt to execute "go run". Specifically, the error I'm seeing is runtime: unexpected return pc for runtime.goexit called from 0x0. If this is something that would be helpful I'm happy to provide whatever details you'd like. Thanks!
@JCShelby That error can have many different root causes. If you're willing I would encourage you to open a new issue about it, ideally with instructions for how to reproduce the problem. Thanks.
Circling back on this. The original issue was closed by OP. @karalabe - have you seen any more of these crashes with newer Go versions like 1.12 ?
Haven't seen this since then, so I guess something must have fixed it either on our side or Go side.
I'm going to close this then. Feel free to re-ping/re-open it if the problem comes back.
Just got a crash from "fuzzit.dev" where I am running continuous fuzz testing of my compression packages. Go 1.12.10 was used for building.
There is no assembly or "unsafe" involved so there shouldn't be any reasonable way for memory corruption. This fuzzing is also strictly running in a single goroutine, so races also seems unlikely.
That said I have no idea about the hardware stability of the servers running the tests.
Also, a lot of new code has just been added here, so there is a chance of something bad, though I don't know how I would be able to trigger this error
(crash removed, see next comment)
I will see if it reproduces before pinging anyone :)
I now have a second crash 10 days later (see above). The tested binary is the same.
Both crashes are here: https://gist.github.com/klauspost/d4ec7bd6ecefa1bec56dd8ca4ac8ec39
It is completely different functions that were pre-empted (flate.(*fastGen).matchlenLong) vs. flate.(*decompressor).Read - completely different code. Both crashes were in mgcmark.go:711. I will upgrade the fuzz test to Go 1.13.
@bradfitz Just had a Fuzz crash on Go 1.13.3 - https://gist.github.com/klauspost/d4ec7bd6ecefa1bec56dd8ca4ac8ec39#file-orgs_klauspost_targets_compress-flate_jobs_k0lulnasftbj78a8iqxf_log-txt
So 3 crashes in total in the same fuzz test, but otherwise seemingly unrelated.
This looks slightly different. The original issue had problems originating from the syscall in net.(*netFD).writeTo. This seems to bubble up from runtime.gcBgMarkWorker and runtime.gopark.
Perhaps you want to open a new issue ?
@klauspost, can you file a new bug with that?
Gah .. crossed responses.
This looks like the stack has been trashed somehow.
Not only the return address for gopark. gopark's arguments also look trashed. The gcBgMarkWorker failure looks similar, hard to tell for sure if its args are trashed as it has only one arg.
Not sure what might cause this. Could be misuse of unsafe, could be runtime bug (use after free of stack memory?).
It would be really nice to have something more repeatable.
Added #35158
@randall77 This does not use unsafe and the only assembler would be in the stdlib. Imports here: https://godoc.org/github.com/klauspost/compress/flate?imports
"sync" is only used for a sync.Once and there are no goroutines, so no races should happen either.
Fuzz test imports: https://godoc.org/github.com/klauspost/compress-fuzz/flate?imports