runtime: fatal error: unknown caller pc

Question

runtime: fatal error: unknown caller pc

karalabe opened this issue 8 years ago · 24 comments

We've received a GC crash report from a Windows 10 user, running our Ethereum client built with Go 1.8.3 (amd64). We've never seen this error ever, so I doubt we can make a repro out of it, but here's the crash logs if you can make heads or tails of it (you can find the full crash dump at ethereum/go-ethereum#14725):

runtime: unexpected return pc for net.(*netFD).writeTo called from 0x80002000720013
fatal error: unknown caller pc

runtime stack:
runtime.throw(0xea8762, 0x11)
        C:/go/src/runtime/panic.go:596 +0x9c
runtime.gentraceback(0xffffffffffffffff, 0xc044903698, 0x0, 0xc0440544e0, 0x0, 0x0, 0x7fffffff, 0xf405e8, 0x45c2f978, 0x0, ...)
        C:/go/src/runtime/traceback.go:317 +0x13f8
runtime.copystack(0xc0440544e0, 0x2000, 0x0)
        C:/go/src/runtime/stack.go:932 +0x377
runtime.shrinkstack(0xc0440544e0)
        C:/go/src/runtime/stack.go:1186 +0x18c
runtime.scanstack(0xc0440544e0, 0xc042015228)
        C:/go/src/runtime/mgcmark.go:762 +0x386
runtime.scang(0xc0440544e0, 0xc042015228)
        C:/go/src/runtime/proc.go:866 +0x210
runtime.markroot.func1()
        C:/go/src/runtime/mgcmark.go:264 +0x7b
runtime.systemstack(0x45c2fda0)
        C:/go/src/runtime/asm_amd64.s:343 +0xb5
runtime.markroot(0xc042015228, 0xc00000006b)
        C:/go/src/runtime/mgcmark.go:269 +0x1e9
runtime.gcDrain(0xc042015228, 0x6)
        C:/go/src/runtime/mgcmark.go:1051 +0xcc
runtime.gcBgMarkWorker.func2()
        C:/go/src/runtime/mgc.go:1530 +0x101
runtime.systemstack(0xc0429f9700)
        C:/go/src/runtime/asm_amd64.s:327 +0x7e
runtime.mstart()
        C:/go/src/runtime/proc.go:1132

goroutine 19 [GC worker (idle)]:
runtime.systemstack_switch()
        C:/go/src/runtime/asm_amd64.s:281 fp=0xc042439f50 sp=0xc042439f48
runtime.gcBgMarkWorker(0xc042014000)
        C:/go/src/runtime/mgc.go:1537 +0x218 fp=0xc042439fd8 sp=0xc042439f50
runtime.goexit()
        C:/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc042439fe0 sp=0xc042439fd8
created by runtime.gcBgMarkStartWorkers
        C:/go/src/runtime/mgc.go:1412 +0x9f

Answer 1 · 2017-06-29T15:17:02.000Z

@aclements?

Answer 2 · 2017-06-29T15:37:42.000Z

The only slightly odd thing I see in the full backtrace is

goroutine 84 [runnable, locked to thread]:
syscall.Syscall6(0x7ffcc6d84ab0, 0x4, 0x54, 0xc044539884,                 0x1, 0xc044539780, 0x0, 0x0, 0x1, 0x0, ...)
        C:/go/src/runtime/syscall_windows.go:174 +0x6b
syscall.(*Proc).Call(0xc0440062a0, 0xc042dd2180, 0x4, 0x4,                 0x20, 0xd36820, 0x1483201, 0xc042dd2180)
        C:/go/src/syscall/dll_windows.go:153 +0x4dc
syscall.(*LazyProc).Call(0xc0421d1e60, 0xc042dd2180, 0x4,                 0x4, 0x1, 0x0, 0x14832e0, 0xc0423560b0)
        C:/go/src/syscall/dll_windows.go:303 +0x66
github.com/ethereum/go-ethereum/vendor/github.com/peterh/l                iner.(*State).readNext(0xc0421daa80, 0xc044539ae0, 0x2, 0x                20, 0xc044539a60)
        C:/gopath/src/github.com/ethereum/go-ethereum/vend                or/github.com/peterh/liner/input_windows.go:178 +0x120

Why is a goroutine running in syscall6 marked as runnable?

We should probably force GOTRACEBACK=system for this crash, so that we can see the runtime functions.

In fact, since this problem seems somewhat repeatable, can we see a stack trace with GOTRACEBACK=system set in the environment?

Answer 3 · 2017-06-29T16:25:26.000Z

I asked for the GOTRACEBACK=system traceback on the original issue.

For reference, here's the trace of the goroutine with the bad return PC:

goroutine 140524 [runnable (scan)]:
net.runtime_pollWait(0x4c442b8, 0x77, 0x0)
        C:/go/src/runtime/netpoll.go:164 +0x60
net.(*pollDesc).wait(0xc0421cefa0, 0x77, 0x0, 0x0)
        C:/go/src/net/fd_poll_runtime.go:75 +0x3f
net.(*ioSrv).ExecIO(0xc0420fa698, 0xc0421cef00, 0xe91b9f, 0x9, 0xf40140, 0x4, 0x4, 0x765f)
        C:/go/src/net/fd_windows.go:196 +0xfe
net.(*netFD).writeTo(0xc0421cee00, 0xc04697a1a0, 0x8b, 0xc3, 0x1481420, 0xc042bd6b40, 0x0, 0x0, 0x0)
        C:/go/src/net/fd_windows.go:561 +0x17f
created by github.com/ethereum/go-ethereum/p2p/discover.(*Table).bondall
        C:/gopath/src/github.com/ethereum/go-ethereum/p2p/discover/table.go:432 +0xac

It, of course, stops when it hits the bad return PC from writeTo (since it's just printing the traceback, it moves on instead of panicking).

Answer 4 · 2017-06-30T01:37:40.000Z

here's the trace of the goroutine with the bad return PC:

FYI. To muddy the waters even more

548 func (fd *netFD) writeTo(buf []byte, sa syscall.Sockaddr) (int, error) {
549         if len(buf) == 0 {
550                 return 0, nil
551         }
552         if err := fd.writeLock(); err != nil {
553                 return 0, err
554         }
555         defer fd.writeUnlock()
556         o := &fd.wop
557         o.InitBuf(buf)
558         o.sa = sa
559         n, err := wsrv.ExecIO(o, "WSASendto", func(o *operation) error {
560                 return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)
561         })
562         if _, ok := err.(syscall.Errno); ok {
563                 err = os.NewSyscallError("wsasendto", err)
564         }
565         return n, err
566 }

Normal windows syscalls use memory passed to it as parameters during the call only. But syscall.WSASendto here just makes record of "&o.buf" and "&o.o" and returns immediately. Then Windows actually uses memory later to send its contents to the network. Windows signals when it finishes via GetQueuedCompletionStatusEx. So the "o.buf" and "o.o" should not be moved until after line 561. So, hopefully, they are allocated on the heap.

Alex

Answer 5 · 2017-06-30T14:18:21.000Z

@alexbrainman, good call. However, while we're clearly skating on thin ice, it appears to be okay right now:

net/fd_windows.go:548: leaking param: fd
net/fd_windows.go:548:  from fd (passed to call[argument escapes]) at net/fd_windows.go:552

Answer 6 · 2017-06-30T14:27:21.000Z

Got an update with GOTRACEBACK=system on the original thread. The full traceback is in ethereum/go-ethereum#14725 (comment).

This time it's a different, though clearly related failure. It looks like netFD.writeTo is actually attempting to return and crashing. Though, oddly, while the crashing PC is a RET, the faulting PC looks to be valid. If something clobbered the saved return PC, I would have expected for the faulting address and PC to be the same.

unexpected fault address 0xffffffffffffffff
fatal error: fault
[signal 0xc0000005 code=0x0 addr=0xffffffffffffffff pc=0x5cb4f6]

goroutine 7117 [running]:
runtime.throw(0xe8c58f, 0x5)
        C:/go/src/runtime/panic.go:596 +0x9c fp=0xc042c83860 sp=0xc042c83840
runtime.sigpanic()
        C:/go/src/runtime/signal_windows.go:164 +0xff fp=0xc042c83890 sp=0xc042c83860
net.(*netFD).writeTo(0xc042221dc0, 0xc042468340, 0x8b, 0xc3, 0x1481420, 0xc04908e4c0, 0x8b, 0x0, 0x0)
        C:/go/src/net/fd_windows.go:565 +0x1e6 fp=0xc042c83898 sp=0xc042c83890
created by github.com/ethereum/go-ethereum/p2p/discover.(*Table).bondall
        C:/gopath/src/github.com/ethereum/go-ethereum/p2p/discover/table.go:432 +0xac

Leading up to fd_windows.go:565 +0x1e6:

        0x01bf 00447 (.../net/fd_windows.go:565)     MOVQ    CX, "".~r2+176(FP)
        0x01c7 00455 (.../net/fd_windows.go:565)     MOVQ    AX, "".~r3+184(FP)
        0x01cf 00463 (.../net/fd_windows.go:565)     MOVQ    DX, "".~r3+192(FP)
        0x01d7 00471 (.../net/fd_windows.go:565)     PCDATA  $0, $9
        0x01d7 00471 (.../net/fd_windows.go:565)     XCHGL   AX, AX
        0x01d8 00472 (.../net/fd_windows.go:565)     CALL    runtime.deferreturn(SB)
        0x01dd 00477 (.../net/fd_windows.go:565)     MOVQ    112(SP), BP
        0x01e2 00482 (.../net/fd_windows.go:565)     ADDQ    $120, SP
        0x01e6 00486 (.../net/fd_windows.go:565)     RET

Answer 7 · 2017-08-03T18:31:31.000Z

Too late for 1.9.

Answer 8 · 2017-11-28T21:06:36.000Z

I'm not making much progress here, but a few observations:

In the original crash we're shrinking the stack with the net.(*netFD).writeTo call. We crash when walking the new stack to adjust pointers, so the return IP is already corrupted by this point. However, we don't know if it was corrupted by the stack copy, or already corrupted on the original stack (we've already installed the new stack at this point, so the panic traceback is also walking the new stack). I imagine net.(*netFD).writeTo isn't on the stack for very long, which would suggest that the stack copy itself did the corruption, but the second crash suggests that it's not the stack copy.
In the original crash (the one that happened while shrinking net.(*netFD).writeTo's stack), the return PC is 0x80002000720013. That looks an awful lot like an array of small uint16s: [0x13, 0x72, 0x20, 0x80]. Probably not UTF-16 text given the 0x80. Unfortunately, in the second crash it's completely unclear what the corrupted value is, or how we can even crash in that manner on the RET (unless Windows is doing something really fancy like detecting that we just returned to a bad PC and using the branch trace buffer to back us up to a useful PC.. but then failing to tell us the bad PC).
@alexbrainman, I was looking at the docs for WSASendTo and noticed they say "Use NULL for lpNumberOfBytesSent if the lpOverlapped parameter is not NULL to avoid potentially erroneous results." I'm not sure what the "erroneous results" could be, but it looks like we're not following this advice here:

return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)

Answer 9 · 2017-11-28T21:45:26.000Z

@alexbrainman, I was looking at the docs for WSASendTo and noticed they say "Use NULL for lpNumberOfBytesSent if the lpOverlapped parameter is not NULL to avoid potentially erroneous results." I'm not sure what the "erroneous results" could be, but it looks like we're not following this advice here:

return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)

I don't see how it is possible to use WSASendTo with lpNumberOfBytesSent set to NULL. From https://msdn.microsoft.com/en-us/library/windows/desktop/ms741693(v=vs.85).aspx - "... Overlapped Socket I/O
If an overlapped operation completes immediately, WSASendTo returns a value of zero and the lpNumberOfBytesSent parameter is updated with the number of bytes sent. If the overlapped operation is successfully initiated and will complete later, WSASendTo returns SOCKET_ERROR and indicates error code WSA_IO_PENDING. In this case, lpNumberOfBytesSent is not updated. When the overlapped operation completes the amount of data transferred is indicated either through the cbTransferred parameter in the completion routine (if specified), or through the lpcbTransfer parameter in WSAGetOverlappedResult. ..." Also see "Example Code" at the end of the doco - it uses lpNumberOfBytesSent just as we do.

Alex

Answer 10 · 2017-12-28T16:56:49.000Z

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

Answer 11 · 2018-04-30T16:56:58.000Z

Not sure if this right place to put this and my apologies if not. I'm seeing a similar error on my machine anytime I attempt to execute "go run". Specifically, the error I'm seeing is runtime: unexpected return pc for runtime.goexit called from 0x0. If this is something that would be helpful I'm happy to provide whatever details you'd like. Thanks!

Answer 12 · 2018-04-30T18:24:46.000Z

@JCShelby That error can have many different root causes. If you're willing I would encourage you to open a new issue about it, ideally with instructions for how to reproduce the problem. Thanks.

Answer 13 · 2018-05-03T01:14:29.000Z

Thanks for the feedback and apologies for any inconvenience. I actually ended up resolving this by reinstalling my environment, thinking it might have been a corrupted installation. That seems to have fixed things but if I see this happen again I’ll be sure to open an issue as directed. Thanks again for the help! Respectfully, Joseph Shelby Sr. Infrastructure Engineer, IT Platform Architecture zoom: https://asurion.zoom.us/my/josephshelby | slack: @JCShelby voice: 615.445.1240 | mobile: 615.210.6139 mail: Asurion I 648 Grassmere Park, Nashville, TN 37211 www.asurion.com<http://www.asurion.com/> I Twitter<https://twitter.com/Asurion> I Facebook<https://www.facebook.com/Asurion> From: Ian Lance Taylor [mailto:notifications@github.com] Sent: Monday, April 30, 2018 13:26 To: golang/go <go@noreply.github.com> Cc: Shelby, Joseph <Joseph.Shelby@asurion.com>; Mention <mention@noreply.github.com> Subject: Re: [golang/go] runtime: fatal error: unknown caller pc (#20846) @JCShelby<https://github.com/jcshelby> That error can have many different root causes. If you're willing I would encourage you to open a new issue about it, ideally with instructions for how to reproduce the problem. Thanks. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#20846 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AR1zoMRzloOcGD7Er-lq-AiTCuYse3jLks5tt1crgaJpZM4OI9NU>.

…

________________________________ This message (including any attachments) contains confidential and/or privileged information. It is intended for a specific individual and purpose and is protected by law. If you are not the intended recipient, please notify the sender immediately and delete this message. Any disclosure, copying, or distribution of this message, or the taking of any action based on it, is strictly prohibited.

Answer 14 · 2019-03-15T05:14:23.000Z

Circling back on this. The original issue was closed by OP. @karalabe - have you seen any more of these crashes with newer Go versions like 1.12 ?

Answer 15 · 2019-03-15T09:29:38.000Z

Haven't seen this since then, so I guess something must have fixed it either on our side or Go side.

Answer 16 · 2019-03-15T14:00:09.000Z

I'm going to close this then. Feel free to re-ping/re-open it if the problem comes back.

Answer 17 · 2019-10-12T14:31:30.000Z

Just got a crash from "fuzzit.dev" where I am running continuous fuzz testing of my compression packages. Go 1.12.10 was used for building.

There is no assembly or "unsafe" involved so there shouldn't be any reasonable way for memory corruption. This fuzzing is also strictly running in a single goroutine, so races also seems unlikely.

That said I have no idea about the hardware stability of the servers running the tests.

Also, a lot of new code has just been added here, so there is a chance of something bad, though I don't know how I would be able to trigger this error

(crash removed, see next comment)

I will see if it reproduces before pinging anyone :)

Answer 18 · 2019-10-24T10:33:56.000Z

I now have a second crash 10 days later (see above). The tested binary is the same.

Both crashes are here: https://gist.github.com/klauspost/d4ec7bd6ecefa1bec56dd8ca4ac8ec39

It is completely different functions that were pre-empted (flate.(*fastGen).matchlenLong) vs. flate.(*decompressor).Read - completely different code. Both crashes were in mgcmark.go:711. I will upgrade the fuzz test to Go 1.13.

Answer 19 · 2019-10-25T16:30:08.000Z

@bradfitz Just had a Fuzz crash on Go 1.13.3 - https://gist.github.com/klauspost/d4ec7bd6ecefa1bec56dd8ca4ac8ec39#file-orgs_klauspost_targets_compress-flate_jobs_k0lulnasftbj78a8iqxf_log-txt

So 3 crashes in total in the same fuzz test, but otherwise seemingly unrelated.

Answer 20 · 2019-10-25T16:39:19.000Z

This looks slightly different. The original issue had problems originating from the syscall in net.(*netFD).writeTo. This seems to bubble up from runtime.gcBgMarkWorker and runtime.gopark.

Perhaps you want to open a new issue ?

Answer 21 · 2019-10-25T16:39:26.000Z

@klauspost, can you file a new bug with that?

Answer 22 · 2019-10-25T16:40:34.000Z

Gah .. crossed responses.

Answer 23 · 2019-10-25T16:57:46.000Z

This looks like the stack has been trashed somehow.
Not only the return address for gopark. gopark's arguments also look trashed. The gcBgMarkWorker failure looks similar, hard to tell for sure if its args are trashed as it has only one arg.

Not sure what might cause this. Could be misuse of unsafe, could be runtime bug (use after free of stack memory?).

It would be really nice to have something more repeatable.

Answer 24 · 2019-10-25T17:07:36.000Z

Added #35158

@randall77 This does not use unsafe and the only assembler would be in the stdlib. Imports here: https://godoc.org/github.com/klauspost/compress/flate?imports

"sync" is only used for a sync.Once and there are no goroutines, so no races should happen either.

Fuzz test imports: https://godoc.org/github.com/klauspost/compress-fuzz/flate?imports