creachadair/jrpc2

rare panic in client due to race between `stop` and `handleRequest`

Lachann opened this issue · 5 comments

We've observed infrequent panics like this in our setup, which seem to be coming from inside jrpc2:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x5176d9]

goroutine 22 [running]:
github.com/creachadair/jrpc2.(*Client).handleRequest(0xc0000c6000, 0xc0000c81b0)
	/home/lachann/go/pkg/mod/github.com/creachadair/jrpc2@v0.15.0/client.go:114 +0x59
github.com/creachadair/jrpc2.(*Client).deliver(0xc0000c6000, 0xc0000c81b0)
	/home/lachann/go/pkg/mod/github.com/creachadair/jrpc2@v0.15.0/client.go:127 +0x556
github.com/creachadair/jrpc2.(*Client).accept.func1(0xc0000c6000, 0xc00000e040)
	/home/lachann/go/pkg/mod/github.com/creachadair/jrpc2@v0.15.0/client.go:94 +0x95
created by github.com/creachadair/jrpc2.(*Client).accept
	/home/lachann/go/pkg/mod/github.com/creachadair/jrpc2@v0.15.0/client.go:90 +0x335

We've been unable to reliably replicate the exact scenario leading to this panic, but by examining the code, I believe I can see a possible route in which it might be happening:

The panic happens in handleRequest at this point:
https://github.com/creachadair/jrpc2/blob/default/client.go#L114
It seems to c.ch.Send is called here on nil c.ch. This shouldn't be possible, as the only place that c.ch is set to nil is within stop, and both stop and handleRequest are mutexed with c.mu (handleRequest via deliver and goroutine in accept):
https://github.com/creachadair/jrpc2/blob/default/client.go#L90-L96

However, I think it might be possible that if a request or notification comes from the server just as the client is being stopped, it might start the goroutine in accept, which then correctly blocks on c.mu.Lock(), as Close is holding the lock while stop runs. However, once stop finishes and the mutex unlocks, deliver is immediately called from the goroutine, which then calls handleRequest, which attempts c.ch.Send on c.ch which was just set to nil by stop.

I haven't been able to reliably reproduce this, but I've made a simple, contrived client-server setup (attached client-server.zip) in which I can observe this panic happening live. Note that the client part closes the jrpc2.Client in a goroutine right before attempting a call to the server:

  go func() {
    cli.Close()
  }() 
  
  var res string
  err = cli.CallResult(context.TODO(), "foo.Msg", nil, &res)
  if err != nil {
    fmt.Println("error in call", err)
    os.Exit(1)
  }

99% of the time, this results in a context cancelled error from cli.CallResult and a clean exit of the program. However, when I run the client binary in a tight loop (e.g. while true; do ./client; done), I observe this exact panic every once in a while.

First off, thank you very much for this detailed analysis, and I apologize for the trouble.

Your diagnosis seems to be on the right track, I'll investigate further and update this thread.

The server checks for channel shutdown before delivering responses, but the client didn't originally need to do so, since it only occurs for callbacks. I believe the correct fix is to do as the server does, as represented by this patch:

diff --git a/client.go b/client.go
index 540babf..0ebff4d 100644
--- a/client.go
+++ b/client.go
@@ -109,6 +109,8 @@ func (c *Client) handleRequest(msg *jmessage) {
 		}
 	} else if c.scall == nil {
 		c.log("Discarding callback request: %v", msg)
+	} else if c.ch == nil {
+		c.log("Client channel is closed; discarding callback: %v", msg)
 	} else {
 		bits := c.scall(msg)
 		if err := c.ch.Send(bits); err != nil {

Would you be willing to patch your client.go with this change and see if it still reproduces for you? I have not yet been able to trigger the panic on my system, but I believe this does correctly fix the A-B problem you identified.

Edit: This is patchable as #48.

I've tested the patch in my sample code and it seems to solve the issue. I haven't encountered any panics in a few minutes of running the client in a tight loop like I described, whereas before it would happen every few seconds. I've also seen the client has been stopped error being returned from the call, which I believe is the correct response in that situation and allows for it to be handled cleanly.

Thank you for quick response, as usual :)

Great, thank you. I'll merge this fix, then. I don't really have a good way to add a regression test for it, but if I think of something clever I'll add it later.

The fix is now tagged at and after v0.15.1.