soheilhy/cmux

cmux leaks goroutines

runcom opened this issue · 19 comments

Hey there, we're using cmux to serve HTTP and gRPC together but our application is receiving a very large amount of requests (both gRPC and HTTP). We see tons of goroutines leaking around like:

   813 goroutine 754 [IO wait, 215 minutes]:
   814 internal/poll.runtime_pollWait(0x7eff86b282b0, 0x72, 0x0)
   815         /usr/lib/golang/src/runtime/netpoll.go:173 +0x57
   816 internal/poll.(*pollDesc).wait(0xc420329718, 0x72, 0xffffffffffffff00, 0x341d980, 0x340d       888)
   817         /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:85 +0xae
   818 internal/poll.(*pollDesc).waitRead(0xc420329718, 0xc42172a000, 0x1000, 0x1000)
   819         /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:90 +0x3d
   820 internal/poll.(*FD).Read(0xc420329700, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
   821         /usr/lib/golang/src/internal/poll/fd_unix.go:125 +0x18a
   822 net.(*netFD).Read(0xc420329700, 0xc42172a000, 0x1000, 0x1000, 0x13ce, 0xc420956000, 0x13       ce)
   823         /usr/lib/golang/src/net/fd_unix.go:202 +0x52
   824 net.(*conn).Read(0xc4207b2158, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
   825         /usr/lib/golang/src/net/net.go:176 +0x6d
   826 github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux.(*bufferedReader).       Read(0xc4207be380, 0xc42172a000, 0x1000, 0x1000, 0xc42098f8c8, 0x74db11, 0xc4202e9688)
   827         /root/go/src/github.com/kubernetes-incubator/cri-o/_output/src/github.com/kubern       etes-incubator/cri-o/vendor/github.com/soheilhy/cmux/buffer.go:53 +0x144
   828 github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux.(*MuxConn).Read(0x       c4207be370, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x13ce, 0x1c6e)
   829         /root/go/src/github.com/kubernetes-incubator/cri-o/_output/src/github.com/kubern       etes-incubator/cri-o/vendor/github.com/soheilhy/cmux/cmux.go:259 +0x4f
   830 net/http.(*connReader).Read(0xc4202e9680, 0xc42172a000, 0x1000, 0x1000, 0xc420460380, 0x       1cee4a0, 0xc420460380)
   831         /usr/lib/golang/src/net/http/server.go:753 +0x105
   832 bufio.(*Reader).fill(0xc4200599e0)
   833         /usr/lib/golang/src/bufio/bufio.go:97 +0x11a
   834 bufio.(*Reader).ReadSlice(0xc4200599e0, 0x46860a, 0x48d7dd, 0xc42098fa20, 0xfffffffe1104       c0ba, 0xc42098fa60, 0x48da77)
   835         /usr/lib/golang/src/bufio/bufio.go:338 +0x2c
   836 bufio.(*Reader).ReadLine(0xc4200599e0, 0x100, 0xf8, 0x2157540, 0x468676, 0x22002098faa8,        0xf8)
   837         /usr/lib/golang/src/bufio/bufio.go:367 +0x34
   838 net/textproto.(*Reader).readLineSlice(0xc4202e96b0, 0xc42098fb20, 0xc42098fb20, 0x41b6d8       , 0x100, 0x2157540)
   839         /usr/lib/golang/src/net/textproto/reader.go:55 +0x70
   840 net/textproto.(*Reader).ReadLine(0xc4202e96b0, 0xc420750800, 0x0, 0x0, 0xc42098fba8)
   841         /usr/lib/golang/src/net/textproto/reader.go:36 +0x2b
   842 net/http.readRequest(0xc4200599e0, 0x0, 0xc420750800, 0x0, 0x0)
   843         /usr/lib/golang/src/net/http/request.go:925 +0x99
   844 net/http.(*conn).readRequest(0xc42096c280, 0x3435900, 0xc4207e4880, 0x0, 0x0, 0x0)
   845         /usr/lib/golang/src/net/http/server.go:933 +0x17c
   846 net/http.(*conn).serve(0xc42096c280, 0x3435900, 0xc4207e4880)
   847         /usr/lib/golang/src/net/http/server.go:1739 +0x50e
   848 created by net/http.(*Server).Serve
   849         /usr/lib/golang/src/net/http/server.go:2720 +0x288

Do you guys know what's happening?

We can reproduce this fairly easily also, I'm looking into this but if you have something in mind and need some testing, I can give access to a vm which can reproduce the leak.

Thanks for the report. Do you know which cmux commit (git sha1) is vendored in your project?

AFAICT, you're using cmux on a unix socket and the only way to get around this is to add a read timeout to cmux. I'll figure something out for that.

Please try setting a reasonable read timeout:
https://github.com/soheilhy/cmux/blob/master/cmux.go#L99
that should probably fix your issue.

Thanks! I'll try that and report back!

BTW we're using your latest release, which timeout do you think is reasonable though?

It really depends on the characteristics of that unix socket. The one above is blocked on read for 215 minutes. I'd say something about 1-2 minutes should be enough to give up on the unix socket.

It really depends on the characteristics of that unix socket. The one above is blocked on read for 215 minutes. I'd say something about 1-2 minutes should be enough to give up on the unix socket.

thanks I'll try with a 1 minute timeout

unfortunately even setting a read timeout isn't helping:

# TYPE go_goroutines gauge
go_goroutines 1624

All of these go routines are stuck in IO wait? can you send me the stack trace of a couple of them? I'm not sure how to reproduce.

All of these go routines are stuck in IO wait? can you send me the stack trace of a couple of them? I'm not sure how to reproduce.

I can send you a stack trace but they look exactly the same as the one I pasted in the first comment

do you still see "215 minutes" of poll time? if yes, there might be a bug in how we set read timeout in cmux.

do you still see "215 minutes" of poll time? if yes, there might be a bug in how we set read timeout in cmux.

I'll provide more goroutine stacks tomorrow but I'm pretty sure the goroutine counts wasn't lowering after the read timeout (in my case I set 1 minute). So maybe read timeout isn't fully respected.

Oh wait. I just took a deeper look and it seems like that the go-routine isn't leaked by cmux and in fact cmux has already multiplexed the connection to the actual listener. As you can see in the stack trace, the "src/net/textproto/reader.go" is trying to read the proto payload from the connection. The reason that you see cmux in the stack trace is that cmux wraps the connection to pass the sniffed payload first to Read calls. In this case, the sniffed payload was read and the proto library is still trying to read from the connection.

This is exactly why setting the read timeout on cmux isn't effective. The read connection timeout is only for when cmux is reading from the connection.

IMO, this is not a cmux issue but rather an issue on your listener. You probably want to set read timeout when your listener accepts the connection.

Here is the annotated stacktrace:

   813 goroutine 754 [IO wait, 215 minutes]:
   814 internal/poll.runtime_pollWait(0x7eff86b282b0, 0x72, 0x0)
   815         /usr/lib/golang/src/runtime/netpoll.go:173 +0x57
   816 internal/poll.(*pollDesc).wait(0xc420329718, 0x72, 0xffffffffffffff00, 0x341d980, 0x340d       888)
   817         /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:85 +0xae
   818 internal/poll.(*pollDesc).waitRead(0xc420329718, 0xc42172a000, 0x1000, 0x1000)
   819         /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:90 +0x3d
 
// And you're stuck here: 

  820 internal/poll.(*FD).Read(0xc420329700, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
   821         /usr/lib/golang/src/internal/poll/fd_unix.go:125 +0x18a
   822 net.(*netFD).Read(0xc420329700, 0xc42172a000, 0x1000, 0x1000, 0x13ce, 0xc420956000, 0x13       ce)
   823         /usr/lib/golang/src/net/fd_unix.go:202 +0x52

// CMux is simply reading the connection as a wrapper.  

   824 net.(*conn).Read(0xc4207b2158, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
   825         /usr/lib/golang/src/net/net.go:176 +0x6d
   826 github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux.(*bufferedReader).       Read(0xc4207be380, 0xc42172a000, 0x1000, 0x1000, 0xc42098f8c8, 0x74db11, 0xc4202e9688)
   827         /root/go/src/github.com/kubernetes-incubator/cri-o/_output/src/github.com/kubern       etes-incubator/cri-o/vendor/github.com/soheilhy/cmux/buffer.go:53 +0x144
   828 github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux.(*MuxConn).Read(0x       c4207be370, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x13ce, 0x1c6e)
   829         /root/go/src/github.com/kubernetes-incubator/cri-o/_output/src/github.com/kubern       etes-incubator/cri-o/vendor/github.com/soheilhy/cmux/cmux.go:259 +0x4f
   830 net/http.(*connReader).Read(0xc4202e9680, 0xc42172a000, 0x1000, 0x1000, 0xc420460380, 0x       1cee4a0, 0xc420460380)

// Bufio is filling the buffer:

   831         /usr/lib/golang/src/net/http/server.go:753 +0x105
   832 bufio.(*Reader).fill(0xc4200599e0)
   833         /usr/lib/golang/src/bufio/bufio.go:97 +0x11a
   834 bufio.(*Reader).ReadSlice(0xc4200599e0, 0x46860a, 0x48d7dd, 0xc42098fa20, 0xfffffffe1104       c0ba, 0xc42098fa60, 0x48da77)
   835         /usr/lib/golang/src/bufio/bufio.go:338 +0x2c
   836 bufio.(*Reader).ReadLine(0xc4200599e0, 0x100, 0xf8, 0x2157540, 0x468676, 0x22002098faa8,        0xf8)

// Proto library is trying to read the line:

   837         /usr/lib/golang/src/bufio/bufio.go:367 +0x34
   838 net/textproto.(*Reader).readLineSlice(0xc4202e96b0, 0xc42098fb20, 0xc42098fb20, 0x41b6d8       , 0x100, 0x2157540)
   839         /usr/lib/golang/src/net/textproto/reader.go:55 +0x70
   840 net/textproto.(*Reader).ReadLine(0xc4202e96b0, 0xc420750800, 0x0, 0x0, 0xc42098fba8)
   841         /usr/lib/golang/src/net/textproto/reader.go:36 +0x2b
   842 net/http.readRequest(0xc4200599e0, 0x0, 0xc420750800, 0x0, 0x0)
   843         /usr/lib/golang/src/net/http/request.go:925 +0x99
   844 net/http.(*conn).readRequest(0xc42096c280, 0x3435900, 0xc4207e4880, 0x0, 0x0, 0x0)
   845         /usr/lib/golang/src/net/http/server.go:933 +0x17c

// The actual HTTP server is serving the connection:

   846 net/http.(*conn).serve(0xc42096c280, 0x3435900, 0xc4207e4880)
   847         /usr/lib/golang/src/net/http/server.go:1739 +0x50e
   848 created by net/http.(*Server).Serve
   849         /usr/lib/golang/src/net/http/server.go:2720 +0x288

So the listener is a normal golang unix listener, should I set the read timeout there then?

Yes, exactly. On the connection returned from listener.Accept(), you need to set https://golang.org/pkg/net/#UnixConn.SetReadDeadline

Yes, exactly. On the connection returned from listener.Accept(), you need to set https://golang.org/pkg/net/#UnixConn.SetReadDeadline

wait, how do we set that? Our low is basically:

listener := net.Listen("unix", socketPath)
m := cmux.New(listener)
[...] m.Match( [...]
m.Serve()

Yes, exactly. On the connection returned from listener.Accept(), you need to set https://golang.org/pkg/net/#UnixConn.SetReadDeadline
wait, how do we set that? Our low is basically:

listener := net.Listen("unix", socketPath)
m := cmux.New(listener)
[...] m.Match( [...]
m.Serve()

actually, we can set ReadTimeout on the http.Server

alright, that seems to be draining goroutines, I set it to 5 * time.Second and it seems to work :) Thanks!

Great! :-)