cmux leaks goroutines
runcom opened this issue · 19 comments
Hey there, we're using cmux to serve HTTP and gRPC together, but our application receives a very large number of requests (both gRPC and HTTP). We see tons of leaked goroutines like:
goroutine 754 [IO wait, 215 minutes]:
internal/poll.runtime_pollWait(0x7eff86b282b0, 0x72, 0x0)
        /usr/lib/golang/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420329718, 0x72, 0xffffffffffffff00, 0x341d980, 0x340d888)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc420329718, 0xc42172a000, 0x1000, 0x1000)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc420329700, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/internal/poll/fd_unix.go:125 +0x18a
net.(*netFD).Read(0xc420329700, 0xc42172a000, 0x1000, 0x1000, 0x13ce, 0xc420956000, 0x13ce)
        /usr/lib/golang/src/net/fd_unix.go:202 +0x52
net.(*conn).Read(0xc4207b2158, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/net.go:176 +0x6d
github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux.(*bufferedReader).Read(0xc4207be380, 0xc42172a000, 0x1000, 0x1000, 0xc42098f8c8, 0x74db11, 0xc4202e9688)
        /root/go/src/github.com/kubernetes-incubator/cri-o/_output/src/github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux/buffer.go:53 +0x144
github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux.(*MuxConn).Read(0xc4207be370, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x13ce, 0x1c6e)
        /root/go/src/github.com/kubernetes-incubator/cri-o/_output/src/github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux/cmux.go:259 +0x4f
net/http.(*connReader).Read(0xc4202e9680, 0xc42172a000, 0x1000, 0x1000, 0xc420460380, 0x1cee4a0, 0xc420460380)
        /usr/lib/golang/src/net/http/server.go:753 +0x105
bufio.(*Reader).fill(0xc4200599e0)
        /usr/lib/golang/src/bufio/bufio.go:97 +0x11a
bufio.(*Reader).ReadSlice(0xc4200599e0, 0x46860a, 0x48d7dd, 0xc42098fa20, 0xfffffffe1104c0ba, 0xc42098fa60, 0x48da77)
        /usr/lib/golang/src/bufio/bufio.go:338 +0x2c
bufio.(*Reader).ReadLine(0xc4200599e0, 0x100, 0xf8, 0x2157540, 0x468676, 0x22002098faa8, 0xf8)
        /usr/lib/golang/src/bufio/bufio.go:367 +0x34
net/textproto.(*Reader).readLineSlice(0xc4202e96b0, 0xc42098fb20, 0xc42098fb20, 0x41b6d8, 0x100, 0x2157540)
        /usr/lib/golang/src/net/textproto/reader.go:55 +0x70
net/textproto.(*Reader).ReadLine(0xc4202e96b0, 0xc420750800, 0x0, 0x0, 0xc42098fba8)
        /usr/lib/golang/src/net/textproto/reader.go:36 +0x2b
net/http.readRequest(0xc4200599e0, 0x0, 0xc420750800, 0x0, 0x0)
        /usr/lib/golang/src/net/http/request.go:925 +0x99
net/http.(*conn).readRequest(0xc42096c280, 0x3435900, 0xc4207e4880, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/http/server.go:933 +0x17c
net/http.(*conn).serve(0xc42096c280, 0x3435900, 0xc4207e4880)
        /usr/lib/golang/src/net/http/server.go:1739 +0x50e
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:2720 +0x288
Do you guys know what's happening?
We can also reproduce this fairly easily. I'm looking into it, but if you have something in mind and need testing, I can give you access to a VM that reproduces the leak.
Thanks for the report. Do you know which cmux commit (git sha1) is vendored in your project?
AFAICT, you're using cmux on a unix socket and the only way to get around this is to add a read timeout to cmux. I'll figure something out for that.
Please try setting a reasonable read timeout:
https://github.com/soheilhy/cmux/blob/master/cmux.go#L99
That should probably fix your issue.
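For illustration, a minimal sketch of how that could be wired up (the socket path, matchers, and servers below are placeholders, not the reporter's actual setup; this assumes the vendored cmux exposes SetReadTimeout as in the linked line):

package main

import (
    "log"
    "net"
    "net/http"
    "time"

    "github.com/soheilhy/cmux"
    "google.golang.org/grpc"
)

func main() {
    // Placeholder socket path; the real one comes from the application's config.
    lis, err := net.Listen("unix", "/tmp/example.sock")
    if err != nil {
        log.Fatal(err)
    }

    m := cmux.New(lis)
    // Bound the time cmux itself may spend sniffing a connection.
    m.SetReadTimeout(1 * time.Minute)

    grpcL := m.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))
    httpL := m.Match(cmux.HTTP1Fast())

    grpcS := grpc.NewServer()
    httpS := &http.Server{Handler: http.DefaultServeMux}

    go grpcS.Serve(grpcL)
    go httpS.Serve(httpL)

    if err := m.Serve(); err != nil {
        log.Fatal(err)
    }
}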
Thanks! I'll try that and report back!
BTW, we're using your latest release. Which timeout do you think is reasonable, though?
It really depends on the characteristics of that unix socket. The one above has been blocked on read for 215 minutes. I'd say something around 1-2 minutes should be enough to give up on the unix socket.
Thanks, I'll try with a 1-minute timeout.
Unfortunately, even setting a read timeout isn't helping:
# TYPE go_goroutines gauge
go_goroutines 1624
All of these goroutines are stuck in IO wait? Can you send me the stack traces of a couple of them? I'm not sure how to reproduce this.
I can send you a stack trace, but they all look exactly the same as the one I pasted in the first comment.
Do you still see "215 minutes" of poll time? If yes, there might be a bug in how we set the read timeout in cmux.
I'll provide more goroutine stacks tomorrow, but I'm pretty sure the goroutine count wasn't dropping after the read timeout (in my case I set 1 minute), so maybe the read timeout isn't fully respected.
Oh wait. I just took a deeper look, and it seems that the goroutine isn't leaked by cmux; in fact, cmux has already multiplexed the connection to the actual listener. As you can see in the stack trace, "src/net/textproto/reader.go" is trying to read the proto payload from the connection. The reason you see cmux in the stack trace is that cmux wraps the connection so it can pass the sniffed payload first to Read calls. In this case, the sniffed payload has already been read and the proto library is still trying to read from the connection.
This is exactly why setting the read timeout on cmux isn't effective: that timeout only applies while cmux itself is reading from the connection.
IMO, this is not a cmux issue but rather an issue on your listener. You probably want to set a read timeout when your listener accepts the connection.
Here is the annotated stacktrace:
goroutine 754 [IO wait, 215 minutes]:
internal/poll.runtime_pollWait(0x7eff86b282b0, 0x72, 0x0)
        /usr/lib/golang/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420329718, 0x72, 0xffffffffffffff00, 0x341d980, 0x340d888)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc420329718, 0xc42172a000, 0x1000, 0x1000)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:90 +0x3d
// And you're stuck here:
internal/poll.(*FD).Read(0xc420329700, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/internal/poll/fd_unix.go:125 +0x18a
net.(*netFD).Read(0xc420329700, 0xc42172a000, 0x1000, 0x1000, 0x13ce, 0xc420956000, 0x13ce)
        /usr/lib/golang/src/net/fd_unix.go:202 +0x52
// CMux is simply reading the connection as a wrapper.
net.(*conn).Read(0xc4207b2158, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/net.go:176 +0x6d
github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux.(*bufferedReader).Read(0xc4207be380, 0xc42172a000, 0x1000, 0x1000, 0xc42098f8c8, 0x74db11, 0xc4202e9688)
        /root/go/src/github.com/kubernetes-incubator/cri-o/_output/src/github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux/buffer.go:53 +0x144
github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux.(*MuxConn).Read(0xc4207be370, 0xc42172a000, 0x1000, 0x1000, 0x0, 0x13ce, 0x1c6e)
        /root/go/src/github.com/kubernetes-incubator/cri-o/_output/src/github.com/kubernetes-incubator/cri-o/vendor/github.com/soheilhy/cmux/cmux.go:259 +0x4f
net/http.(*connReader).Read(0xc4202e9680, 0xc42172a000, 0x1000, 0x1000, 0xc420460380, 0x1cee4a0, 0xc420460380)
        /usr/lib/golang/src/net/http/server.go:753 +0x105
// Bufio is filling the buffer:
bufio.(*Reader).fill(0xc4200599e0)
        /usr/lib/golang/src/bufio/bufio.go:97 +0x11a
bufio.(*Reader).ReadSlice(0xc4200599e0, 0x46860a, 0x48d7dd, 0xc42098fa20, 0xfffffffe1104c0ba, 0xc42098fa60, 0x48da77)
        /usr/lib/golang/src/bufio/bufio.go:338 +0x2c
bufio.(*Reader).ReadLine(0xc4200599e0, 0x100, 0xf8, 0x2157540, 0x468676, 0x22002098faa8, 0xf8)
        /usr/lib/golang/src/bufio/bufio.go:367 +0x34
// Proto library is trying to read the line:
net/textproto.(*Reader).readLineSlice(0xc4202e96b0, 0xc42098fb20, 0xc42098fb20, 0x41b6d8, 0x100, 0x2157540)
        /usr/lib/golang/src/net/textproto/reader.go:55 +0x70
net/textproto.(*Reader).ReadLine(0xc4202e96b0, 0xc420750800, 0x0, 0x0, 0xc42098fba8)
        /usr/lib/golang/src/net/textproto/reader.go:36 +0x2b
net/http.readRequest(0xc4200599e0, 0x0, 0xc420750800, 0x0, 0x0)
        /usr/lib/golang/src/net/http/request.go:925 +0x99
net/http.(*conn).readRequest(0xc42096c280, 0x3435900, 0xc4207e4880, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/http/server.go:933 +0x17c
// The actual HTTP server is serving the connection:
net/http.(*conn).serve(0xc42096c280, 0x3435900, 0xc4207e4880)
        /usr/lib/golang/src/net/http/server.go:1739 +0x50e
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:2720 +0x288
So the listener is a normal Go unix listener; should I set the read timeout there, then?
Yes, exactly. On the connection returned from listener.Accept(), you need to set https://golang.org/pkg/net/#UnixConn.SetReadDeadline
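One way to do that (a hypothetical wrapper, not part of cmux or the project's code) is to wrap the listener so every accepted connection gets a read deadline before cmux and the servers see it:

package main

import (
    "net"
    "time"
)

// deadlineListener is a hypothetical wrapper that puts an initial read
// deadline on every accepted connection. Note the deadline is absolute and
// set once at accept time; long-lived connections would need to refresh it
// (e.g. per request), otherwise they die after the timeout regardless of activity.
type deadlineListener struct {
    net.Listener
    timeout time.Duration
}

func (l deadlineListener) Accept() (net.Conn, error) {
    c, err := l.Listener.Accept()
    if err != nil {
        return nil, err
    }
    if err := c.SetReadDeadline(time.Now().Add(l.timeout)); err != nil {
        c.Close()
        return nil, err
    }
    return c, nil
}

// Usage sketch:
//   lis, _ := net.Listen("unix", socketPath)
//   m := cmux.New(deadlineListener{Listener: lis, timeout: time.Minute})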
Wait, how do we set that? Our flow is basically:
listener, err := net.Listen("unix", socketPath)
[...]
m := cmux.New(listener)
[...] m.Match( [...]
m.Serve()
Actually, we can set ReadTimeout on the http.Server.
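Roughly like this (a fragment; mux and httpL are placeholder names for the application's handler and the listener returned by m.Match):

// The timeout that actually drains the stuck goroutines lives on the
// http.Server, not on cmux.
httpS := &http.Server{
    Handler:     mux,             // application handler (placeholder name)
    ReadTimeout: 5 * time.Second, // give up on clients that stall while sending a request
}
go httpS.Serve(httpL) // httpL: the HTTP listener returned by m.Match(...)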
Alright, that seems to be draining the goroutines. I set it to 5 * time.Second and it seems to work :) Thanks!
Great! :-)