cmux leaks memory over long term use
chandradeepak opened this issue · 21 comments
The leak happened again, under the same conditions.
I did a pprof, and this is what I found:
(cmux.serve) -> HTTP2MatchHeaderFieldSendSettings->matchHTTP2Field->ReadFrame->readFrameHeader->io.ReadFull->ReadAtLeast->bufferedReader.Read->s.buffer.Write->bytes.(*Buffer).grow->bytes.makeSlice
same path as before.
Do you think a rogue client could cause a connection leak even after the connection is closed? Or could a lot of rogue clients be trying to connect, getting rejected due to bad authorization, but having their bufferedReader bytes never freed by the garbage collector even after the connection is rejected?
The usage I have seen is in-use space, not alloc space, which means the memory is still live and not being collected by the Go garbage collector. The part that is hard to reproduce is knowing when this happens. The last fix you gave would terminate any bad connection immediately, since we are checking the continuation frames as well. But we have seen the leak happen in production, with server nodes shutting down after 10 days due to running out of memory.
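For reference, here is a minimal sketch of how such a heap profile can be captured, assuming the server exposes net/http/pprof; the listen address and port are placeholders, not the reporter's actual setup:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on a side port.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}

With that in place, running "go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap" reports live, not-yet-collected memory, while -alloc_space reports cumulative allocations; that distinction is what suggests the buffers are being retained rather than merely allocated.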
Well, even if the code path is the major source of allocations in pprof, it doesn't really mean there is a legit leak.
Having said that, I can think of two explanations:
- A client or a collection of clients are sending frames that are neither settings nor header.
- A client or a collection of clients are sending very large frames.
I'm not sure which one is the root cause. Can you please apply the following (hacky) patch to your project and test whether you see a major change? Feel free to increase the 64K and 128 hardcoded values if needed.
diff --git a/matchers.go b/matchers.go
index 485ede8..74a3d10 100644
--- a/matchers.go
+++ b/matchers.go
@@ -146,6 +146,7 @@ func matchHTTP2Field(w io.Writer, r io.Reader, name, value string) (matched bool
done := false
framer := http2.NewFramer(w, r)
+ framer.SetMaxReadFrameSize(65536)
hdec := hpack.NewDecoder(uint32(4<<10), func(hf hpack.HeaderField) {
if hf.Name == name {
done = true
@@ -154,7 +155,7 @@ func matchHTTP2Field(w io.Writer, r io.Reader, name, value string) (matched bool
}
}
})
- for {
+ for i := 0; i < 128; i++ {
f, err := framer.ReadFrame()
if err != nil {
return false
@@ -181,4 +182,5 @@ func matchHTTP2Field(w io.Writer, r io.Reader, name, value string) (matched bool
return matched
}
}
+ return matched
}
Yeah, we have Java clients that connect. For the second scenario, when you say large frames, what size do you expect them to be? I don't expect that to happen, because the memory growth is over a long time frame, typically a span of 15 days. So can we rule that out?
As for clients that send frames that are neither settings nor headers, I suspect some Java clients may be behaving like that. I couldn't reproduce the scenario with Go clients in my tests.
Thanks for the workaround. I will try it and let you know, but it will take a while to see results since the memory growth happens slowly.
Hello soheilhy,
I have more details about when this memory growth might happen, since we see it consistently in our production. I tried the above fix, but it causes all of the clients to have their connections rejected. That exposed an interesting aspect: all the clients are sending the correct headers frame so late that it arrives after the 128-frame limit.
So we tried simulating that using the sample gRPC hello world program, and these were our observations. Please download the latest sample client and server from
https://github.com/chandradeepak/memgrowth.
I have your fix in the code, with a couple of print statements, in
memgrowth/server/vendor/github.com/soheilhy/cmux/matchers.go
I tweaked it to find out when the correct frame shows up while multiple connections are being made: I loop forever, and once the loop finds the frame I print how many iterations it took. With multiple connections we see the loop count increase before the correct frame appears.
And I have seen the count go up to 1000K as well. If a single connection reads a million frames before the match, you can expect memory growth, and that would leak.
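For reference, here is a hedged reconstruction of that instrumented loop, pieced together from the patch and snippets quoted in this thread; the exact cmux code and the print statement are assumptions, not the actual matchers.go:

package matchers

import (
	"io"
	"log"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/hpack"
)

// countFramesUntilHeader mirrors the shape of matchHTTP2Field but counts how
// many frames are read before the expected header field shows up, to surface
// the behaviour described above (the correct HEADERS frame arriving very late).
func countFramesUntilHeader(w io.Writer, r io.Reader, name, value string) (matched bool) {
	done := false
	framer := http2.NewFramer(w, r)
	hdec := hpack.NewDecoder(uint32(4<<10), func(hf hpack.HeaderField) {
		if hf.Name == name {
			done = true
			if hf.Value == value {
				matched = true
			}
		}
	})
	for count := 1; ; count++ {
		f, err := framer.ReadFrame()
		if err != nil {
			return false
		}
		switch f := f.(type) {
		case *http2.SettingsFrame:
			// Pre-fix behaviour: answer every SETTINGS frame, including acks.
			if err := framer.WriteSettings(); err != nil {
				return false
			}
		case *http2.HeadersFrame:
			if _, err := hdec.Write(f.HeaderBlockFragment()); err != nil {
				return false
			}
			if done {
				log.Printf("matched %q after %d frames", name, count)
				return matched
			}
		}
	}
}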
One other notable point: I didn't see this with the sample hello world proto. So I added a new RPC with a bidirectional stream, called Send, and that triggered it.
It is in server/vendor/google.golang.org/grpc/examples/helloworld/helloworld/helloworld.proto:
rpc Send (stream HelloRequest) returns (stream HelloReply) {}
We have a pool of connections which create multiple streams and send data one after another, and I assume there is a new bufferedReader for each MuxConn, so multiple connections shouldn't impact each other.
Please run the sample client and server to find out more. I am not sure where the problem is; it might be in gRPC.
Please let us know what you think, and let me know if you need help reproducing it.
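For anyone reproducing this without the linked repo, here is a rough sketch of the client side under the assumptions above: the helloworld proto extended with the bidirectional Send RPC and regenerated with the usual Go gRPC plugin (so the streaming Send method exists on the generated client), and a local server on port 50051. It is an illustration of the traffic pattern, not the exact memgrowth client:

package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	pb "google.golang.org/grpc/examples/helloworld/helloworld" // assumed to include the added Send RPC
)

func main() {
	// A small pool of connections, each opening several long-lived
	// bidirectional streams, to mimic the traffic pattern described above.
	for c := 0; c < 4; c++ {
		conn, err := grpc.Dial("localhost:50051", grpc.WithInsecure())
		if err != nil {
			log.Fatalf("dial: %v", err)
		}
		// Connections stay open for the lifetime of the process.
		client := pb.NewGreeterClient(conn)

		for s := 0; s < 8; s++ {
			go func() {
				stream, err := client.Send(context.Background())
				if err != nil {
					log.Printf("open stream: %v", err)
					return
				}
				for {
					if err := stream.Send(&pb.HelloRequest{Name: "ping"}); err != nil {
						return
					}
					if _, err := stream.Recv(); err != nil {
						return
					}
				}
			}()
		}
	}
	select {} // keep the streams running
}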
One thing I found out is in this part of matchHTTP2Field:

func matchHTTP2Field(w io.Writer, r io.Reader, name, value string) (matched bool) {
	...
	case *http2.SettingsFrame:
		if !f.IsAck() {
			if err := framer.WriteSettings(); err != nil {
				return false
			}
		}
	...
}
If we ignore settings frames that carry the ack flag, we can get rid of what I saw earlier. It seems we were responding to settings frames with acks as well, which caused this. But settings frames carry no payload, so that doesn't explain the memory increase. Still debugging.
@chandradeepak, any chance #42 is also related?
@tamalsaha, yeah, it might be; we also saw high CPU usage when the memory growth was high. From your call graph it seems most of the time is spent reading the settings frame, writing the settings frame, and writing the settings frame ack. So it might be related to what I see.
I'm still trying to understand whether this is expected or not. After all, I need to dump the settings frame, and that consumes more CPU and memory.
Can you guys test what happens if we change WriteSettings to WriteSettingsAck on this line: line 165 at commit bf4a8ed?
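For clarity, here is a sketch of what that change might look like in the settings case, assuming the structure of the snippet quoted earlier in this thread; WriteSettingsAck is the http2.Framer method that writes an empty SETTINGS frame with the ACK flag:

package matchers

import "golang.org/x/net/http2"

// ackSettings is a minimal sketch of the suggested change: instead of dumping
// our own settings with framer.WriteSettings(), acknowledge the client's
// SETTINGS frame with framer.WriteSettingsAck(). The IsAck guard keeps us
// from answering an ack with another ack.
func ackSettings(framer *http2.Framer, f *http2.SettingsFrame) bool {
	if f.IsAck() {
		return true
	}
	return framer.WriteSettingsAck() == nil
}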
Sure, I will test it and let you know.
Is there any progress on this issue?
We have stopped using cmux because of the CPU usage issue.
Sorry to hear that, but you are using m.MatchWithWriters(...), and that's the expected consequence. Your problem will go away if you use m.Match.
The issue is we have Java clients, and they need MatchWithWriters.
Hi,
in our case I deployed the fix on some nodes, but due to load balancing the traffic is not being routed to the nodes that have the fix. I will come back with results once we get real traffic. Sorry for the delay.
Thanks Chandra!
@soheilhy, do you know why Java clients need that settings frame? Do you think it would be worth filing a bug with the java-grpc repo to remove this requirement?
@tamalsaha, Java gRPC didn't need to implement it the way they did in order to be standards-conformant. It's really worth trying to fix it there. Thanks!
I filed a bug and cc-ed you. We will see what they say. Thanks!
I just merged a patch which might explain why you see the leak. @chandradeepak, could you try with HEAD again, please?
Sure @soheilhy, I will try it out. Recently we haven't been seeing the issue much, and it only rarely occurs on some machines, so I will deploy this fix in our next push and let you know if it fixes the problem. But it takes some time before it reproduces.
Thanks for the fix.
We don't see the memory leak any more. Even if it were leaking elsewhere, pprof would still point at cmux, because that is where the allocation happens when a new request comes in. But we no longer see the leak. We did set both framer.SetMaxReadFrameSize(65536) and the 128-iteration loop limit as you suggested before, so I wonder whether those also have some impact.
Thanks for your patience.
Thank you @chandradeepak for the confirmation. I have a hunch that commit 0068a46 resolved the root cause. Would you mind reverting framer.SetMaxReadFrameSize(65536) and the 128-iteration loop limit for a quick test?