dedis/onet

Memory leak/usage with onet.v2

lbarman opened this issue · 8 comments

Hi guys,

PriFi's memory usage grows over time, to the point where some goroutines crash or time out with an "out-of-memory" error (after a few minutes to tens of minutes, depending on the config):
[screenshot: memory usage growing over time]

I used defer profile.Start().Stop() from github.com/pkg/profile to understand what's going on, and here are the results:
out.pdf
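For reference, a minimal sketch of how that profiling is wired in (runPriFi below is a hypothetical stand-in for the real node startup); profile.MemProfile selects heap profiling instead of the default CPU profile:

package main

import "github.com/pkg/profile"

func main() {
	// Write a heap profile on exit; inspect it with "go tool pprof".
	defer profile.Start(profile.MemProfile).Stop()

	runPriFi() // hypothetical stand-in for the real application code
}

func runPriFi() {
	// ... application code ...
}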

It seems that it stems from this onet.v2 function in network/tcp.go:

// receiveRaw reads the size of the message, then the
// whole message. It returns the raw message as slice of bytes.
// If there is no message available, it blocks until one becomes
// available.
// In case of an error it returns a nil slice and the error.
func (c *TCPConn) receiveRaw() ([]byte, error) {
	c.receiveMutex.Lock()
	defer c.receiveMutex.Unlock()
	readTimeoutLock.RLock()
	c.conn.SetReadDeadline(time.Now().Add(readTimeout))
	readTimeoutLock.RUnlock()
	// First read the size
	var total Size
	if err := binary.Read(c.conn, globalOrder, &total); err != nil {
		return nil, handleError(err)
	}
	if total > MaxPacketSize {
		return nil, fmt.Errorf("%v sends too big packet: %v>%v",
			c.conn.RemoteAddr().String(), total, MaxPacketSize)
	}
	b := make([]byte, total)
	var read Size
	var buffer bytes.Buffer
	for read < total {
		// Read the size of the next packet.
		readTimeoutLock.RLock()
		c.conn.SetReadDeadline(time.Now().Add(readTimeout))
		readTimeoutLock.RUnlock()
		n, err := c.conn.Read(b)
		// Quit if there is an error.
		if err != nil {
			c.updateRx(4 + uint64(read))
			return nil, handleError(err)
		}
		// Append the read bytes into the buffer.
		if _, err := buffer.Write(b[:n]); err != nil {
			log.Error("Couldn't write to buffer:", err)
		}
		read += Size(n)
		b = b[n:]
	}

	// register how many bytes we read. (4 is for the frame size
	// that we read up above).
	c.updateRx(4 + uint64(read))
	return buffer.Bytes(), nil
}

Especially the buffer.Write(b[:n]) part. But buffer seems to be reset every time a new packet arrives, and b clearly is too.

Any idea what could be going wrong?

PriFi details:

  • I do send a lot: 10 Mbps of incoming data is not surprising
  • I do store this data (up to a point) and then free it; this seems to work, since it happens in the functions shown in the right-hand columns of the PDF above, which show reasonable memory usage.

Additional "details" of the memory usage - if you can make sense of it. tdr: the heap gets filled and starts swapping at some point
[screenshot: heap usage details]

Looking at network/tcp.go, I don't think this has been fixed. What would be awesome is a single TestReceiveRaw that shows the memory leak - then it will be much easier to fix. I'm not sure whether to point at Go's garbage collector (it would be the first time in 4 years), or whether onet really never releases the memory, which would be why it is not collected.
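To make that concrete, here is a rough sketch of the shape such a test could take (it does not touch onet's unexported TCPConn: io.ReadFull stands in for calling receiveRaw directly, and the message size, count and slack threshold are arbitrary assumptions):

package network_test

import (
	"encoding/binary"
	"io"
	"net"
	"runtime"
	"testing"
)

func TestReceiveRawDoesNotLeak(t *testing.T) {
	sender, receiver := net.Pipe()
	defer sender.Close()
	defer receiver.Close()

	const msgSize = 1 << 16
	const msgCount = 1000

	// Writer side: send msgCount length-prefixed messages.
	go func() {
		payload := make([]byte, msgSize)
		for i := 0; i < msgCount; i++ {
			// Errors are ignored in this sketch.
			binary.Write(sender, binary.BigEndian, uint32(len(payload)))
			sender.Write(payload)
		}
	}()

	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	// Reader side: read and drop every message, as receiveRaw's caller would
	// if it kept no references.
	for i := 0; i < msgCount; i++ {
		var total uint32
		if err := binary.Read(receiver, binary.BigEndian, &total); err != nil {
			t.Fatal(err)
		}
		b := make([]byte, total)
		if _, err := io.ReadFull(receiver, b); err != nil {
			t.Fatal(err)
		}
	}

	runtime.GC()
	runtime.ReadMemStats(&after)
	// The live heap should not grow with msgCount; allow a few messages of slack.
	if after.HeapAlloc > before.HeapAlloc+8*msgSize {
		t.Errorf("heap grew from %d to %d bytes", before.HeapAlloc, after.HeapAlloc)
	}
}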

Also, if it is garbage collection, it might be interesting to test with the latest Go version and see if it is still there...

Some preliminary thoughts:

  1. I agree that reproducing with the latest Go is always a good idea.
  2. It would be simple to rewrite that function to not use bytes.Buffer.Write().
  3. It should be easy to reproduce the behaviour of "open one tcp connection and send thousands of messages across it" (see the sketch after this list). However, doing that in isolation from the downstream users of the messages might not show the problem; even that would be a helpful data point, I guess.
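A minimal sketch of such a sender (the address, message size and count below are arbitrary assumptions; it just writes 4-byte-size-prefixed frames, which must use the same byte order as onet's globalOrder):

package main

import (
	"encoding/binary"
	"log"
	"net"
)

func main() {
	// Hypothetical address of a node running the receiving code.
	conn, err := net.Dial("tcp", "127.0.0.1:7770")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	payload := make([]byte, 64*1024)
	for i := 0; i < 10000; i++ {
		// 4-byte size prefix followed by the payload; big-endian is an
		// assumption here and must match onet's globalOrder.
		if err := binary.Write(conn, binary.BigEndian, uint32(len(payload))); err != nil {
			log.Fatal(err)
		}
		if _, err := conn.Write(payload); err != nil {
			log.Fatal(err)
		}
	}
}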

@lbarman please tell me what version of Go you are building with, so I can repro that exactly. Also, would you be willing to run the exact same experiment, but with a copy of receiveRaw that does not use bytes.Buffer? I'll write that and provide it to you first, in my next comment on this issue.

Hi Jeff, thanks for the help. I'm on Go 1.11.5; I'll try with the latest Go version, and also the latest onet (I'm on gopkg.in/onetv2).

I'm not 100% sure the leak comes from onet: I copied the received bytes (so as not to keep pointers around), and while onet's memory consumption is still high, it became higher in my code. I'll keep you updated.
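For context, the copy mentioned above looks roughly like this (a hypothetical handler, not the actual PriFi code); it matters because any long-lived reference to the message keeps its whole backing array alive:

// handlePayload copies the received bytes into a fresh slice so that
// long-lived state does not retain the buffer allocated by the network layer.
func handlePayload(raw []byte) []byte {
	owned := make([]byte, len(raw))
	copy(owned, raw)
	return owned
}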

I definitely found a bug on my side, which makes me doubt that the memory leak is in onet. I will experiment further.

The patch to not use bytes.Buffer:

diff --git a/network/tcp.go b/network/tcp.go
index 6e90cd1..2787af0 100644
--- a/network/tcp.go
+++ b/network/tcp.go
@@ -1,7 +1,6 @@
 package network
 
 import (
-	"bytes"
 	"encoding/binary"
 	"errors"
 	"fmt"
@@ -121,6 +120,7 @@ func (c *TCPConn) receiveRaw() ([]byte, error) {
 	readTimeoutLock.RLock()
 	c.conn.SetReadDeadline(time.Now().Add(readTimeout))
 	readTimeoutLock.RUnlock()
+
 	// First read the size
 	var total Size
 	if err := binary.Read(c.conn, globalOrder, &total); err != nil {
@@ -132,31 +132,24 @@ func (c *TCPConn) receiveRaw() ([]byte, error) {
 	}
 
 	b := make([]byte, total)
-	var read Size
-	var buffer bytes.Buffer
-	for read < total {
-		// Read the size of the next packet.
-		readTimeoutLock.RLock()
-		c.conn.SetReadDeadline(time.Now().Add(readTimeout))
-		readTimeoutLock.RUnlock()
-		n, err := c.conn.Read(b)
+	read := 0
+
+	for read < int(total) {
+		n, err := c.conn.Read(b[read:])
+		read += n
+
 		// Quit if there is an error.
 		if err != nil {
 			c.updateRx(4 + uint64(read))
 			return nil, handleError(err)
 		}
-		// Append the read bytes into the buffer.
-		if _, err := buffer.Write(b[:n]); err != nil {
-			log.Error("Couldn't write to buffer:", err)
-		}
-		read += Size(n)
-		b = b[n:]
 	}
 
 	// register how many bytes we read. (4 is for the frame size
 	// that we read up above).
 	c.updateRx(4 + uint64(read))
-	return buffer.Bytes(), nil
+
+	return b[0:read], nil
 }
 
 // Send converts the NetworkMessage into an ApplicationMessage

(sorry, that patch was done hours ago, but got stuck in my browser)
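For what it's worth, the loop in that patch is essentially what io.ReadFull from the standard library does, so the body could be shortened further. A sketch of the tail of receiveRaw under that assumption (it reuses the surrounding function's c, total, handleError and updateRx exactly as quoted above, with "io" added to the imports):

	b := make([]byte, total)
	// ReadFull keeps calling c.conn.Read until b is completely filled or an
	// error occurs, which is what the manual loop in the patch does.
	n, err := io.ReadFull(c.conn, b)
	// Register the bytes we did read (4 is for the frame size read above),
	// whether or not an error occurred.
	c.updateRx(4 + uint64(n))
	if err != nil {
		return nil, handleError(err)
	}
	return b, nil

One behavioural difference: io.ReadFull reports a short read as io.ErrUnexpectedEOF rather than returning the raw Read error.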

I have tried to reproduce this and have found that with and without the "no Buffer" patch, I can only reproduce growing memory if I have the code that is receiving messages hold references to them on purpose.

Thanks a lot for investigating.
"I can only reproduce growing memory if I have the code that is receiving messages hold references to them on purpose."
Yes, that was indeed the problem: a rate-limiting function that was broken. It's fixed and doesn't seem to be leaking:

[Figure_1: memory usage after the fix]

I got misled by the pprof report: indeed, all memory allocations were done in onet, but only because I was passing pointers around afterwards (a good thing). The "leak" was simply an unruly server which did not abide by the rate-limiting under some conditions :/ . But this made me find a subtle bug, and now performance is much better 🎉
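Not PriFi's actual code, but the kind of bound that rate-limiting relies on can be sketched as a fixed-capacity queue which drops messages from a sender that ignores the limit, so a misbehaving peer cannot make memory grow without bound (all names below are hypothetical):

// boundedQueue holds at most cap(ch) pending messages.
type boundedQueue struct {
	ch chan []byte
}

func newBoundedQueue(capacity int) *boundedQueue {
	return &boundedQueue{ch: make(chan []byte, capacity)}
}

// enqueue returns false (and drops msg) when the queue is full, so a sender
// that does not abide by the rate limit cannot grow the heap indefinitely.
func (q *boundedQueue) enqueue(msg []byte) bool {
	select {
	case q.ch <- msg:
		return true
	default:
		return false
	}
}

// dequeue blocks until a message is available.
func (q *boundedQueue) dequeue() []byte {
	return <-q.ch
}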

Thanks a lot for the help!

We can close this on my side, unless you want me to test the performance with/without the buffer.