grafana/carbon-relay-ng

"too many open files" possible when using spooling?

Dieterbe opened this issue · 2 comments

A customer ran into "too many open files" when using a basic grafanaNet route and a TCP route with spooling.
TODO:

  1. check out the code to look for any obvious file descriptor leaks
  2. add instrumentation - if we don't have any already - to track the number of open files (opened by CRNG)

Hypothesis

usage of spooling in TCP routes (or reconnecting connections) leaks file descriptors (i.e. opens many and doesn't close them).

Investigation approach

  • dig in at the destination level, as it owns the TCP connections and spooling
  • don't investigate route, table and other structures, as they are removed from the TCP connections and spooling.
  • focus on "normal" operation with a TCP destination running: connections dropping and reconnecting, spool usage, etc. Not shutdown of the relay.

Learnings

Diagram of where FD's are created and owned

Destination{
    connUpdate: chan *Conn {
        conn *net.TCPConn  // these conns get created by NewConn(), which is
                           // called in dest.updateConn(), and are owned by dest.relay()
    }
    spool: Spool{
        queue: nsqd.Diskqueue
    }
}

Connections

  • does Conn close conn: yes, upon a network error or a call to Close(). (*)
    Conn does not touch its writer (which embeds the conn) when closing the conn, but that should be OK.

  • updateConn does not close the prior connection, because it does not have access to it.
    updateConn is called from...

  1. destination.relay(), once at startup (when no pre-existing conn), and once if the conn has gone down, at which point it should have been closed (see isAlive check) (*)
  2. when updating the destination's address, when a pre-existing, active conn may be live -> this was a bug that is now fixed: the previous conn was not being closed. #481

(*) these mechanisms rely on channel messaging between different goroutines. If a goroutine gets stuck, processing may not properly advance. This is best diagnosed with goroutine dumps.

Spool/diskqueue

  • dest.Run creates the spool via NewSpool() (and that is the only place NewSpool is called)
  • dest.Run runs only when route.Run() and AddDest() are called, which are considered "safe" paths.

So then, looking at NewSpool()...
the spool stays tied to its destination until the dest is shut down, which does not happen during "normal operation". Likewise, one diskqueue is tied to one spool for its entire lifetime.
So it comes down to the diskqueue. After some digging, it seems that it correctly opens/closes its file descriptors.

How to avoid in the future?

see the diagnostics instructions added in #481
TODO: instrumentation for open FDs?

The two ways I see to instrument open FDs are:

  1. polling /proc/<pid>/fd and counting the number of entries in there.
  2. accounting ourselves, any time we open/close a conn or file.

1 is expensive, 2 is verbose and possibly inaccurate.

I think I will scrap that idea for now.