Phillipmartin/gopassivedns

Crashes on DNS resolvers receiving thousands of requests

jaredledvina opened this issue · 8 comments

Stack Trace:

panic: runtime error: index out of range

goroutine 20 [running]:
panic(0x5ba220, 0xc4200120a0)
        /usr/lib/golang/src/runtime/panic.go:500 +0x1a1
main.initLogEntry(0xc4236eb2f0, 0x4, 0x10, 0xc4236eb310, 0x4, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:157 +0x541
main.handleDns(0xc421450000, 0xc423560180, 0xc42000c1e0, 0xc4236eb2f0, 0x4, 0x10, 0xc4236eb310, 0x4, 0x10)
        /builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:224 +0x59e
main.handlePacket(0xc42000c300, 0xc42000c1e0, 0x29e8d60800, 0xfffffff207b8a800, 0x1, 0x0)
        /builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:291 +0x423
created by main.doCapture
        /builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:383 +0x278

We have a few DNS resolvers that handle between 2800-3800 requests per minute (according to the logs gopassivedns generates). The crash is much more frequent on resolvers seeing >7500 requests per minute. After running for a while, and not at consistent intervals, the process dies and dumps the stack trace above. This is not an issue at all on resolvers seeing between 200-700 requests per minute. Happy to provide any other information that might be useful!

I believe this is happening because of this section:

gopassivedns/main.go

Lines 376 to 402 in a513e78

/* init channels for the packet handlers and kick off handler threads */
var channels []chan *packetData
for i := 0; i < config.numprocs; i++ {
	channels = append(channels, make(chan *packetData, 100))
}

for i := 0; i < config.numprocs; i++ {
	go handlePacket(channels[i], logChan, gcIntervalDur, gcAgeDur, i, stats)
}

// Use the handle as a packet source to process all packets
packetSource := gopacket.NewPacketSource(handle, handle.LinkType())

//only decode packet in response to function calls, this moves the
//packet processing to the processing threads
packetSource.DecodeOptions.Lazy = true

//We don't mutate bytes of the packets, so no need to make a copy
//this does mean we need to pass the packet via the channel, not a pointer to the packet
//as the underlying buffer will get re-allocated
packetSource.DecodeOptions.NoCopy = true

/*
	parse up to the IP layer so we can consistently balance the packets across our
	processing threads
	TODO: in the future maybe pass this on the channel to so we don't reparse
	but the profiling I've done doesn't point to this as a problem
*/

If it helps, in our case, our nproc is 8.

Interesting. If you can compile and test a custom version, see whether changing the channel size at

channels = append(channels, make(chan *packetData, 100))
to something much larger (I'd start with 500) makes the crash go away. I could see this being caused by that pre-allocated channel filling up during spikes.
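
A minimal sketch of that tweak, with a hypothetical packetData placeholder standing in for the real struct in main.go; the only change versus the current code is the larger buffer, so short capture spikes queue in the channel instead of backing up into libpcap:

package main

// packetData is a placeholder here; the real struct is defined in main.go.
type packetData struct{}

// makeWorkerChannels pre-allocates one buffered channel per worker.
// Bumping the buffer from 100 to 500 gives each worker more headroom
// during traffic spikes.
func makeWorkerChannels(numprocs int) []chan *packetData {
	channels := make([]chan *packetData, 0, numprocs)
	for i := 0; i < numprocs; i++ {
		channels = append(channels, make(chan *packetData, 500)) // was 100
	}
	return channels
}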

Looking at this in a little more depth, I think this is a regression I introduced when I refactored handlePacket to take a channel instead of an individual packet. We turn on ZeroCopy when we capture packets because it's a performance win; however, that means libpcap uses an internal circular buffer to hold packets for processing. If the capture rate causes the packet currently being processed to be overwritten before we finish with it, we can get crazy-looking errors like the one in your stack trace, because the underlying data changes out from under us. An easy way to test this is to comment out https://github.com/Phillipmartin/gopassivedns/blob/master/main.go#L394, re-compile, and run the binary on one of these heavy-load systems. If it does not show the same failure, we can look at turning off ZeroCopy or copying the packet data (depending on the perf impact of each strategy).
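
For reference, here is a minimal, self-contained sketch of the two mitigations described above, assuming the github.com/google/gopacket import path; the interface name, snaplen, and the bare main() wrapper are stand-ins, and the real doCapture balances packets onto per-worker channels rather than processing them inline:

package main

import (
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

func main() {
	// Stand-in capture setup; gopassivedns builds its handle from config flags.
	handle, err := pcap.OpenLive("eth0", 65535, true, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	packetSource := gopacket.NewPacketSource(handle, handle.LinkType())
	packetSource.DecodeOptions.Lazy = true

	// Option 1: turn off NoCopy so gopacket copies the capture buffer before
	// decoding, keeping the packet valid after libpcap recycles its ring.
	packetSource.DecodeOptions.NoCopy = false

	for packet := range packetSource.Packets() {
		// Option 2 (if NoCopy stays on for speed): take an explicit copy of
		// the raw bytes before handing the packet to a worker goroutine, and
		// re-decode from the private copy so workers never touch libpcap's buffer.
		raw := make([]byte, len(packet.Data()))
		copy(raw, packet.Data())
		owned := gopacket.NewPacket(raw, handle.LinkType(), gopacket.Lazy)
		_ = owned // ...hash/balance and send to the appropriate worker channel
	}
}

Whether option 1 or 2 is cheaper will depend on the traffic profile; both remove the aliasing of libpcap's internal buffer that can make an in-flight packet change underneath the decoder.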

Increasing the channel buffer, as previously suggested, at

channels = append(channels, make(chan *packetData, 100))

has dramatically improved the stability of gopassivedns on our resolvers handling ~5k qps. However, as mentioned above, it is not a bulletproof fix.

I will also investigate the recommended NoCopy change to see whether it gives any direct improvement in uptime and what CPU usage penalty it carries.

@jaredledvina a pull request has recently been merged; in our use case we have seen no crashes under significant DNS query load. I would be very interested to know whether you have time to check if you see the same improvement in stability.

Hey @jimmystewpot ,

Oh nice, it's been a while since I've worked with our install of gopassivedns. I can follow up and see if we can deploy packages for the latest version. Thanks for the ping!

@jaredledvina did you end up seeing any improvements?

We're currently reworking how we build our gopassivedns RPMs, so I haven't yet had a chance to get the latest commits into production. Once we iron that out, I'll follow up here.