Crashes on DNS resolvers receiving thousands of requests
jaredledvina opened this issue · 8 comments
Stack Trace:
panic: runtime error: index out of range
goroutine 20 [running]:
panic(0x5ba220, 0xc4200120a0)
/usr/lib/golang/src/runtime/panic.go:500 +0x1a1
main.initLogEntry(0xc4236eb2f0, 0x4, 0x10, 0xc4236eb310, 0x4, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
/builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:157 +0x541
main.handleDns(0xc421450000, 0xc423560180, 0xc42000c1e0, 0xc4236eb2f0, 0x4, 0x10, 0xc4236eb310, 0x4, 0x10)
/builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:224 +0x59e
main.handlePacket(0xc42000c300, 0xc42000c1e0, 0x29e8d60800, 0xfffffff207b8a800, 0x1, 0x0)
/builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:291 +0x423
created by main.doCapture
/builddir/build/BUILD/gopassivedns-aa047e156339bc54f46cf6ccc7fa99f24d94c9d2/gocode/src/github.com/phillipmartin/gopassivedns/main.go:383 +0x278
We have a few DNS resolvers that handle between 2800-3800 requests per minute (according to the logs gopassivedns generates). It's much more frequent on resolves seeing >7500 requests per minute. It seems that after a while and not consistently the process will die and dump the above stack trace. This is not an issue at all on resolves seeing between 200-700 requests per minute. Happy to provide any other information that might be useful!
I believe this is happening due to this section
Lines 376 to 402 in a513e78
If it helps, in our case, our nproc is 8.
interesting, if you can compile a custom version and test, see if changing the channel size at
Line 379 in a513e78
looking at this a little more in depth, I think this is a regression I introduced when I refactored handlePacket to take a channel instead of an individual packet. We turn on ZeroCopy when we capture packets because there is a performance win, however, that means that libpcap is using a circular buffer internally to hold the packets for processing. If the packet capture rate causes the packet being processed to be overwritten before we finish processing it, we can get crazy-looking errors like from your stacktrace because the underlying data changes. An easy way to test this is to comment out https://github.com/Phillipmartin/gopassivedns/blob/master/main.go#L394, re-compile and run the binary on one of these heavy-load systems. If it does not show the same failure, we can look at turning off ZeroCopy or copying the packet data (depending on the perf impact of each strategy)
Increasing the channel buffer as previously mentioned
Line 379 in a513e78
has dramatically improved the stability of gopassivedns on our resolvers doing ~5k qps. However as mentioned above it is not a bullet proof fix.
I will investigate further the recommended NoCopy change to see if that has any direct improvement in uptime vs cpu usage penalties.
@jaredledvina a pull request has recently been merged, in our use case we have seen no crash with significant DNS query load. I would be very interested to know if you have time to see if you see improvements in stability.
Hey @jimmystewpot ,
Oh nice, it's been a while since I've worked with our install of gopassivedns
. I can follow up and see if we can deploy packages for the latest version. Thanks for the ping!
@jaredledvina did you end up seeing any improvements?
We're currently looking at reworking how we've built our gopassivedns
RPM's so, currently I haven't had a chance to get the latest commits into production. Once we iron that out though, I can follow up as well.