logstash-plugins/logstash-input-beats

Better handle exceptions in ConnectionHandler

robbavey opened this issue · 8 comments

ConnectionHandler should implement exceptionCaught so that thrown exceptions are handled more cleanly and less noisily, instead of propagating to the DefaultChannelPipeline.

e.g. (a sketch of such a handler follows the log below)

[2018-07-30T11:11:22,676][WARN ][io.netty.channel.DefaultChannelPipeline] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_131]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_131]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_131]
        at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_131]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_131]
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) ~[logstash-input-tcp-5.0.9.jar:?]
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1108) ~[logstash-input-tcp-5.0.9.jar:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:345) ~[logstash-input-tcp-5.0.9.jar:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126) ~[logstash-input-tcp-5.0.9.jar:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) ~[logstash-input-tcp-5.0.9.jar:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) ~[logstash-input-tcp-5.0.9.jar:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) ~[logstash-input-tcp-5.0.9.jar:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [logstash-input-tcp-5.0.9.jar:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [logstash-input-tcp-5.0.9.jar:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [logstash-input-tcp-5.0.9.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
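
A minimal sketch of such an override (hypothetical code, not the plugin's actual implementation; it assumes ConnectionHandler extends Netty 4.x's ChannelInboundHandlerAdapter and uses Log4j2):

```java
import java.io.IOException;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Hypothetical sketch of the proposed change: catch expected I/O failures
// (such as "Connection reset by peer") in the handler itself, log them in
// one quiet line, and close the channel, so the exception never reaches
// the tail of the DefaultChannelPipeline and triggers the WARN above.
public class ConnectionHandler extends ChannelInboundHandlerAdapter {
    private static final Logger logger = LogManager.getLogger(ConnectionHandler.class);

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        if (cause instanceof IOException) {
            // Peers dropping connections is routine; skip the stack trace.
            logger.info("Connection error on {}: {}", ctx.channel(), cause.getMessage());
        } else {
            // Anything unexpected still gets logged in full.
            logger.warn("Unhandled exception in the beats pipeline", cause);
        }
        ctx.close(); // release the channel instead of leaving it half-open
    }
}
```

The design point is simply that expected I/O failures get one quiet log line while the channel is still closed and released.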
jsvd commented

This issue doesn't seem to be valid, as the exception comes from the tcp input and not the beats input (note the references to [logstash-input-tcp-5.0.9.jar:?] in the stack trace).

praseodym commented

@jsvd 🤦‍♂️

Please reopen this; I’ve come across the same error before, and it is definitely a problem with logstash-input-beats.

The stack trace likely mentions logstash-input-tcp because Netty is loaded from its JAR file, since all Logstash plugins share a single classloader.

jsvd commented

@praseodym you are right.
I've also confirmed this with tcpkill against logstash 6.4.0 and got:

[2018-09-13T10:05:45,788][INFO ][org.logstash.beats.BeatsHandler] [local: 0:0:0:0:0:0:7f00:1:4444, remote: 0.0.0.144:0] Handling exception: Connection reset by peer
[2018-09-13T10:05:45,788][WARN ][io.netty.channel.DefaultChannelPipeline] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_111]
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_111]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_111]
	at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_111]
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_111]
	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1108) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:345) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-all-4.1.18.Final.jar:4.1.18.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-all-4.1.18.Final.jar:4.1.18.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.18.Final.jar:4.1.18.Final]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

I am seeing this in 6.5.4. Does it mean the connection was interrupted, or is filebeat closing the connection properly and logstash logging something erroneous?

I keep seeing customers run into this error on connections from beats to logstash. I have a customer on 6.8 as well as a customer on 7.5.0 reporting these errors. Is there any plan to resolve this problem?

Is the accepted solution simply to set the client inactivity timeout to a very high value?
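
For reference, a sketch of that workaround in the beats input configuration (client_inactivity_timeout is the plugin's existing option, in seconds; 600 is just an illustrative value):

```
input {
  beats {
    port => 5044
    # Default is 60s; raising it keeps idle-but-healthy connections open longer.
    # Note this only reduces server-initiated disconnects; it cannot stop a
    # client or middlebox from resetting the connection on its side.
    client_inactivity_timeout => 600
  }
}
```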

Any workarounds for this issue? It is pretty much making my system unusable, because it breaks constantly.

This shouldn't be considered a server-side problem beyond the log noise it generates. In this case it should probably be treated as an expected type of exception and should not produce a noisy stack trace.

It is the client (or potentially some intermediary, such as firewall connection tracking) that is aborting the connection (i.e. not a clean 4-way TCP teardown). Some TCP clients may even deliberately perform a TCP abort (close without lingering) as a way of tearing down a connection when they are done with it, with the intention of preventing a buildup of connections in FIN_WAIT states.
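
To illustrate that abortive close (hypothetical client code, unrelated to anything Beats actually does): in Java, enabling SO_LINGER with a zero timeout makes close() send a RST instead of a FIN, and the server then sees exactly this "Connection reset by peer".

```java
import java.net.Socket;

// Hypothetical demo of an abortive close: SO_LINGER enabled with timeout 0
// makes close() send a TCP RST instead of the normal FIN/ACK teardown, so
// the peer's next read fails with "Connection reset by peer".
public class AbortiveCloseDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 5044)) { // port is arbitrary here
            socket.setSoLinger(true, 0); // linger on, 0s timeout => RST on close
            socket.getOutputStream().write("hello".getBytes());
        } // try-with-resources closes, and thereby aborts, the connection
    }
}
```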

It may be interpreted as a sign that the client is closing the connection [too early?] or is not attempting to keep connections open (which means it constantly has to repeat TCP handshakes, slow-start, and TLS negotiation).