logzio/fluent-plugin-logzio

Spurious bursts of errors connecting to logz.io

edevil opened this issue · 7 comments

Periodically I get bursts of errors from this plugin:

2016-08-03 11:54:19 +0000 [error]: Error connecting to logzio. verify the url https://listener.logz.io:8071?token=...
2016-08-03 11:56:47 +0000 [error]: Error connecting to logzio. verify the url https://listener.logz.io:8071?token=...
2016-08-03 11:58:54 +0000 [error]: Error connecting to logzio. verify the url https://listener.logz.io:8071?token=...
2016-08-03 12:01:12 +0000 [error]: Error connecting to logzio. verify the url https://listener.logz.io:8071?token=...
2016-08-03 12:03:39 +0000 [error]: Error connecting to logzio. verify the url https://listener.logz.io:8071?token=...
2016-08-03 12:15:59 +0000 [error]: Error connecting to logzio. verify the url https://listener.logz.io:8071?token=...
2016-08-03 12:18:16 +0000 [error]: Error connecting to logzio. verify the url https://listener.logz.io:8071?token=...
2016-08-03 12:20:24 +0000 [error]: Error connecting to logzio. verify the url https://listener.logz.io:8071?token=...
2016-08-03 12:22:31 +0000 [error]: Error connecting to logzio. verify the url https://listener.logz.io:8071?token=...
2016-08-03 12:24:38 +0000 [error]: Error connecting to logzio. verify the url https://listener.logz.io:8071?token=...

I see logs on logz.io, so obviously this is not a permanent problem, but it would help if more specific information can be provided on the error. Is it a problem connecting to the remote host? A bad response from the server? A persistent connection that has failed?

Thanks.

Hey @edevil!

This log is printing every time there is an exception while sending logs.
We assumed it can only happen if there is a bad hostname (resulting in getaddrinfo: Name or service not known) - but obviously, we were wrong.

I just pushed another version of the plugin (0.0.5) to RubyGems that prints the exception instead of this not so informative message.

Can you please update your plugin version and comment here again with the exception?

Thanks,
Roi.

2016-08-04 13:29:20 +0000 [debug]: plugin/out_logzio_buffered.rb:44:write: Got flush timeout, containing 409 chunks
2016-08-04 13:29:20 +0000 [debug]: plugin/out_logzio_buffered.rb:54:write: HTTP Response code 200
2016-08-04 13:29:30 +0000 [debug]: plugin/out_logzio_buffered.rb:44:write: Got flush timeout, containing 16 chunks
2016-08-04 13:31:37 +0000 [error]: plugin/out_logzio_buffered.rb:93:rescue in write: Error connecting to logzio. Got exception: too many connection resets (due to Connection timed out - connect(2) for "listener.logz.io" port 8071 - Errno::ETIMEDOUT) after 0 requests on 33641960, last used 1470317497.9291167 seconds ago
2016-08-04 13:31:37 +0000 [debug]: plugin/out_logzio_buffered.rb:44:write: Got flush timeout, containing 182 chunks
2016-08-04 13:31:39 +0000 [debug]: plugin/out_logzio_buffered.rb:54:write: HTTP Response code 200

I should add that this fluentd daemon is running as a daemonset on my Kubernetes cluster hosted on Azure, so there can be some aggressive pruning of idle persistent connections, but Net::HTTP::Persistent is supposed to have a 5s timeout on those...

That seems to be a known issue without a specific solution.
drbrain/net-http-persistent#37
drbrain/net-http-persistent#19

Can you try and change the "flush_interval" parameter?
try to decrease it to 2 seconds, or increase to 30 seconds and see if it's related somehow.

I used tcpdump and I actually see connection attempts (SYN) not being answered from listener.logz.io:8071. Some kind of DoS protection perhaps?

1536 140.689680910    10.2.85.6 ? 52.21.71.179 TCP 76 46088?8071 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=160856420 TSecr=0 WS=128
1537 140.689680910    10.2.85.6 ? 52.21.71.179 TCP 76 [TCP Out-Of-Order] 46088?8071 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=160856420 TSecr=0 WS=128
1538 140.689713610    10.0.3.11 ? 52.21.71.179 TCP 76 46088?8071 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=160856420 TSecr=0 WS=128
1539 141.691624701    10.2.85.6 ? 52.21.71.179 TCP 76 [TCP Retransmission] 46088?8071 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=160857422 TSecr=0 WS=128
1540 141.691624701    10.2.85.6 ? 52.21.71.179 TCP 76 [TCP Retransmission] 46088?8071 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=160857422 TSecr=0 WS=128
1541 141.691672301    10.0.3.11 ? 52.21.71.179 TCP 76 [TCP Retransmission] 46088?8071 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=160857422 TSecr=0 WS=128
1542 143.693602500    10.2.85.6 ? 52.21.71.179 TCP 76 [TCP Retransmission] 46088?8071 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=160859424 TSecr=0 WS=128
1543 143.693602500    10.2.85.6 ? 52.21.71.179 TCP 76 [TCP Retransmission] 46088?8071 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=160859424 TSecr=0 WS=128
1544 143.693632599    10.0.3.11 ? 52.21.71.179 TCP 76 [TCP Retransmission] 46088?8071 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=160859424 TSecr=0 WS=128

@noniperi Can kick in?

This seems to be related to opening connections in a small amount of time. Using a flush_interval of 2s and a constant stream of log traffic I haven't yet encountered problems since it keeps using the same connection. This is because the default value of Net:HTTP:Persistent:idle_timeout 5, so the connection is never considered idle.

Using my previous flush_interval of 10s kept recycling the connection since it was idle for more than "idle_timeout". Is there a way to set this "idle_timeout" via configuration of the plugin?

Version 0.0.6 is live in RubyGems.org

Thanks!