hashicorp/go-metrics

Statsd telemetry doesn't recover from statsd outage

johnrengelman opened this issue · 3 comments

From: hashicorp/vault#1932

It appears that go-metrics doesn't handle a disconnect of the statsd server, particularly if the address changes.

We are running a Telegraf agent with a statsd listener and configuring Vault to send data to a linked container by hostname. When the linked container is restarted (generally getting a new IP address), we stop receiving statsd metrics from Vault.

armon commented

The basic issue is that UDP provides no feedback. If statsd dies or its IP is relocated, we can continue to fire-and-forget packets without ever getting an error. This is unlike statsite, which runs over TCP, so we get an error and redial. The only potential workaround would be to periodically assume the connection is dead and redo the DNS lookup. I'm not sure there is any other robust mechanism given the lack of feedback.

johnrengelman commented

According to this http://serverfault.com/a/416269, if the server side of a UDP socket is disconnected, then there should be an error upon writing (Destination Unreachable), triggered by an ICMP packet.
Testing locally with nc, this is the case: establish a connection, terminate the server, and try writing on the client. Packet sniffing shows the ICMP packet, and the nc client exits.

I'll have to do some testing next week to see if I'm seeing the same behavior between Vault and statsd, or if I'm somehow dropping the ICMP packet on the network.

armon commented

@johnrengelman That's true! But ICMP is not necessarily reliable. It can be disabled or blocked by firewalls, and it is fire-and-forget like UDP, so it can simply be dropped. Error reporting is best-effort, and the UDP protocol makes no guarantee!