DataDog/datadog-go

how to: handling "statsd buffer is full" error?

mhratson opened this issue · 3 comments

How should callers handle the "statsd buffer is full" error?

At the moment a retry loop does the job, but I wonder if I'm missing anything, since the error is private and presumably not supposed to bubble up all the way to the caller.

// Retry on this error until the buffer is flushed.
for i := 3; i > 0; i-- {
	err := d.Client.ServiceCheck(&sc)
	if err == nil {
		break
	}
	// The error is private, so compare strings.
	if err.Error() == "statsd buffer is full" {
		time.Sleep(1 * time.Second)
		continue
	}
	d.Logger.Error(err.Error())
	break
}

Thanks!

Hi @mhratson,

The client will pack multiple messages into a buffer that is then sent to the Agent. That way we send multiple DSD metrics/events/service_checks at once to the Agent.
This error occurs when a buffer is full. The error is then caught internally, which triggers a flush of the current buffer to the sender. The worker then pulls a new buffer from the internal buffer pool and adds the message that didn't fit in the previous buffer to the new, empty one.

Therefore this error should never surface to the user unless your service check is larger than the maximum buffer size.

The maximum size of a buffer is equal to the WithMaxBytesPerPayload value, which defaults to 1432 bytes for UDP and named pipes and 8192 bytes for UDS (see this documentation). If you increase this value, you need to mirror the change on the Agent side by setting dogstatsd_buffer_size in datadog.yaml to an equal or higher value (see this documentation, and be careful about packet fragmentation).

In your use case I would first double-check why your service check is so large (which I think is the main issue). Also, if you're using UDP, try moving to UDS (which also offers better performance).
And lastly, for your main question: there is no reason to wait. With your current configuration the service check will never fit, and the error doesn't mean the DogStatsD client is full.

I agree that the error message can be misleading for users; I'll update it in the next version. Thanks for bringing this up!

> Therefore this error should never surface to the user unless your service check is larger than a maximum buffer size.

Yeah, in which case the user has to handle it, and keeping it unexported doesn't help, as there's no way to compare the error.

While it's not a big problem and the caller can still compare error strings, that is a fragile approach that I think is still worth noting and improving.
The same goes for documenting the size limits in ServiceCheck:

// A ServiceCheck is an object that contains status of DataDog service check.
type ServiceCheck struct {
	// Name of the service check. Required.
	Name string
	// Status of service check. Required.
	Status ServiceCheckStatus
	// Timestamp is a timestamp for the serviceCheck. If not provided, the dogstatsd
	// server will set this to the current time.
	Timestamp time.Time
	// Hostname for the serviceCheck.
	Hostname string
	// A message describing the current state of the serviceCheck.
	Message string
	// Tags for the serviceCheck.
	Tags []string
}

Thanks!

I opened a PR regarding the error wording: #252