stripe/veneur

Veneur forwarding to Datadog - avg higher than max, avg missing data

aniaptebsft opened this issue · 0 comments

Hello

We are using Veneur v14.2.0 to forward metrics to Datadog. Our topology looks like this:

Application sends statsd metrics -> veneur agent running on ECS instances -> veneur proxy -> veneur global -> Datadog sink

We use the Go Datadog statsd library (github.com/DataDog/datadog-go/statsd) to send stats. For the particular problem metric seen in the screenshot we use the Statsd.Histogram(...) API to send the timing data.
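
For reference, here is a minimal sketch of how we call it (the metric name, tag, and address below are illustrative placeholders, not our production code):

package main

import (
    "log"
    "time"

    "github.com/DataDog/datadog-go/statsd"
)

func main() {
    // Connect to the local veneur agent (statsd_listen_addresses: udp://0.0.0.0:8125).
    client, err := statsd.New("127.0.0.1:8125")
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    start := time.Now()
    // ... the work being timed happens here ...

    // Record the elapsed time as a histogram sample at sample rate 1.
    elapsed := float64(time.Since(start).Milliseconds())
    if err := client.Histogram("request.duration_ms", elapsed, []string{"service:example"}, 1); err != nil {
        log.Printf("failed to send histogram: %v", err)
    }
}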

We are seeing weird issues with avg for certain metrics: avg is higher than both max and p95, and the avg data point is also missing for significant periods. Attaching a screenshot that shows this:
[screenshot: avg line above max and p95, with gaps in the avg series]

Our configuration files are as below:

Veneur client

startup command: ./local/veneurclient -f local/veneurclient.yaml
env: none
veneurclient.yaml config file:

---
# == COLLECTION ==

statsd_listen_addresses:
 - udp://0.0.0.0:8125

# == BEHAVIOR ==

forward_address: "http://veneur-proxy.service.consul:18127"
interval: "10s"
stats_address: "localhost:8125"
http_address: "0.0.0.0:8127"

# == METRICS CONFIGURATION ==

# Defaults to the os.Hostname()!
hostname: ""
# If true and hostname is "" or absent, don't add the host tag
omit_empty_hostname: true

tags: ["service:veneur-local"]

# Set to floating point values that you'd like to output percentiles for from
# histograms.
percentiles: [0.95]
aggregates: ["max","avg","count"]

# == PERFORMANCE ==
num_workers: 6
num_readers: 3

# == LIMITS ==

# How big of a buffer to allocate for incoming metrics. Metrics longer than this
# will be truncated!
metric_max_length: 131072

# How big of a buffer to allocate for incoming traces.
trace_max_length_bytes: 16384

# The size of the buffer we'll use to buffer socket reads. Tune this if you
# think Veneur needs more room to keep up with all packets.
read_buffer_size_bytes: 26214400

# == DIAGNOSTICS ==

# Enables Go profiling
enable_profiling: true

# == SINKS ==
# == Datadog ==
# Datadog can be a sink for metrics, events, service checks and trace spans.

# Hostname to send Datadog data to.
datadog_api_hostname: "https://app.datadoghq.com"

# API key for accessing Datadog
datadog_api_key: blahyidda

# How many metrics to include in the body of each POST to Datadog. Veneur
# will post multiple times in parallel if the limit is exceeded.
flush_max_per_body: 25000

Veneur proxy

startup command: local/veneur-proxy -f local/veneur-proxy.yaml
env:

        VENEUR_PROXY_FORWARDADDRESS           = "http://veneur-global.service.consul:18127"
        VENEUR_PROXY_HTTPADDRESS              = "0.0.0.0:18127"
        VENEUR_PROXY_STATSADDRESS             = "127.0.0.1:18125"
        VENEUR_PROXY_STATSDLISTENADDRESSES    = "udp://0.0.0.0:18125"
        VENEUR_PROXY_TAGS                     = "service:veneur-proxy"

veneur-proxy.yaml config file:

---
debug: true
enable_profiling: false
http_address: "0.0.0.0:18127"

# How often to refresh from Consul's healthy nodes
consul_refresh_interval: "30s"

# This field is deprecated - use ssf_destination_address instead!
stats_address: "localhost:18125"

# The address to which to send SSF spans and metrics - this is the
# same format as on the veneur server's `ssf_listen_addresses`.
ssf_destination_address: "udp://localhost:8126"

### FORWARDING
# Use a static host for forwarding
forward_address: "http://veneur.example.com"
# Or use a consul service for consistent forwarding.
consul_forward_service_name: "forwardServiceName"

# Maximum time that forwarding each batch of metrics can take;
# note that forwarding to multiple global veneur servers happens in
# parallel, so every forwarding operation is expected to complete
# within this time.
forward_timeout: 10s

### TRACING
# The address on which we will listen for trace data
trace_address: "127.0.0.1:18128"
# Use a static host to send traces to
trace_api_address: "http://localhost:7777"
# Or use a consul service for sending all spans belonging to the same parent
# trace to a consistent host
consul_trace_service_name: "traceServiceName"

sentry_dsn: ""

Veneur global

startup command: local/veneur -f local/veneur-global.yaml
env:

        VENEUR_AGGREGATES             = "max,avg,count"
        VENEUR_DATADOGAPIHOSTNAME     = "https://app.datadoghq.com"
        VENEUR_DATADOGAPIKEY          = "blahyidda"
        VENEUR_DATADOGTRACEAPIADDRESS = ""
        VENEUR_HOSTNAME               = ""
        VENEUR_HTTPADDRESS            = "0.0.0.0:18127"
        VENEUR_NUMWORKERS             = "3"
        VENEUR_OMITEMPTYHOSTNAME      = "false"
        VENEUR_PERCENTILES            = "0.95"
        VENEUR_STATSDLISTENADDRESSES  = "udp://0.0.0.0:18125"
        VENEUR_TAGS  

veneur-global.yaml config file:

---
# == COLLECTION ==

# The addresses on which to listen for statsd metrics. These are
# formatted as URLs, with schemes corresponding to valid "network"
# arguments on https://golang.org/pkg/net/#Listen. Currently, only udp
# and tcp (including IPv4 and 6-only) schemes are supported.
# This option supersedes the "udp_address" and "tcp_address" options.
statsd_listen_addresses:
 - udp://localhost:18126
 - tcp://localhost:18126

# The addresses on which to listen for SSF data. As with
# statsd_listen_addresses, these are formatted as URLs, with schemes
# corresponding to valid "network" arguments on
# https://golang.org/pkg/net/#Listen. Currently, only UDP and Unix
# domain sockets are supported.
# Note: SSF sockets are required to ingest trace data.
# This option supersedes the "ssf_address" option.
ssf_listen_addresses:
  - udp://localhost:18128
  - unix:///tmp/veneur-ssf.sock

# TLS
# These are only useful in conjunction with TCP listening sockets

# TLS server private key and certificate for encryption (specify both)
# These are the key/certificate contents, not a file path
tls_key: ""
tls_certificate: ""

# Authority certificate: requires clients to be authenticated
tls_authority_certificate: ""



# == BEHAVIOR ==

# Use a static host for forwarding
#forward_address: "http://veneur.example.com"
forward_address: ""

# How often to flush. When flushing to Datadog, changing this
# value when you've already emitted metrics will break your time
# series data.
interval: "10s"

# Veneur can "synchronize" its flushes with the system clock, flushing at even
# intervals i.e. 0, 10, 20… to align with the `interval`. This is disabled by
# default for now, as it can cause thundering herds in large installations.
synchronize_with_interval: false

# Veneur emits its own metrics; this configures where we send them. It's ok
# to point veneur at itself for metrics consumption!
stats_address: "localhost:18126"

# The address on which to listen for HTTP imports and/or healthchecks.
# http_address: "einhorn@0"
http_address: "0.0.0.0:18127"

# The name of timer metrics that "indicator" spans should be tracked
# under. If this is unset, veneur doesn't report an additional timer
# metric for indicator spans.
indicator_span_timer_name: "indicator_span.duration_ms"

# == METRICS CONFIGURATION ==

# Defaults to the os.Hostname()!
hostname: ""

# If true and hostname is "" or absent, don't add the host tag
omit_empty_hostname: false

# Tags supplied here will be added to all metrics ingested by this instance.
# Example:
# tags:
#  - "foo:bar"
#  - "baz:quz"
tags:
  - ""

# Set to floating point values that you'd like to output percentiles for from
# histograms.
percentiles:
  - 0.5
  - 0.75
  - 0.99

# Aggregations you'd like to output for histograms. Possible values can be any
# or all of:
# - `min`: the minimum value in the histogram during the flush period
# - `max`: the maximum value in the histogram during the flush period
# - `median`: the median value in the histogram during the flush period
# - `avg`: the average value in the histogram during the flush period
# - `count`: the number of values added to the histogram during the flush period
# - `sum`: the sum of all values added to the histogram during the flush period
# - `hmean`: the harmonic mean of all the values added to the histogram during the flush period
aggregates:
 - "min"
 - "max"
 - "count"



# == PERFORMANCE ==

# Adjusts the number of workers Veneur will distribute aggregation across.
# More decreases contention but has diminishing returns.
num_workers: 96

# Numbers larger than 1 will enable the use of SO_REUSEPORT, make sure
# this is supported on your platform!
num_readers: 1



# == LIMITS ==

# How big of a buffer to allocate for incoming metrics. Metrics longer than this
# will be truncated!
metric_max_length: 4096

# How big of a buffer to allocate for incoming traces.
trace_max_length_bytes: 16384

# The number of SSF packets that can be processed
# per flush interval
ssf_buffer_size: 16384

# The size of the buffer we'll use to buffer socket reads. Tune this if you
# think Veneur needs more room to keep up with all packets.
read_buffer_size_bytes: 2097152

# How many metrics to include in the body of each POST to Datadog. Veneur
# will post multiple times in parallel if the limit is exceeded.
flush_max_per_body: 25000



# == DIAGNOSTICS ==

# Sets the log level to DEBUG
debug: true

# Providing a Sentry DSN here will send internal exceptions to Sentry
sentry_dsn: ""

# Enables Go profiling
enable_profiling: false



# == SINKS ==

# == Datadog ==
# Datadog can be a sink for metrics, events, service checks and trace spans.

# Hostname to send Datadog data to.
datadog_api_hostname: https://app.datadoghq.com

# API key for accessing Datadog
datadog_api_key: "farts"

# Hostname to send Datadog trace data to.
datadog_trace_api_address: ""

# == SignalFx ==
# SignalFx can be a sink for metrics.
signalfx_api_key: ""

# Where to send metrics
signalfx_endpoint_base: "https://ingest.signalfx.com"

# The tag we'll add to each metric that contains the hostname we came from
signalfx_hostname_tag: "host"

# == LightStep ==
# LightStep can be a sink for trace spans.

# If present, lightstep will be enabled as a tracing sink
# and this access token will be used
# Access token for accessing LightStep
trace_lightstep_access_token: ""

# Host to send trace data to
trace_lightstep_collector_host: ""

# How often LightStep should reconnect to collectors. If your workload is
# imbalanced — some veneur instances see more spans than others — then you may
# want to reconnect more often.
trace_lightstep_reconnect_period: "5m"

# The LightStep client has internal throttling to prevent you overwhelming
# things. Anything that exceeds this many spans in the reporting period
# — which is a minimum of 500ms and maximum 2.5s at the time of this writing
# — will be dropped. In other words, you can only submit this many spans per
# flush! If left at zero, veneur will set the maximum to the size of
# `ssf_buffer_size`.
trace_lightstep_maximum_spans: 0

# Multiple clients can be used to load-balance spans across multiple collectors,
# improving span indexing success rates.
# If missing (or set to zero), it will default
# to a minimum of one client
trace_lightstep_num_clients: 1

# == Kafka ==

# Comma-delimited list of brokers suitable for Sarama's [NewAsyncProducer](https://godoc.org/github.com/Shopify/sarama#NewAsyncProducer)
# in the form hostname:port, such as localhost:9092
kafka_broker: ""

# Name of the topic we'll be publishing checks to
kafka_check_topic: "veneur_checks"

# Name of the topic we'll be publishing events to
kafka_event_topic: "veneur_events"

# Name of the topic we'll be publishing metrics to
kafka_metric_topic: ""

# Name of the topic we'll be publishing spans to
kafka_span_topic: "veneur_spans"

# Name of a tag to hash on for sampling; if empty, spans are sampled based off
# of traceID
kafka_span_sample_tag: ""

# Sample rate in percent (as an integer)
# This should ideally be a floating point number, but at the time this was
# written, gojson interpreted whole-number floats in yaml as integers.
kafka_span_sample_rate_percent: 100

kafka_metric_buffer_bytes: 0

kafka_metric_buffer_messages: 0

#kafka_metric_buffer_frequency: ""

kafka_span_serialization_format: "protobuf"

# The type of partitioner to use.
kafka_partitioner: "hash"

# What type of acks to require for metrics? One of none, local or all.
kafka_metric_require_acks: "all"

# What type of acks to require for spans? One of none, local or all.
kafka_span_require_acks: "all"

kafka_span_buffer_bytes: 0

kafka_span_buffer_mesages: 0

#kafka_span_buffer_frequency: ""

# The number of retries before giving up.
kafka_retry_max: 0

# == PLUGINS ==

# == S3 Output ==
# Include these if you want to archive data to S3
aws_access_key_id: ""
aws_secret_access_key: ""
aws_region: ""
aws_s3_bucket: ""

# == LocalFile Output ==
# Include this if you want to archive data to a local file (which should then be rotated/cleaned)
flush_file: ""