beam-telemetry/telemetry_metrics_statsd

Implicit behavior for Counter metric

connorjacobsen opened this issue · 7 comments

I recently noticed that the reporter throws away the measurement field of Counter metrics. This can lead to some surprising outcomes.

As an illustration, consider the following metric definitions:

# Metrics to be passed to the TelemetryMetricsStatsd reporter.
metrics = [
  Telemetry.Metrics.counter([:foo, :bar, :baz], tags: [:a, :b]),
  Telemetry.Metrics.counter([:foo, :bar, :bat], tags: [:b, :c]),
]

Based on my understanding of the current implementation of handle_event/4, the following call would publish both of the above metrics (one of them with incorrect tags, which I think may actually cause the handler to detach):

:telemetry.execute([:foo, :bar], %{baz: 1}, %{a: 1, b: 2})

This can be avoided if you explicitly pass the event_name option, or if you ensure that the default event_name of each metric is unique (i.e. using the .count suffix on everything would do the trick).
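For what it's worth, here is a rough sketch of those two workarounds, assuming the standard Telemetry.Metrics options (I haven't verified these against the reporter itself):

# Option 1: keep the original names but pin each counter to its own event.
metrics = [
  Telemetry.Metrics.counter([:foo, :bar, :baz], event_name: [:foo, :bar, :baz], tags: [:a, :b]),
  Telemetry.Metrics.counter([:foo, :bar, :bat], event_name: [:foo, :bar, :bat], tags: [:b, :c]),
]

# Option 2: suffix every counter with .count so the default event names differ.
metrics = [
  Telemetry.Metrics.counter("foo.bar.baz.count", tags: [:a, :b]),  # event_name: [:foo, :bar, :baz]
  Telemetry.Metrics.counter("foo.bar.bat.count", tags: [:b, :c]),  # event_name: [:foo, :bar, :bat]
]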

I am struggling to quickly reproduce this in the tests for this project (without explicitly specifying event name), but in that case it's simply not publishing anything so it may be a separate problem. I can revisit and try to break things properly over the weekend.

But the offending code seems to be: https://github.com/arkgil/telemetry_metrics_statsd/blob/0f79e2af612caa834ac522da2f6209cd930f5967/lib/telemetry_metrics_statsd/event_handler.ex#L88-L91

The counter branch here loops through every metric registered for that event name and publishes a payload without checking that the metric's measurement is actually present in the fired event. Contrast this with the Telemetry.Metrics.ConsoleReporter module, whose fetch_measurement/2 equivalent returns nil and skips publishing when the event_name matches but the measurement is missing: https://github.com/beam-telemetry/telemetry_metrics/blob/v0.3.0/lib/telemetry_metrics/console_reporter.ex#L98
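To make the comparison concrete, this is roughly the shape of the check I mean (function and variable names are illustrative, not the exact code in either library):

# Look up the metric's declared measurement in the event's measurements map;
# the handler can then skip the metric when this returns nil.
defp extract_measurement(metric, measurements) do
  case metric.measurement do
    fun when is_function(fun, 1) -> fun.(measurements)
    key -> measurements[key]
  end
end

# ...and in the handler, something like:
#   if extract_measurement(metric, measurements) != nil, do: publish(metric, ...)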

This behavior can be mitigated by suffixing all counter metrics with .count, but that is not explicitly mentioned in the documentation (that I have been able to find).

If there's something going on that I'm just completely missing, I would be very happy to be wrong here 😄

PS. If you know how to get tests to work without explicitly specifying the event_name option to given_counter, it would help me reproduce this with a test faster. My initial discovery was from a test reporter module I wrote by copy-pasting the bones of this reporter. So while I am not certain this bug exists, I'm very suspicious.

@connorjacobsen You are right, two metrics will be pushed when that event is fired, but this is intentional. The library explicitly says that for counter metrics, the measurement is discarded (somewhere here under the "Counter" section). In your example, two metrics are registered, and both are published when the corresponding event is emitted.

What is the issue that you're facing? Do you think that this behaviour should be documented better? I understand that it might be confusing.

I think the tricky thing is that the example in the Telemetry.Metrics docs appends .count to the metric name, so people sometimes think they just need distinct names for things to work. They'll define metrics like http.client.requests and http.client.connections and expect them to be treated as separate metrics.

Discarding the measurement makes sense, but initially it felt like it meant "it doesn't matter what value you use, it will be replaced with 1". I had to dig into the code to realize it throws away the measurement name too.
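If I understand the current behaviour correctly, both of these calls would increment both counters from my example by 1, even though the first only names the :baz measurement and the second names no measurement at all:

:telemetry.execute([:foo, :bar], %{baz: 5}, %{a: 1, b: 2, c: 3})
:telemetry.execute([:foo, :bar], %{}, %{a: 1, b: 2, c: 3})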

Does that make sense?

Yes, I see how that might be a problem. @bryannaegele does this issue also apply to the Prometheus reporter? I'm curious what your opinion is.

That's a good question.

Because of the way Prometheus works, I keep references to everything; they're all registered in ETS in some fashion at some point and permanently tracked for aggregation, so the names must be unique. If you attempt to register two metrics with the same name, the first one wins and the rest are rejected with an error. Registering two metric definitions with different names listening on the same event is perfectly fine and should be handled correctly.

I don't know the behavior of trying that across metric types. I'll have to test that, but that should be possible.

For ref, the Prometheus reporter discards the measurement and increments by 1 in the case of a counter definition. In the case of a sum definition, it does take the value of the specified measurement and adds it. Under the hood, they're both considered a counter in Prometheus.
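As a rough illustration of that split (made-up metric names, default event names and measurements per Telemetry.Metrics):

# Both listen on the [:http, :request] event; the counter adds 1 per event,
# while the sum adds measurements[:duration] for each event.
Telemetry.Metrics.counter("http.request.count")
Telemetry.Metrics.sum("http.request.duration")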

@arkgil One thing I do slightly differently is in the registration of the handler id. In the example given by @connorjacobsen, the handler id based on your implementation would be {TelemetryMetricsStatsd.EventHandler, #PID<0.621.0>, [:foo, :bar]} for both of the definitions. Since a handler id must be unique, the second definition should never get attached (already exists).

@spec handler_id(:telemetry.event_name(), reporter :: pid) :: :telemetry.handler_id()
defp handler_id(event_name, reporter) do
  {__MODULE__, reporter, event_name}
end

In my implementation, I use the metric definition's full name, as provided, rather than the parsed event name.

https://github.com/beam-telemetry/telemetry_metrics_prometheus_core/blob/master/lib/core/event_handler.ex#L17-L21

https://github.com/beam-telemetry/telemetry_metrics_prometheus_core/blob/master/lib/core/counter.ex#L17-L21
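In the shape of the StatsD helper quoted above, the idea is roughly this (a sketch, not the actual code from either repo):

# Key the handler id on the metric's full name instead of the event name,
# so two definitions listening on the same event never collide.
defp handler_id(metric_name, reporter) do
  {__MODULE__, reporter, metric_name}
end

# attached per metric definition, e.g.:
#   :telemetry.attach(handler_id(metric.name, self()), metric.event_name, &handle_event/4, config)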

I'm pretty tired, so I might be missing something, but changing the handler id generation should prevent this from occurring.

@bryannaegele in the StatsD case, I group metrics by event name, so there is always a single handler for each unique event under the reporter.
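Roughly like this (a sketch of the idea, not the exact code in the repo; reporter is the reporter pid, as in handler_id/2 above):

# One handler per distinct event name; that handler carries every metric
# definition sharing the event.
metrics
|> Enum.group_by(& &1.event_name)
|> Enum.each(fn {event_name, metrics_for_event} ->
  :telemetry.attach(handler_id(event_name, reporter), event_name, &handle_event/4,
    %{metrics: metrics_for_event, reporter: reporter})
end)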

But thank you, it looks like Prometheus follows the same approach, i.e. the counter is updated whether or not the measurement is present.

I'm wondering what can be done to make it less confusing - it is mentioned in the docs, so maybe they could use some rewording. What do you think @connorjacobsen ?

Some rewording of the docs would probably be sufficient for now?