beam-telemetry/telemetry_metrics

Allow counter to support incrementing by measurement value

Shayon opened this issue · 21 comments

In many cases we will want to build a metric for a measurement that is meant to count batches of items, rather than a single item at a time. We should modify the counter function to support this case. I suggest we add a new key to the Counter struct to denote weather we want to increment by 1 or by the value of the specified measurement. I plan to push up a PR along these lines for further discussion.

@Shayon this change will really depend if the different reporters, such as statsd, prometheus, etc can handle those semantics. So please do some research on the general support for this feature in reporters before going ahead with the PR to avoid wasted work. :)

@bryannaegele that makes more sense, as I didn't see anything calling it out specifically (as compared to the documentation for telemetry_metrics_statsd), but I was not certain if the hard coded 1 in the counter here was limiting it to 1 as well

EDIT: we (@Shayon and myself) were planning on adding it as an option to preserve existing functionality

@josevalim good point! I'm looking at this with the specific aim of adding the capability here so that I may then add the capability to the telemetry_metrics_statsd reporter. I know that both Prometheus and Statsd support the capability to increment by more than one, and it is my understanding that the statsd reporter currently ignore the measurement and hard code an increment of 1.
https://github.com/beam-telemetry/telemetry_metrics_statsd/blob/v0.3.0/lib/telemetry_metrics_statsd/formatter/datadog.ex#L26
https://github.com/beam-telemetry/telemetry_metrics_statsd/blob/v0.3.0/lib/telemetry_metrics_statsd/formatter/standard.ex#L38

EDIT: we (@Shayon and myself) were planning on adding it as an option to preserve existing functionality

Yeah, I don't think there's anything to change in this library, it's just a matter of the reporters honoring the user's definitions. The only tricky thing is we skip events when the measurement is missing and in the base case you'd probably want increment by value if the measurement exists, else increment by 1. This is contextual to counters though, so it requires a little more logic.

The tricky thing for the Prometheus reporter is on how to support floats because it requires a very inefficient implementation to support.

If you're mostly focused on getting this in the statsd reporter, you can take a stab at a PR on that reporter.

/cc @arkgil

That is fair, I was attempting to give reporters a means of backwards compatibility. I was thinking that without an option to denote the desired counter behavior it would be more difficult for reporters to support this type of change. But I suppose this could be handling by introducing breaking changes into each reporter individually as they choose to support this type of counter.

While implementing this in the statsd reporter I realized that this is still a breaking change due to the modification of the Counter struct. What would be prefered? Move forward with version bumps?

EDIT: This isn't a breaking change but it does require releasing a new version of telemetry_metrics and updating the version requirement on telemetry_metrics_statsd

The other tricky thing to watch for is the shorthand string automatically translates the last part to a measurement key which may exist but they still want to only increment by 1. If they're lining that name up to an actual measurement, I think they'll have to just figure that out. :/

Yeah, that's one of the reasons that I wanted to add an option that would allow the existing behavior to continue unharmed.

Right. I think it has to be opt-in. Today someone may be counting http.request.status and now they will see their counter increment unreasonably. My suggestion would be to call it increment_by: :foo, where :foo is a measurement key or a fn measurements -> value end function.

Thoughts?

So that would mean that given :telemetry.execute([:kafka, :batch_size], %{count: 100}) you would subscribe with counter("kafka.batch_size.count", increment_by: :count) or counter("kafka.batch_size", increment_by: :count)?

For comparison, the current PR supports it as counter("kafka.batch_size.count", count_by_measurement?: true)

I don't see a problem with the increment_by approach, but it would require retooling how the measurement is pulled out of the Counter measurements, which I think would be pretty straight forward. @Shayon and I could probably tackle that Monday morning, and a corresponding PR up on beam-telemetry/telemetry_metrics_statsd

@josevalim My only counterpoint to your suggestion is that metric options already contains the measurement key. And in the case of increment_by, the measurement key would be oddly unused. This would allow for weird cases where measurement may be :foo and increment_by may be :bar. Overall I think this is just a minor nit in terms of the understandability of the interface, so I'm really happy to go in any direction you all think is appropriate.

@Shayon's point is my biggest concern. I feel like this would introduce a really weird convention when we already have the user specifying the measurement key. It's confusing to say, "The measurement key provided will be used by the reporter when recording the event, unless it's a counter, in which case you also have to set this other option."

We have two issues we're trying to dance around. The first is that a counter only ever increments by an integer of 1 because that's how the reporters were implemented. The second is the shorthand "name string -> event + measurement" which in this case is then automatically specifying a measurement key on something that was just being ignored. The second probably influenced the first in some regard.

I'd rather us make breaking changes in the reporters than to go this route. We're going to have to implement support for this in those either way.

Those are fair points and yeah, that could be confusing to users.

I think if we went with the approach in #69, then it covers both of those concerns.

It gives a way for users to specify if they want the counting by 1 behavior or not, which still feels like a valid choice, as some events are still useful as an increment by 1 counter. Examples of that would be the http.request.status example used in the docs, or counting the number of batches in a scenario like Kafka batch consumption. Other scenarios still would benefit from being able to specify the measurement to report in the counter, such as the batch_size in order to track how many messages are being consumed in total. That means this could be a desirable metric setup:

counter("kafka.consume.batch_size"), # count number of batches
counter("kafka.consume", 
  measurement: :batch_size, 
  count_by_measurement?: true,
  event_name: "kafka.consume.total_messages") # count number of messages

Regarding the potential confusion on the measurement name, using a flag such as count_by_measurement? allows the existing Telemetry.Metrics measurement resolution to be unchanged and prevents adding another source of confusion for it.

Let me throw out one other question to make sure all the bases are covered - why not use Sum in this case? I'm not sure how it's reported in the StatsD reporter, but it's a counter in the Prometheus reporter.

That's a reasonable thought, although in the StatsD reporter Sum is used as the gauge delta. Allowing an implementer to reduce or increase their gauge based on the measurement value. And I don't believe changing that behavior is desirable. Although we can ask @arkgil for his 2 cents.

Good points. I agree that we should not worry about backwards compatibility at this point.

My concern is that not all events will have a count-able measurement. For example, the http request event may not have anything that you can use as a counter. Nor Ecto queries and so on. So we need an easy option to say "this should be bumped by one". It doesn't have to be the default, like it is today, but it should be definitely considered as part of the design.

We could flip the conditional, and use something like increment_per_event to specify that the user wants the existing behavior. As it is currently in #69, counter("http.request.status") would increment by 1

Thank you for the proposal, these are all valid points.

However, it seems like the current sum metric covers what's proposed here, i.e. bumping a metric by a measurement value. Is the fact that it's reported as a gauge by the StatsD reporter the only reason why you would rather not use it? If yes, then I would prefer to leave the current behaviour of Telemetry.Metrics and push the extra complexity to the reporter.

Let me know what you think!

We also have reporter_options, so you can tell Statsd to use sums as something else.

Yes, I was thinking Sum was more for gauges and focused on Counter more since it was already near the behavior I wanted. An example of tunnel vision 😅. I think it makes sense to keep the complexity in the reporter 👍 We'll be creating an issue there. Thanks all!