beam-telemetry/telemetry_metrics

Make metrics filterable based upon metadata

Closed this issue · 26 comments

The "prefix or not to prefix" issue seems to boil down to the inability to allow users to conditionally handle an event based upon a predicate other than the event name. The proposal would be to add a filter predicate option which would be used by reporters to conditionally handle the event.

This should allow for simpler APIs for library authors going forward and help some tracing vendors with their needs by discouraging the use of required prefixes or auto-generated prefixes. User-supplied prefixes should still be accepted for power users with tight performance requirements where auto-instrumentation by vendors is less of a concern.

Example

I am read/write splitting to my DB, which requires separate Ecto Repos. The latency requirements for my writes are quite different from those for my reads, resulting in different bucket sizes for each.

Currently, I must utilize event prefixes provided by Ecto to distinguish between these repos.

distribution("ecto.writer.query.duration.ms",
  buckets: [5, 10, 20, 50, 100, 200, 500, 1000, 1500, 2000, 5000, 10000],
  event_name: [:writer, :repo, :query],
  measurement: :query_time,
  unit: {:native, :millisecond},
  description: "Ecto query duration - Time spent executing the query",
  tags: [:repo]
),
distribution("ecto.reader.query.duration.ms",
  buckets: [5, 10, 15, 25, 35, 50, 100, 200],
  event_name: [:reader, :repo, :query],
  measurement: :query_time,
  unit: {:native, :millisecond},
  description: "Ecto query duration - Time spent executing the query",
  tags: [:repo]
)

If I could instead conditionally handle an event based upon the metadata, the need for a prefix could be removed in most use cases.

distribution("ecto.writer.query.duration.ms",
  buckets: [5, 10, 20, 50, 100, 200, 500, 1000, 1500, 2000, 5000, 10000],
  event_name: [:repo, :query],
  measurement: :query_time,
  unit: {:native, :millisecond},
  filter: fn meta ->
    match?(%{repo: MyApp.Repo}, meta)
  end,
  description: "Ecto query duration - Time spent executing the query",
  tags: [:repo]
),
distribution("ecto.reader.query.duration.ms",
  buckets: [5, 10, 15, 25, 35, 50, 100, 200],
  event_name: [:repo, :query],
  measurement: :query_time,
  unit: {:native, :millisecond},
  filter: &match?(%{repo: MyApp.ReadOnlyRepo}, &1),
  description: "Ecto query duration - Time spent executing the query",
  tags: [:repo]
)

Reporters should honor the predicate in their handlers when the filter is a function. To maintain the highest performance, reporters could add a second event handler which processes metrics defined with a predicate to remove an additional function call, e.g. handle_event/4 and handle_filtered_event/4.
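To illustrate, a rough sketch of that split inside a reporter; attach/2, record/3, and the filter field on the metric are assumptions from this proposal, not existing APIs:

def attach(event_name, metrics) do
  # Metrics with a filter predicate go to a dedicated handler so the
  # common path pays no per-call filter check.
  {filtered, plain} = Enum.split_with(metrics, &is_function(&1.filter, 1))

  :telemetry.attach({__MODULE__, event_name, :plain}, event_name,
    &__MODULE__.handle_event/4, plain)

  :telemetry.attach({__MODULE__, event_name, :filtered}, event_name,
    &__MODULE__.handle_filtered_event/4, filtered)
end

def handle_event(_event_name, measurements, metadata, metrics) do
  Enum.each(metrics, &record(&1, measurements, metadata))
end

def handle_filtered_event(_event_name, measurements, metadata, metrics) do
  # Only record a metric when its predicate accepts the metadata.
  for metric <- metrics, metric.filter.(metadata) do
    record(metric, measurements, metadata)
  end
end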

/cc @binaryseed @josevalim @arkgil @GregMefford @tsloughter @hauleth

I like this idea. It gives enough flexibility to support both same or different prefix cases. Also very useful to report only subset of events as metrics.


To maintain the highest performance, reporters could add a second event handler which processes metrics defined with a predicate to remove an additional function call, e.g. handle_event/4 and handle_filtered_event/4.

@bryannaegele I'm not sure I understand this part - can you explain?

Simple question: is it possible to update a metric's filter after creation? Maybe something like that would be feasible in some cases. If so, then maybe filter should be a list? Maybe we should take a look at how logger did it (a minimal sketch follows the list):

  • filters is a list of {filter_fun(), Opts :: term()}
  • there is a default_action on each handler, either log or stop, which is the action taken if none of the filters decides otherwise
  • filter_fun() takes 2 arguments - event and Opts
  • filter_fun() can return:
    • stop - to prevent the event from being logged
    • ignore - this filter doesn't do anything and further filters should decide what to do
    • the event itself (possibly modified) - the event that will be sent to the handler
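For reference, a minimal sketch of that logger filter contract in Elixir; the filter id and the predicate are made up for illustration (per-handler filters use :logger.add_handler_filter/3 with the same contract):

# Stop debug events; let other filters decide the rest.
:logger.add_primary_filter(:drop_debug, {
  fn
    %{level: :debug}, _opts -> :stop
    _event, _opts -> :ignore
  end,
  []
})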

To maintain the highest performance, reporters could add a second event handler which processes metrics defined with a predicate to remove an additional function call, e.g. handle_event/4 and handle_filtered_event/4.

@bryannaegele I'm not sure I understand this part - can you explain?

Sure. I had two concerns about that.

With one handler, the handler would need to check whether there is a predicate (or provide a default), then execute it on every call. This isn't a huge hit, but it's not zero. Moreover, user-supplied predicates will almost always be anonymous functions, which do have a measurable performance cost. This pattern would alleviate those per-call checks by segregating the known filterable metrics to another handler.

I suppose you could provide a default handler in your reporter that is a function capture rather than an anonymous function.
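For instance, a reporter could default the predicate to a capture of a named function instead of building an anonymous one (default_filter/1 here is hypothetical):

# Always-true default, as a capture of a named function.
def default_filter(_metadata), do: true

filter = metric.filter || (&__MODULE__.default_filter/1)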

Is that possible to update metric filter after creation?

I suppose you could if you used an MFA and your function was somehow dynamic. That doesn't sound like something you'd ever do in practice. What would be the use case?

As for filtering similarly to logger, we're only concerned with the question "should I record this event with this definition?", which is a simple predicate rather than anything that modifies the event. Modification of metadata is already available via tag_values/1.
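For instance, deriving a :table tag from the Ecto query metadata (the tag choice here is just for illustration):

counter("ecto.repo.query.count",
  event_name: [:repo, :query],
  tag_values: &Map.put(&1, :table, &1.source),
  tags: [:repo, :table]
)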

That does raise a good question though of whether or not to apply tag_values/1 before filter/1. 🤔

@bryannaegele thanks for the clarification 👍

Is it implied here that libraries move away from configurable event name? It seemed like in other discussions we're already past the point where we can change that...

That does raise a good question though of whether or not to apply tag_values/1 before filter/1. 🤔

Squash the two into one. Like logger does. This will solve the ordering problem.

Is it implied here that libraries move away from configurable event name? It seemed like in other discussions we're already past the point where we can change that...

We could definitely discourage requiring a custom prefix. That would include the major libraries you're probably concerned with. I could possibly be swayed, but I still think power users should be able to prefix in complex cases for maximum performance.

@bryannaegele is there a particular reason why filtering should be based on metadata only? I imagine measurements could be useful as well, e.g. for counting database queries which take above x milliseconds (although one could argue that that's the goal of distribution/2 😄 ).

@arkgil I couldn't think of a situation where you'd do that with metrics since, as you mentioned, this is more of a function of your end tooling.

That is a handy technique, however. I use that for slow query logging in our services. Probably don't need this if you have tracing.

def handle_event([_repo, :repo, :query], measures, meta, _config) do
  threshold = Config.query_time_warn_threshold()
  total_time_ms = extract_time(measures, :total_time)

  if total_time_ms > threshold do
    query_time_ms = extract_time(measures, :query_time)
    decode_time_ms = extract_time(measures, :decode_time)
    queue_time_ms = extract_time(measures, :queue_time)
    idle_time_ms = extract_time(measures, :idle_time)

    query_data = %{
      title: "Slow Query",
      query: meta.query,
      params: Enum.map(meta.params, &inspect/1),
      total_time: "#{total_time_ms}ms",
      query_time: "#{query_time_ms}ms",
      decode_time: "#{decode_time_ms}ms",
      queue_time: "#{queue_time_ms}ms",
      idle_time: "#{idle_time_ms}ms",
      data_source: meta.source
    }

    _ =
      query_data
      |> Jason.encode!()
      |> Logger.warn()
  end

  :ok
end

Right, that makes sense. I'd be for implementing this in Telemetry.Metrics, looks like a useful and generic (as in it can be used across reporters) feature 👍

That does raise a good question though of whether or not to apply tag_values/1 before filter/1. 🤔

I think we should filter first. Filter is for events, tag_values produces tags which are part of a metric derived from that event.
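Sketched inside a reporter, with the proposed filter assumed as a field on the metric struct, that ordering would be:

# Run the filter on the raw event metadata first...
if metric.filter.(metadata) do
  # ...then derive the metric's tags from it.
  tag_values = metric.tag_values.(metadata)
  tags = Map.take(tag_values, metric.tags)
  # record the measurement under these tags
end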

Ecto doesn't always have a value for all its metrics, so currently I have to set 0 as a default, which obviously skews results. This would be a great addition. Is this waiting for a PR?

@bryannaegele what @hkrutzer says would be a valid use case for filtering based on measurements.

@hkrutzer I think we need to flesh out the details before moving into implementation, but we're certainly looking for contributors here 👍

@arkgil yes, definitely. This was a good use case which I have run into with Ecto, so I think it's good to support filtering on both.

My updated proposal is to change the filter predicate to accept two arguments: measurements and metadata.

distribution("ecto.writer.query.duration.ms",
  buckets: [5, 10, 20, 50, 100, 200, 500, 1000, 1500, 2000, 5000, 10000],
  event_name: [:repo, :query],
  measurement: :query_time,
  unit: {:native, :millisecond},
  filter: fn _measurements, meta ->
    match?(%{repo: MyApp.Repo}, meta)
  end,
  description: "Ecto query duration - Time spent executing the query",
  tags: [:repo]
)

Thoughts?

@bryannaegele looks great now 👌

I was working on this last night and realized the described use case is supposed to already be handled by reporters, but I failed to do so in the Prometheus one.

Under https://github.com/beam-telemetry/telemetry_metrics/blob/master/docs/writing_reporters.md#reacting-to-events

Measurements are optional, so we must skip reporting that particular measurement if it is not available

If we have this as a standard for reporters and still think it's a good idea (I do), then this is just a bug in the Prometheus reporter.
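For reference, a sketch of that rule in a reporter's handler, assuming the handler config carries the metric; extract_measurement/2 follows the pattern from the reporter guide, and record/2 is a hypothetical internal:

def handle_event(_event_name, measurements, _metadata, %{metric: metric}) do
  case extract_measurement(metric, measurements) do
    # The measurement is absent - skip this datapoint entirely.
    nil -> :ok
    measurement -> record(metric, measurement)
  end
end

defp extract_measurement(metric, measurements) do
  case metric.measurement do
    fun when is_function(fun, 1) -> fun.(measurements)
    key -> measurements[key]
  end
end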

I don't see any harm in passing measurements to the filter, but I no longer see a need to.

The only "harm" I would see is not having a need for it and making filter functions more verbose where a simple capture would suffice, such as filter: &match?(%{read_only: true}, &1)

@bryannaegele isn't the problem in the described use case not that the values are missing, but that they are set to 0 and skew the aggregations?

The only "harm" I would see is not having a need for it and making filter functions more verbose where a simple capture would suffice

I agree. Maybe this requires more thinking.

They are set to nil

I agree that we can close this for now, as the use case is addressed. We deliberately emit nil for the measurement instead of emitting zero, exactly to avoid skews.

@josevalim the metadata filtering case is still relevant and reduces the reliance on event prefixes. The remaining question was whether there's any reason to include measurements as an argument when we can't think of a use case.

We can always ship with only the metadata; in the future, if we need the measurements, we can check the function's arity and dispatch accordingly with both metadata and measurements.
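A sketch of that arity dispatch inside a reporter:

# filter came from the metric definition; dispatch on its arity.
case :erlang.fun_info(filter, :arity) do
  {:arity, 1} -> filter.(metadata)
  {:arity, 2} -> filter.(measurements, metadata)
end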

Measurements are optional, so we must skip reporting that particular measurement if it is not available

So this code should work?

Telemetry.Metrics.summary("somerepo.query.idle_time_ms",
  event_name: "somerepo.query",
  measurement: fn x -> Map.get(x, :idle_time) end,
  tags: [:environment],
  unit: {:native, :millisecond}
)

This results in

[error] Handler {TelemetryMetricsStatsd.EventHandler, #PID<0.810.0>, [:somerepo, :query]} has failed and has been detached. Class=:error
Reason=:badarith
Stacktrace=[{:erlang, :*, [nil, 1.0e-6], []}, {Telemetry.Metrics, :"-maybe_convert_measurement/2-fun-0-", 3, [file: 'lib/telemetry_metrics.ex', line: 639]}

Am I doing it wrong?

Hi @hkrutzer. Nope, that was a bug, thanks for the report. It has been fixed in master.

Thank you!

@hkrutzer This is fixed in the Prometheus library but I see your example is from StatsD. If this is happening in the StatsD reporter, that will need to be fixed in there, as well.