
Interest in using distribution for requestLatency in Http4s MetricsOps?


Hello,

I'm using this library to export HTTP4s metrics (and it works beautifully, thank you!).

The only downside is that requestLatency is reported as a timer:
https://github.com/avast/datadog4s/blob/master/code/http4s/src/main/scala/com/avast/datadog4s/extension/http4s/impl/DefaultMetricsOps.scala#L57

This has the effect of:

  1. Causing the statsd sync to aggregate these metrics at whatever percentile is configured, which makes them less useful unless you break your queries up by statsd reporter (because aggregating percentiles is statistically invalid: averaging per-host p95s does not give you the global p95)
  2. In my case, leaving me at the mercy of a rather unhelpful 95th-percentile statsd sync.

Obviously it's not that difficult to create our own MetricsOps, or wrap this one, to provide a distribution (there's a rough sketch of such a wrapper after the list below), but I wanted to check and see if there would be any interest in incorporating those changes upstream.

I see multiple forms this could take:

  1. Replace the current timer with a distribution. This is backwards-incompatible for existing dashboards, so probably not desirable, though it keeps things simple.
  2. Report the latency as both a timer and a distribution. Wasteful, but backwards-compatible and still pretty simple.
  3. Refactor the metrics ops to make the type of metric used for latency selectable by the user, leaving the DefaultMetricsOps implementation using a timer and adding a Distribution-backed alternative.
  4. Keep things as-is, we implement our own ops.
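
To make the wrapping option concrete, here's a minimal sketch. It assumes http4s's MetricsOps signatures, and Distribution[F] is a hypothetical stand-in for whatever distribution metric type the library would expose (the record signature is assumed, not real API):

```scala
import cats.Apply
import cats.syntax.apply._
import org.http4s.{Method, Status}
import org.http4s.metrics.{MetricsOps, TerminationType}

// Hypothetical minimal interface for a Datadog distribution metric;
// the real datadog4s type will likely differ.
trait Distribution[F[_]] {
  def record(value: Long): F[Unit]
}

// Delegates everything to `underlying`, but additionally reports
// total request time as a distribution.
final class DistributionLatencyOps[F[_]: Apply](
    underlying: MetricsOps[F],
    latency: Distribution[F]
) extends MetricsOps[F] {

  def increaseActiveRequests(classifier: Option[String]): F[Unit] =
    underlying.increaseActiveRequests(classifier)

  def decreaseActiveRequests(classifier: Option[String]): F[Unit] =
    underlying.decreaseActiveRequests(classifier)

  def recordHeadersTime(method: Method, elapsed: Long, classifier: Option[String]): F[Unit] =
    underlying.recordHeadersTime(method, elapsed, classifier)

  def recordTotalTime(method: Method, status: Status, elapsed: Long, classifier: Option[String]): F[Unit] =
    underlying.recordTotalTime(method, status, elapsed, classifier) *>
      latency.record(elapsed) // elapsed is in nanoseconds

  def recordAbnormalTermination(elapsed: Long, terminationType: TerminationType, classifier: Option[String]): F[Unit] =
    underlying.recordAbnormalTermination(elapsed, terminationType, classifier)
}
```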

Just wanted to gauge interest before I went down any particular path.

Thanks!

Hi, thanks for reaching out! I'm really happy that you found the library useful, nice to hear that ;)

I'm definitely interested in fixing this issue. When we developed the library, distributions were not available, but I'd be happy to add support for them. After all, the tagline of the library is to make great monitoring easy, and this is definitely a step in the right direction.

There are a couple of questions I'd like to find answers to:

  • are distributions priced differently than histograms? It sounds like they put more strain on Datadog's infrastructure, so I'd suspect they might be
  • what is the 'backward incompatibility' story like? I'd kinda expect the change from histogram (which timer uses now, although it's masked as execution time) to distribution to be compatible 🤔
  • does Timer[F] built on top of recordExecutionTime even make sense? It sounds like there really is no good use case, and all timers should be implemented on top of distributions

Let me take a look at these and get back to you. Either way, I'm very interested in solving this problem; it would make our measurements more truthful as well :)

So after sleeping on this, what do you think about the following (there's a rough sketch of the shape after the list):

  • having a global config for MetricFactory which would allow the user to choose whether to use a distribution or a histogram for Timer[F] ... maybe we could even have an option for migration that would use both at the same time, but I'm not sure that would work (for example, would there be a conflict in metric names?)
  • having a config option when creating a Timer[F] which would specify whether to use a distribution or a histogram, allowing the user to override the global config
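
Roughly this shape (just a sketch; all names here are illustrative, not the actual datadog4s API):

```scala
// Illustrative only; the real datadog4s config and factory API may differ.
sealed trait TimerImplementation
object TimerImplementation {
  case object Histogram    extends TimerImplementation
  case object Distribution extends TimerImplementation
}

trait Timer[F[_]] // stand-in for the library's existing Timer type

// Global default, chosen once when constructing the MetricFactory...
final case class MetricFactoryConfig(
    defaultTimerImplementation: TimerImplementation = TimerImplementation.Histogram
)

// ...with an optional per-timer override (hypothetical signature):
trait MetricFactory[F[_]] {
  def timer(
      name: String,
      implementation: Option[TimerImplementation] = None // None = fall back to the global default
  ): Timer[F]
}
```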

I think the Timer[F] implemented using a histogram is flawed anyway, so this would fix it for all timers. And I think it actually even IS backward compatible

Based on the docs, it seems that a distribution has the same aggregations by default, but also allows you to explicitly enable collection of additional percentiles.

Thoughts? Also, @hanny24 what do you think?

@blandflakes please take a look at #341 and whether it will solve your problem in a satisfactory manner :)

(I might do some small refactorings, but the bigger picture should remain)

Oh wow, that was fast. I'm taking a look now, but what you describe is very much in line with what I had in mind. Let me hop over there real quick...

So the changes were merged ... let me fix up the website as well as part of a different issue (hopefully tomorrow) and then we can release it

🙏 thank you so much! We aren't in a rush (though I will definitely swap to distributions as soon as they're available).

So version 0.12.0 should be out and contain the changes we discussed. Thanks for reaching out and pushing us to implement this ;) Also, I would love to hear what you're using this lib for ... which project/company, etc. :)

let me know if you run into any issues :)

Thank you! I'll take a swing at some updates on our projects. I work for Disney Streaming! We have some Play services, but our team prefers the cats-effect ecosystem, so many of our services centered around playback are Http4s. I was desperate for some service-level metrics (we have ELB metrics from AWS, but those don't let me view by route), and this library fit in perfectly. The classifier made it easy for me to add dimensions for the type of request I was serving; a quick sketch of that pattern is below.
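
For anyone landing here later, this is roughly what plugging a classifier into http4s's Metrics middleware looks like. Classifying by HTTP method is just for illustration, and the exact implicits required vary by http4s/cats-effect version:

```scala
import cats.effect.IO
import org.http4s.{HttpRoutes, Request}
import org.http4s.metrics.MetricsOps
import org.http4s.server.middleware.Metrics

// `ops` is the MetricsOps you already build (e.g. from datadog4s).
// Any label derived from the request (route, client type, ...) works
// the same way as the method name used here.
def withMetrics(ops: MetricsOps[IO], routes: HttpRoutes[IO]): HttpRoutes[IO] = {
  val classifier: Request[IO] => Option[String] =
    req => Some(req.method.name)
  Metrics(ops, classifierF = classifier)(routes)
}
```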

I see, very cool. Thanks :)