Possible to not send distribution metrics?
devshorts opened this issue · 7 comments
If we're wrapping lambdas that use shared components which emit metrics in the statsd format (increments, etc.), and we then send the same metric name as a distribution metric, the two conflict in the Datadog API.
I recently ran into this when we had a lambda temporarily emitting distribution metrics (we didn't really understand what they are or why they're different from standard statsd metrics), and it clobbered our existing metrics graphs.
Is it possible to use datadog-lambda-js while keeping the existing statsd formats?
It's a long story, but you must use distribution metrics in Lambda to get correct results; conventional gauge and counter metrics can lead to undercounting.
So you are submitting the same metric from different sources? From Lambda using distributions and from hosts using increments? If so, I would recommend updating both sources to submit distributions instead.
Batching counter/gauge metrics in the Lambda and submitting them to the API directly on your own will still result in undercounting; otherwise we would already do this for you.

The long story is that, for the same metric, if multiple data points with the same timestamp and set of tags get reported from concurrent Lambda execution environments (very possible, unless your Lambda is a cron job with concurrency capped at 1), only the data point that arrives last gets processed. For example, if data point A (my_metric, 1, service:a,env:dev,functionname:xxx) and data point B (my_metric, 2, service:a,env:dev,functionname:xxx) get reported from the same logical Lambda function but different execution environments, whichever arrives at Datadog later is the one that gets processed. This leads to undercounting. This is how Datadog works, and that decision was made from day 1. It's not an issue for host applications, because data points from different hosts always have a unique host tag to separate them out.
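For reference, here is a minimal sketch of submitting the value as a distribution from a Lambda handler using the sendDistributionMetric helper exported by datadog-lambda-js; the metric name, tags, and handler body are illustrative only:

```js
// Minimal sketch: report the value as a distribution so concurrent execution
// environments can each submit their own point without overwriting one another.
// Metric name, tags, and handler body are illustrative only.
const { datadog, sendDistributionMetric } = require("datadog-lambda-js");

async function myHandler(event, context) {
  // Each invocation contributes a distribution point; Datadog aggregates
  // these server-side instead of keeping only the last-arriving value.
  sendDistributionMetric("my_metric", 1, "service:a", "env:dev", "functionname:xxx");
  return { statusCode: 200 };
}

// Wrap the handler so the library can flush metrics at the end of the invocation.
module.exports.myHandler = datadog(myHandler);
```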
Ok that makes sense. Thanks for sharing.
However, it's not reasonable to start emitting distribution metrics from host applications, given that the same metric name would then lose all of its visualized/reportable data. The only feasible way to migrate is to change literally every metric name and discard the historical data.
Is there a migration story that makes it possible to unify this data? The existing Datadog documentation, at least for Node (https://docs.datadoghq.com/integrations/node/), references using hot-shots, which is a statsd emitter and not specific to dogstatsd.
Yes, there isn't yet a good way to submit distributions from Node.js applications (https://docs.datadoghq.com/metrics/dogstatsd_metrics_submission/#distribution). Just curious, why do you need to submit the same metric from completely different environments? To be honest, this is the first time someone has flagged this kind of issue to us. In the past, we have seen customers migrate from non-distribution to distribution metrics, and we usually recommend adding a suffix to the existing metric name in order to preserve its history.
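As a rough sketch of that suffix approach (the ".dist" suffix, metric name, and tag here are placeholders), Lambdas would start emitting a renamed distribution while host applications keep the original counter, so the old metric's history stays under its existing name:

```js
// Rough sketch of the suffix recommendation; "my_metric.dist" and the tag are
// placeholders. Host applications keep emitting the original "my_metric"
// counter unchanged, so its history stays under the existing name.
const { sendDistributionMetric } = require("datadog-lambda-js");

// In the Lambda code path, the renamed metric is submitted as a distribution;
// dashboards and monitors that need Lambda data query "my_metric.dist".
sendDistributionMetric("my_metric.dist", 1, "env:dev");
```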
We have over a year's worth of data, and our historical record is incredibly important to us; losing that history to another metric name is not ideal, especially when Datadog charges by unique metric name and tag (we'd effectively be doubling our metric count overnight!) :)
Second, we use a lot of shared code. Some of that is in our lambdas and some of that is in our services. We want the same metric to be taggable regardless of where it ran. Consider a metric like "log.level" tagged with "info", "warn", or "error". We'd want to be able to see that from lambdas or services, and since we use a common logger (we do), we'd want that information in one place.
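To make that concrete, a shared logger along these lines (purely illustrative; the metric name, tags, and environment check are hypothetical) ends up emitting the same metric name from both Lambdas and host services:

```js
// Hypothetical shared logger illustrating the collision: the same metric name
// ("log.level") is emitted as a StatsD counter on hosts and as a distribution
// inside Lambda. Names, tags, and the isLambda check are illustrative only.
const StatsD = require("hot-shots");
const { sendDistributionMetric } = require("datadog-lambda-js");

const statsd = new StatsD();
const isLambda = !!process.env.AWS_LAMBDA_FUNCTION_NAME;

function log(level, message) {
  console.log(`[${level}] ${message}`);
  if (isLambda) {
    // Distribution submission from the Lambda code path.
    sendDistributionMetric("log.level", 1, `level:${level}`);
  } else {
    // Conventional DogStatsD counter from host services.
    statsd.increment("log.level", 1, 1, [`level:${level}`]);
  }
}

log("info", "user signed in");
log("error", "payment failed");
```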
In fact, the root issue here is that we pulled in distribution metrics for lambdas without really understanding the implications, and when we emitted a log metric like the one above we wiped out all of our service metrics. It took us quite a while, working through your support team, to get our data back.
Consistency in log names and formatting is critical in most non-trivial business applications, and changing metric names also impacts existing monitors.
This is a non-trivial suggestion for a problem that is really an issue with the Datadog architecture.
From a paying client's perspective, it sounds like our options are one of the following:
- Accept undercounting in lambdas and submit via HTTP (sketched below)
- Migrate everything to distribution metrics and accept full data loss
- Migrate everything to distribution metrics and change all metric names, now paying double for the metric count but maintaining pseudo-historical data across two different metric names (one for the old non-distribution metric and one for the distribution)
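A rough sketch of that first option, assuming Node 18+ (global fetch), a DD_API_KEY environment variable, and the v1 series endpoint; metric name and tags are illustrative, and this still suffers from the same-timestamp undercounting described above:

```js
// Rough sketch of option one: batch a counter in the handler and POST it to
// the Datadog v1 metrics API yourself, accepting the undercounting described
// above. Assumes Node 18+ (global fetch) and a DD_API_KEY environment variable;
// the metric name and tags are illustrative.
async function flushCount(name, value, tags) {
  const body = {
    series: [
      {
        metric: name,
        points: [[Math.floor(Date.now() / 1000), value]], // [unix seconds, value]
        type: "count",
        tags,
      },
    ],
  };

  await fetch("https://api.datadoghq.com/api/v1/series", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "DD-API-KEY": process.env.DD_API_KEY,
    },
    body: JSON.stringify(body),
  });
}

// e.g. at the end of a handler invocation:
// await flushCount("log.level", 3, ["level:info", "env:dev"]);
```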
Yeah, that makes sense. I think there are two other potential options; I suggest you ask our support team to escalate the request to the engineering team and see if they are possible:
- Pick a new metric name, say <existing_metric>.dist or whatever you prefer, and ask if the engineering team can help update your monitors and dashboards to use the new metric via a script (I'm not on the metrics team, but I suspect you are not the first customer needing this).
- Continue using the same metric name, and ask whether the engineering team (I don't think support can do it) can help stitch the history of the existing metric onto the new distribution.
BTW, I'm going to close this issue, since there isn't any change we are going to make in this specific code base. Please contact support if you would like to file a feature request. You can also reach our serverless team (or me) in our Slack community (https://github.com/DataDog/datadog-lambda-js#community) to follow up on this question or to ask other questions.
Thanks for your valuable feedback!