dropwizard/metrics

ScheduledReporter may have missed reports

dkaukov opened this issue · 2 comments

Problem Statement

I think, fix for #1524 has undesired side-effects in our environment. Because executor.scheduleWithFixedDelay accumulating errors (due network latency, for example) when it goes over reporting period, we have "holes" in graphite metrics:
image

Possible solution

In our case, report bursts are probably lesser evil than holes, so maybe we can make it configurable.

We are seeing the same problem which is a result of #1524 and the switch to scheduleWithFixedDelay

We are sending counter metrics every minute to Graphite, Carbon Aggregator and we use nonNegativeDerivative to show only the changes per minute and we are seeing "fake" metrics.

image

The derivative function is sensitive to missing metrics, if one of the hosts aggregated is not sending and then sends again, it will show up as an increase in the metric for that minute. We are aggregating to hours in the above graph and the results are completely wrong, there is nothing happening on the server and all those bars are caused by missing metrics.

Here is an example of missing metrics showing up from two hosts:

image

I've captured the metrics on the Graphite server with tcp dump and saw the following reports from a single source server:

  • arrived 08:08:59.087611 metric payload timestamp 1659514138 (08:08:58)
  • arrived 08:09:59.343252 metric payload timestamp 1659514199 (08:09:59)
  • arrived 08:10:59.597794 metric payload timestamp 1659514259 (08:10:59)
  • arrived 08:11:59.856127 metric payload timestamp 1659514319 (08:11:59)
  • arrived 08:13:00.118483 metric payload timestamp 1659514380 (08:13:00)

You can see that we never receive data for 08:12, that's 1-min gap on the charts and one spike in derivatives.
First I suspected network latency issues and missing aggregation buckets on Graphite, but from the timestamps added by GraphiteReporter it is clear that the reporter is sending them late and missing the buckets. The increasing delay was an immediate sign of scheduleWithFixedDelay which was confirmed in the codebase.

Current metric reporting using fixed delay is problematic as the reporting interval is unstable and a function of how long does reporting itself take. The small delays add up and every few hours we see a missed interval and fake values with derivatives.

+1 for making this configurable.

This issue is stale because it has been open 180 days with no activity. Remove the "stale" label or comment or this will be closed in 14 days.