ScheduledReporter may have missed reports
dkaukov opened this issue · 2 comments
Problem Statement
I think, fix for #1524 has undesired side-effects in our environment. Because executor.scheduleWithFixedDelay
accumulating errors (due network latency, for example) when it goes over reporting period, we have "holes" in graphite metrics:
Possible solution
In our case, report bursts are probably lesser evil than holes, so maybe we can make it configurable.
We are seeing the same problem which is a result of #1524 and the switch to scheduleWithFixedDelay
We are sending counter
metrics every minute to Graphite, Carbon Aggregator and we use nonNegativeDerivative
to show only the changes per minute and we are seeing "fake" metrics.
The derivative
function is sensitive to missing metrics, if one of the hosts aggregated is not sending and then sends again, it will show up as an increase in the metric for that minute. We are aggregating to hours in the above graph and the results are completely wrong, there is nothing happening on the server and all those bars are caused by missing metrics.
Here is an example of missing metrics showing up from two hosts:
I've captured the metrics on the Graphite server with tcp dump and saw the following reports from a single source server:
- arrived
08:08:59.087611
metric payload timestamp1659514138 (08:08:58)
- arrived
08:09:59.343252
metric payload timestamp1659514199 (08:09:59)
- arrived
08:10:59.597794
metric payload timestamp1659514259 (08:10:59)
- arrived
08:11:59.856127
metric payload timestamp1659514319 (08:11:59)
- arrived
08:13:00.118483
metric payload timestamp1659514380 (08:13:00)
You can see that we never receive data for 08:12
, that's 1-min gap on the charts and one spike in derivatives.
First I suspected network latency issues and missing aggregation buckets on Graphite, but from the timestamps added by GraphiteReporter it is clear that the reporter is sending them late and missing the buckets. The increasing delay was an immediate sign of scheduleWithFixedDelay
which was confirmed in the codebase.
Current metric reporting using fixed delay is problematic as the reporting interval is unstable and a function of how long does reporting itself take. The small delays add up and every few hours we see a missed interval and fake values with derivatives.
+1 for making this configurable.
This issue is stale because it has been open 180 days with no activity. Remove the "stale" label or comment or this will be closed in 14 days.