DistributionData and view.Rows in general do not get reset between Reporting Intervals
marwan-at-work opened this issue · 3 comments
I have a program that records the latency of a request using the OCHTTP server middleware.
In my program, I am using the DataDog OpenCensus Exporter to report this metric.
Therefore, I wanted to simulate a latency spike by having my handler do the following:
-
The first 4 minutes, the HTTP Handler takes between 200-400ms to respond.
-
The second 6 minutes, the HTTP handler takes between 1-2seconds (latency spike)
-
The third 4 minutes, the HTTP handler takes between 200-400ms (latency goes back to normal)
The latency gets reported every 10 seconds
In other words, OpenCensus calls DataDog's ExportView every 10 seconds.
OpenCensus collects the Min/Max latencies and passes them as DistributionData to the DataDog exporter.
My expectation is that for every 10 seconds, OpenCensus calculates the Min/Max/Mean/etc and passes them to an Exporter, but then it should reset those numbers so that they don't obscure the next bucket of time. Is this assumption incorrect?
Even if this assumption is wrong, let's continue with how this behavior has obscured the latency spike significantly:
The fact that OpenCensus does not "reset" the Min every 10 seconds, means that the Min stays the same across the entire lifetime of the Go process.
Therefore, if we had a dashboard of the "min" value only, you'll see that the min never went up, it stays as ~200ms for the entirety of the 14 minutes. Even though for 6 minutes, the Min value was at least 1000ms.
On the other hand, if you have a "max" metric, you'll see that it went up to ~2s, but then it never went back down even though the traffic spike ended after 10 minutes.
Predictably, if I "kill" the process and restart it, suddenly the dashboard gets fixed and that's because we no longer hold the "min/max" that was previously preventing the dashboard to be accurate. See the following screen shot for demonstration:
What you see here is that the Purple line is the Max metric while the Blue line is the Min metric.
When 4 minutes passed, the Purple line went up, but the Blue line stayed down. This is because even though new incoming requests were clocking between 1-2 seconds, OpenCensus kept comparing those numbers to the Min that was initially recorded in the first 4 minutes.
However, once I kill the process and restart the server (resuming the traffic spike) you can see the Min gets updated. Similarly, when the spike finished, the Max never goes down until I forcefully reset the Max.
As far as I understand, and at the very least, restarting the process should not significantly affect what the chart looks like as seen in the screenshot above
My understanding is that OpenCensus creates a metric for a bunch of Measures per Reporting Interval (I can easily be wrong about this). Therefore, if I manually empty out the DistributionData after every interval, you can see the Dashboard now looks very correct:
Finally, I have created a reproduction Repository here: https://github.com/marwan-at-work/rcrepro
I included a Vendor folder that fmt.Println
's the incoming DistributionData and you can notice that even when the traffic starts to go up to 1-2 seconds for 6 minutes, the Min value that DataDog receives stays the same.
AFAIK, there's no need to actually set up Datadog, you can just run the program and inspect the logs.
Also, I'm not familiar with the OpenCensus codebase (until this issue), but all I had to do to fix it was add this line: marwan-at-work@1380fae#diff-2000ebb97830a1f0f1c5c4856a737f78R236
Not sure how thorough the unit tests are, but this didn't break any of them.
@odeke-em gentle ping to see if you or anyone would be able to look at this. Thanks!
@marwan-at-work I just commented on your PR.