siimon/prom-client

Improve Event Loop Lag Metric

Opened this issue ยท 10 comments

Hello,

Thank you for all the work with this library!
We are using our own implementation of the event loop lag metric because the default metric did not work well for us and I was wondering whether this is something we could contribute back? Here are the two things we have changed:

Reset libuv-Histogram
The mean/max/percentiles that libuv provides (through perf_hooks.monitorEventLoopDelay) are never reset. After a couple of days all numbers are pretty much set and do not change anymore. There is no way to even distinguish between high load times during the day and quiet times during the night. I think it would be better to expose a moving average/min/max instead of the total-average/min/max.

By default the resolution that libuv uses is 10ms which means that libuv generates ~100 measurements per second. It would be possible to reset libuv's histogram every second (or so) and record its mean and max in circular buffers. The length of the buffers should be a bit longer than Prometheus' scape interval and get averaged when collected.
Instead of the circular buffer, a histogram could be used as well (proposed in #309).
There is some discussion about this in #278.

Subtract resolution from values
libuv uses a timer and records the time that has passed since the last invocation. This means that all values that libuv provides have the resolution value "added" to them. I think it would be better if this library could subtract this number before exposing the metrics. This could also be done in Prometheus but this is a rather specific implementation detail and could vary from process to process because it is configurable.
There is some discussion about changing how libuv measures the lag but this didn't get merged: nodejs/node#32018 and nodejs/node#32102. There is also some future work in libuv/libuv#2725.

I am happy to open a PR for this so that you can have a look.

Hi, appreciate the feedback, and reading through your comments in detail is on my list.

Any interest in making a PR?

Also, pending me looking more closely, do you think some of this should be reported directly to nodejs as issues with our core APIs?

Hey @ChristianBoehlke I came across this issue. Maybe you can chime in in this other related issue I just opened: nodejs/node#34661

Not exactly the same thing, but I thought it might interest you

@ChristianBoehlke I observed the same behavior and would be interested in your changes. Did you ever had the chance to open this PR?

@ChristianBoehlke: Hey, I hope that you're well. Word on the block is that you have an awesome fix for event loop lag metrics. Is this true?

Hello! Sorry I totally forgot about this. I have some time this weekend to open up a PR with our implementation.

Good evening, @ChristianBoehlke !

Are you able to check my PR and share your thoughts regarding it?

My solution is much simpler than proposed by you (one line added actually) and I am interested to know, why more complex solution can be needed, if it is actually needed.

Seems that @ChristianBoehlke is too busy to respond on this.

From our side i would like to share experience of using #459 fork

We have now exactly that information, that we want to see from beginning:

image

It would be possible to reset libuv's histogram every second (or so) and record its mean and max in circular buffers. The length of the buffers should be a bit longer than Prometheus' scape interval and get averaged when collected.
Instead of the circular buffer, a histogram could be used as well (proposed in #309).

I still have no idea, why not just use that libuv's histogram, provided by monitorEventLoopDelay, collect that precalculated stats and reset histogram. No data will be lost such way, and user can see accurate stats for measurement period, as you can see on screenshot above.

@yarsky-tgz's PR #459 looks reasonable and seems to address the very valid issue raised here.

The discussion in the Node.js issued linked in the OP are useful. In particular, it looks like we should expose an Event Loop Utilization default metric to help with interpretation (added in v12.19, 14.10, https://nodejs.org/dist/latest-v14.x/docs/api/perf_hooks.html#perf_hooks_performance_eventlooputilization_utilization1_utilization2).

Fixed in 0f444cd (v14 coming out tonight).

Is it a resolved issue (as of #370 (comment))?