hashicorp/raft

Add saturation metric for the runFSM and run (main) goroutines.

dnephin opened this issue · 0 comments

The two primary goroutines used by Raft (runFSM, and run (main)) are each single-threaded. They can saturate (need more than 100% of the available time to handle the incoming workload) before the CPU of the system reaches 100% utilization.

When this happens, it may be possible to observe the problem using some existing metrics (e.g. fsm.apply time), but properly interpreting those metrics requires deep knowledge of how Raft works. It may also be a challenge to present the data on a dashboard, because doing so requires summing the time and knowing the aggregation period of the metrics in order to interpret the summed result.

The existing metrics may also fail to capture the full time spent working, because they only measure specific operations done by those goroutines, not the overall split between work and idle time.

This issue proposes adding two new metrics (one for each goroutine) which measure the amount of time those goroutines spent doing work. When compared to the wall clock time, this gives us a clear signal about the saturation of these operations, and how much buffer there is before the incoming work starts to cause a backlog.