em-monitor

For monitoring the distribution of CPU-spans in your EventMachine reactor thread.

em-monitor is a gem that lets you monitor your EventMachine reactor.

Introduction

As we all know, event loops are an awesome programming model. You can (mostly) forget about thread-safety, but you can still do a bazillion IO-things in parallel.

They do have one significant downside though: you can only run one CPU-thing at a time.

This means that if you accidentally spend 30 seconds running a bad regex, everything in your loop is going to get stuck for 30 seconds (that's about a million years in computer terms). This is particularly bad because one user who triggers a bad regex slows down all your other users for all 30 seconds.

EM::Monitor can't fix your code for you, but it can let you know you have a problem.

Usage

em-monitor wraps every CPU-span of code in your program and measures how long it takes to execute. You can then extract this data periodically in two ways: EM::monitor_spans calls a block with an array of raw measurements at a regular interval (60 seconds by default), while EM::monitor_histogram buckets the measurements and sums the time in each bucket. The histogram lets you plot the time your event loop spends running short CPU-spans against the time it spends running long ones.

EM::monitor_spans(interval: 1) do |spans, from, to|
  puts "Between #{from} and #{to} (#{to - from} seconds) there were #{spans.size} CPU-spans:"
  puts spans.inspect
end
#=> Between 2013-02-07 02:19:37 and 2013-02-07 02:19:38 (1.00 seconds) there were 7 CPU-spans:
#=> [0.000565469, 0.000564702, 0.000568218, 0.000564348, 0.005066146, 0.050109482, 0.050113617]

EM::monitor_histogram(interval: 1) do |histogram, from, to|
  puts "In the last #{to - from} real seconds, we used #{histogram.values.inject(&:+)} CPU-seconds"
  histogram.each do |key, value|
    puts "#{value} CPU-seconds in spans shorter than #{key} seconds"
  end
end
#=> In the last 1.00 real seconds, we used 0.1572 CPU-seconds
#=> 0.0452 CPU-seconds in spans shorter than 0.001 seconds
#=> 0.0619 CPU-seconds in spans shorter than 0.01 seconds
#=> 0.0500 CPU-seconds in spans shorter than 0.1 seconds
#=> 0 CPU-seconds in spans shorter than 1 seconds
#=> 0 CPU-seconds in spans shorter than 10 seconds
#=> 0 CPU-seconds in spans shorter than Infinity seconds
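
To make the bucketing concrete, here is a rough sketch in plain Ruby of how raw spans could be folded into a histogram like the one above. This is not em-monitor's actual implementation: `bucket_spans` and `BUCKET_LIMITS` are hypothetical names, and the bucket boundaries are simply read off the example output.

```ruby
# Hypothetical sketch, not part of em-monitor. Bucket boundaries are
# assumed from the example output above.
BUCKET_LIMITS = [0.001, 0.01, 0.1, 1.0, 10.0, Float::INFINITY].freeze

# Sums the duration of each span into the smallest bucket whose limit
# is greater than the span.
def bucket_spans(spans, limits = BUCKET_LIMITS)
  histogram = limits.to_h { |limit| [limit, 0.0] }
  spans.each do |span|
    limit = limits.find { |l| span < l } || limits.last
    histogram[limit] += span
  end
  histogram
end

spans = [0.0005, 0.0005, 0.005, 0.05, 0.05]
histogram = bucket_spans(spans)
histogram.each do |limit, cpu_seconds|
  puts "#{cpu_seconds.round(4)} CPU-seconds in spans shorter than #{limit} seconds"
end
```

Note that each bucket holds the total time spent in spans of that size, not a count of spans, so two 50ms spans weigh as much as a hundred 1ms spans.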

Plotting results

The easiest way to plot the histogram data is as a stacked chart. If your tool of choice can't stack charts directly, you can call EM::monitor_histogram(stacked: true), which makes each larger bucket include the sum of all the smaller buckets in addition to the CPU-spans that fell into that bucket directly.
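
If you'd rather stack the buckets yourself, the description above amounts to a running total over the buckets in ascending order. The following is a minimal sketch under that assumption; `stack_histogram` is a hypothetical helper, not something em-monitor provides.

```ruby
# Hypothetical helper, not part of em-monitor: turns per-bucket totals into
# running totals, so each bucket also includes every smaller bucket.
def stack_histogram(histogram)
  running = 0.0
  histogram.sort_by { |limit, _| limit }.each_with_object({}) do |(limit, value), stacked|
    running += value
    stacked[limit] = running
  end
end

histogram = { 0.001 => 0.0452, 0.01 => 0.0619, 0.1 => 0.0500, 1.0 => 0.0 }
stacked = stack_histogram(histogram)
stacked.each { |limit, total| puts "#{total.round(4)} CPU-seconds in spans shorter than #{limit} seconds" }
```

With stacked values, the largest bucket always equals the total CPU time for the interval, so plain line charts sit on top of each other the way a stacked chart would.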

Plotting the histogram gives you a graph of absolute CPU time used per interval (per minute, by default), which you can normalize to a utilization percentage in two ways:

# The absolute magnitude of the lines plotted here will be correct, but if
# you plot a stacked area graph the area will under-estimate the impact of
# CPU-spans of a similar order of magnitude to `interval`.
histogram.map{ |key, value| value * 100 / (to - from) }

# Looking at the absolute magnitude of this graph will over-estimate CPU-spans
# in the short term, but if you plot a stacked area graph the area will be
# more correct.
histogram.map{ |key, value| value * 100 / interval }
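
As a worked example of the first normalization, the sketch below converts a histogram into percentages while keeping the bucket keys. It assumes `from` and `to` are Time objects, as the example output earlier suggests; `utilization` is a hypothetical helper and the numbers are made up for illustration.

```ruby
# Hypothetical helper, not part of em-monitor: converts CPU-seconds per
# bucket into a percentage of the wall-clock time between `from` and `to`.
def utilization(histogram, from, to)
  elapsed = to - from # Time - Time gives the difference in seconds as a Float
  histogram.transform_values { |cpu_seconds| cpu_seconds * 100 / elapsed }
end

from = Time.at(0)
to = Time.at(2)
histogram = { 0.001 => 0.5, 0.01 => 0.25, 0.1 => 0.0 }
percentages = utilization(histogram, from, to)
percentages.each { |limit, percent| puts "#{percent}% of the time in spans shorter than #{limit} seconds" }
```

Swapping `elapsed` for the configured interval gives the second normalization.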

If you need to combine the results from multiple machines, you should instead use EM::monitor_histogram(cumulative: true) and keep track of the total cumulative CPU centrally. Summing across machines and then plotting the derivative will give you a stable plot that makes sense when averaged.
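
The central bookkeeping could look something like the sketch below. It assumes each machine reports ever-growing per-bucket totals in cumulative mode; `sum_histograms` and `rate` are hypothetical helpers, and the sample data is invented for illustration.

```ruby
# Hypothetical helpers, not part of em-monitor.

# Adds up the per-bucket cumulative totals reported by each machine.
def sum_histograms(histograms)
  histograms.each_with_object(Hash.new(0.0)) do |histogram, total|
    histogram.each { |limit, value| total[limit] += value }
  end
end

# Approximates the derivative: CPU-seconds used per real second between
# two summed samples taken `seconds_apart` seconds apart.
def rate(previous, current, seconds_apart)
  current.to_h do |limit, value|
    [limit, (value - previous.fetch(limit, 0.0)) / seconds_apart]
  end
end

# Two machines, sampled 60 seconds apart.
sample_t0  = sum_histograms([{ 0.01 => 1.0, 0.1 => 2.0 }, { 0.01 => 3.0, 0.1 => 0.0 }])
sample_t60 = sum_histograms([{ 0.01 => 4.0, 0.1 => 2.0 }, { 0.01 => 6.0, 0.1 => 3.0 }])
per_second = rate(sample_t0, sample_t60, 60)
per_second.each { |limit, r| puts "#{r.round(4)} CPU-seconds per second in spans shorter than #{limit} seconds" }
```

Summing before differentiating means a machine that reports late only delays its contribution rather than dropping it, which is why the resulting plot stays stable.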

To get a feel for how this works, look at example/gnuplot.rb or example/librato.rb.

Meta-fu

There's API documentation if you'd like it.

Everything is licensed under the MIT license, see LICENSE.MIT for details.

Pull requests and bug reports are very welcome.