fralalonde/dipstick

New metric type gauge supplier

mixalturek opened this issue · 17 comments

It would be great to have a gauge that can call a function to obtain the value it reports. Current gauges use a push model; this would be pull. Inspired by https://metrics.dropwizard.io/3.2.3/manual/core.html#gauges.

Use cases:

  • various buffer sizes
  • dynamic thread pool sizes
  • system and process metrics, partly from the /proc filesystem
    • application uptime
    • RAM allocated by process
    • number of threads
    • number of open files
  • many more

Reporting the application uptime using the existing API:

// Setup
let app_start_time = std::time::Instant::now();
let bucket: AtomicBucket = ...;
let uptime = bucket.gauge("uptime");

// Manually schedule a timer to periodically call this code
uptime.value(app_start_time.elapsed().as_millis());
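
For completeness, the "manual scheduling" mentioned above could look like this with a dedicated thread (just a sketch; the thread and the 10-second period are my choices, and a real application would likely reuse an existing timer):

// Hypothetical manual scheduling: push the gauge value from a background thread.
// `uptime` and `app_start_time` are the ones defined above.
std::thread::spawn(move || loop {
    uptime.value(app_start_time.elapsed().as_millis() as isize);
    std::thread::sleep(std::time::Duration::from_secs(10));
});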

Proposal for the new API. (Fights with the borrow checker and multi-threading may occur; this is only to convey the idea.)

// Setup
let app_start_time = std::time::Instant::now();
let bucket: AtomicBucket = ...;
// The callback is evaluated on each reporting (or somehow internally scheduled, if enabled).
bucket.gauge_supplier("uptime", || app_start_time.elapsed().as_millis());

I plan to implement the feature soon, either directly in Dipstick or as an extension. If you like it, I will be happy to contribute. I have already looked at the code; it would probably require significant internal changes due to the push model in InputMetric::write(), so I'm asking in advance...

I agree this feature would be useful; I had it implemented in another metrics library some time ago. However I don't think a new metric type is required.

I would rather define a standard Watch closure type whose job is to observe a resource and report its value to a regularly defined metric (such as a gauge). This could then be either scheduled using a mechanism similar to flush_every, or set to be triggered when flushing an output (Bucket or I/O).

type Watch = Fn(u64) -> ();  // observes a resource and reports its value to a metric
let gauge = metrics.gauge("some_resource");
let watch: Watch = |now| gauge.value(resource.len());
// Either scheduled on its own period...
let scheduled_watch = Schedule.watch_every(Duration::from_secs(10), watch);
scheduled_watch.cancel();
// ...or triggered whenever the output (Bucket or I/O) flushes.
bucket.watch_on_flush(watch);

Maybe the target metric could be maintained externally to the closure and passed in on every invocation; it could then also serve as the map key to register / unregister watches on flush. The Watch signature could also return a Result<bool>, both to simplify the error handling required when monitoring I/O resources and to track internal metrics about how often watches actually record values.
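
A minimal sketch of that variant (assuming std::io::Result as the error type and keying by metric name; none of these names are existing dipstick API):

use std::collections::HashMap;
use dipstick::{AtomicBucket, Gauge, InputScope};

// The watch receives its target metric on every call (cheap to clone, so passed
// by value here) and reports whether it actually recorded a value.
type Watch = Box<dyn Fn(Gauge) -> std::io::Result<bool> + Send + Sync>;

let bucket = AtomicBucket::new();
let gauge = bucket.gauge("some_resource");
let watch: Watch = Box::new(|gauge| {
    gauge.value(42); // e.g. resource.len()
    Ok(true)
});
// Keyed by metric name here; the real design might key by the metric itself.
let mut watches: HashMap<String, Watch> = HashMap::new();
watches.insert("some_resource".to_string(), watch);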

As an extension to this, a dedicated Heartbeat (a constant-reporting Watch) could be enabled to make sure at least one metric is sent to downstream systems, marking the application as "active" even if no other activity is recorded. See #8.

OK, that's also possible; this would be the external implementation I was writing about. I'm starting to work on it.

I have finished a proof of concept; it works, but it is still quite dirty and covers only AtomicBucket. What do you think about the API? I don't like the Arc::new() that is necessary when defining the callbacks, and I'm not sure whether they should return MetricValue or Result.

I'm now fighting with generic support for all the other types and with scheduling independent of flush_every(). It's quite difficult for me and I'm uncertain about many things. Should I create new traits similar to Flush and ScheduleFlush? Is that the correct direction?

I had a quick look at the PR; I will try to find some time to play with it soon so I can give you feedback. I like the observe method. I think it should maybe live on the Input trait so that any implementer can benefit from it. As you mention (if I understood correctly), observe should either be generic, taking in any type of Metric (Counter, Gauge, etc.), with the observing closure required to take in a Metric of the same type. Or, another way of doing it, observe could be defined on every Metric type, so you could have let observer = my_gauge.observe(|gauge| gauge.value(resource)) and then either register the observer to be triggered on flush using a method on Input, or schedule it using Scheduler.

I like the gauge|timer.observe(...) idea, but I would be afraid of the cyclic dependencies (borrow checker issues) that would probably emerge. The current implementation of Scheduler spawns a new thread for each schedule, which is a no-go for now. But we can switch to a different implementation backed by a single thread or a thread pool; there are multiple crates for that.
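
As a first step, something along these lines might be enough (a rough sketch only; one background thread drives all registered tasks, and the fixed shared period is my simplification):

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// All tasks share a single background thread that ticks at a fixed period,
// instead of one thread per schedule.
type Task = Box<dyn Fn() + Send>;

fn spawn_scheduler(period: Duration, tasks: Arc<Mutex<Vec<Task>>>) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        thread::sleep(period);
        for task in tasks.lock().unwrap().iter() {
            task();
        }
    })
}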

My colleague showed me today how to get rid of the Arc in the client API with no dirty hacks; it should work, I will give it a try. (I'm still pretty new to Rust...)

The current scheduler is very basic and was always intended to be supplanted by a user-supplied thread / thread pool in any non-trivial situation. The only thing I'd be wary of is bringing in any non-optional heavy dependency set, especially tokio...

Be mindful that the Arcs appear in a lot of places because the configured components may be operating on different threads (using the async queue), and there's no way to know in advance. Also, the Arc overhead should be minimal once the app is running, since Arcs are only used for "persistent" components, not for wrapping the metrics data itself. This excludes the async queue, in which case the Arc overhead is expected to be offset by the extra threading. Again, this could be optimized using a preallocated circular buffer or some other fancy structure.

The other place where Arc could show some minor overhead is dynamic metrics creation, which I think you're doing. I would expect the total overhead of dynamic metrics to overshadow any Arc overhead. Another solution would be to use static metrics!() but with Labels and a custom output format, which is new in 0.7; however, this would not work with AtomicBucket since labels are discarded on aggregation.

I have managed to remove the Arc that was wrapping the closure in the client API; metrics.observe("answer", || 42); looks much better. And I have also fixed a memory leak that was in the initial implementation.
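
With that in place, the uptime example from the top of this issue could be written roughly like this (a sketch of the work-in-progress API; the exact signature may still change):

let app_start_time = std::time::Instant::now();
let bucket = AtomicBucket::new();
// Evaluated by the bucket when reporting; no Arc around the closure anymore.
bucket.observe("uptime", move || app_start_time.elapsed().as_millis() as isize);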

I have rethought the idea of counter.observe(...), timer.observe(...), etc. once again. I'm still pretty unsure, but observing probably makes sense only for gauges, or at least in 99.99% of cases. I can't imagine a real metric being passed to a counter or timer this way.

If you are OK with the current API and code, the next two steps will be:

  • It will probably be beneficial to make observing independent of flushing and ScheduleFlush::flush_every(); in general the two periods may differ.
  • Remove the default implementation of InputScope::new_observer() and add concrete ones wherever necessary.

I'm having a look at your observer branch. Here are a few comments:

  • There should be a new trait, FlushObserve: Flush
  • AtomicBucket::new_observer(...) should be split into two methods, GaugeObserver::new() -> Self and FlushObserve::observe<GO: Into<GaugeObserver>>(obs: GO) -> CancelHandle. AtomicBucket would impl this new trait, as eventually all other Flush implementers would, maybe with a helper class (a rough skeleton is sketched after the next list).
  • I am unfamiliar with the pub(crate) syntax; is this Rust 2018? Let's stick with just pub for now.

continued:

  • AtomicBucket observers should be a Set, since the observers are anonymous, and handles are used to cancel them (see Scheduler).
  • (Maybe in another PR) GaugeObserver could have its own observe_every(Duration) -> CancelHandle method that would schedule it for recurrent observation, untied to any flushing activity.
  • An example program demoing / testing the feature(s) would be great.
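
Putting those bullets together, here is a rough skeleton of how I read the design (placeholder types, hypothetical names and empty bodies; not the final API):

use std::time::Duration;

// Handle used to cancel a registered observer (as with Scheduler).
pub struct CancelHandle;

impl CancelHandle {
    pub fn cancel(self) { /* unregister the observer */ }
}

// An anonymous observer pairing a gauge with the closure that reads the resource.
pub struct GaugeObserver {
    observe: Box<dyn Fn() + Send + Sync>,
}

impl GaugeObserver {
    pub fn new(observe: impl Fn() + Send + Sync + 'static) -> Self {
        GaugeObserver { observe: Box::new(observe) }
    }

    // Maybe in another PR: recurrent observation, untied to any flushing activity.
    pub fn observe_every(self, _period: Duration) -> CancelHandle {
        unimplemented!()
    }
}

// Placeholder standing in for dipstick's Flush trait.
pub trait Flush {
    fn flush(&self) -> std::io::Result<()>;
}

// Implemented by AtomicBucket and eventually all other Flush implementers.
// Observers would be kept in a set, anonymous, cancelled through their handles.
pub trait FlushObserve: Flush {
    fn observe<GO: Into<GaugeObserver>>(&self, obs: GO) -> CancelHandle;
}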

If you have other ideas, find that my design isn't cutting it, or are just somehow uneasy with this, let me know and we'll find a way to make this work. This is a valuable feature, which is why I'm raising the bar. I am glad to have someone besides me taking an interest in this.

I'm sorry, I had other tasks with higher priority; I will probably continue this week.

Hi Francis, it's quite difficult for me to start from your high-level description. I still have some difficulties thinking in Rust and its compositions; it isn't a simple language and the entry barrier is quite high. Could you please write, in an hour or so, a skeleton of the traits and API with empty methods to make it clear? I hope concrete code, or at least pseudo-code, will significantly help me see all the connections in your design. I don't want to waste my time slot for this searching for a proper solution if you already have the design in your head.

Sure, I'll try to find some time today or this weekend to do that.

OK, I finally started to work on it; I'll post a link to my branch soon.

Here's a tentative PR #50

I'm sorry, I don't have any time for this these days. I will hopefully continue in the near future.

Addressed in version 0.7.2 with observe() from #50