apple/swift-metrics

RFC: Introduce means to "destroy"/"remove" metrics objects

ktoso opened this issue · 7 comments

ktoso commented

Importing from tomerd/swift-server-metrics-api-proposal#6

Rewritten the rationale in a more formal form, please discuss.


Define a way to "destroy" metrics

  • Previous ticket: tomerd/swift-server-metrics-api-proposal#6
  • Status: implemented #3 open to adjustments
  • For end-users: This is an additive change, it enables end-users to do things which they were not able to do before; it does not force them to do so.
  • For metrics library authors: This sets a strong implementation style recommendation, that resources shall be allowed to be destroyed at the users request. We ask that implementations important for the Swift Server ecosystem adopt this semantic, yet also allow that there may be implementations which do not have a need for implementing this call (e.g. stateless ones)

This proposal aims to explain the need for "destroying" metrics via the common Metrics API. The proposal explores a wide variety of possible metrics library implementations, and attempts to strike a balance that is safe, performant and should not block any implementation from achieving it's specific goals.

In order to bring more focus onto the discussion -- as it has been laying around, but perhaps its importance underestimated so far, the proposal will follow swift-evolution inspired format for the writeup.

Proposal

There must be a well-defined way to "destroy", metrics in long lived processes if we know a metric will never again be touched.

This proposal forms a case for why this API is required and best to be included right away, rather than as an after thought, and that it is best to perform the "destroy" operations explicitly rather than implicitly.

Having that said, the primary goal here is that there must be a common agreed way to solve this for all implementations of this API, otherwise the model breaks down for any external library exporting metrics -- so if someone had better ideas, we are open to discussing them, within reasonable timeline.

Terms

Before diving into the case analysis, let's define some terms to be on the same page when we talk about things.

  • metrics API -- for the sake of this discussion, this entire library apple/swift-metrics
  • metrics backend, metrics library -- a library which implements this API,
  • external library / "3rd party library" -- for sake of this discussion I mean this only as "some library which depends on metrics API, and wants to offer metrics. It can not and does not know about the metrics library." This is not about the 1st/3rd party thing in the sense of "an Apple library or not."
  • short lived metric -- the "short" is relative here, but the point is "lives shorter than the entire lifetime of the application

Analysis of cases where delete/destroy does not matter

Before we dive into the actual problematic cases, let us acknowledge that for a lot of metrics, the destroy() may never need to be invoked, and this is absolutely fine and expected.

Metrics which stay around for entire lifetime of app

This is very typical in HTTP servers which want to instrument and expose a "global view", e.g. offering: "histogram of all request durations", "gauge of memory used by app", "gauge of requests per second, globally for app" etc. We can see an example of such here in Vapor Metrics: https://github.com/vapor-community/VaporMonitoring/blob/master/Sources/VaporMonitoring/VaporMetricsPrometheus.swift#L14-L23

These are absolutely expected to never have to be destroyed -- these kinds of metrics are few, and stay around for the entire lifetime of the application. This proposal does not concern this type of metrics at all; with explicit destroy such metrics would never call it, and with implicit via ARC they would ensure to always keep strong references.

👍

Problematic cases analysis

"Caching metrics library" and "short lived" metrics

This is an example of what might quite often be implemented in the real world: a metrics backend which has a form of "registry,"
in fact, current implementations in Swift already fall in this category e.g.:

These libraries keep strong references to metrics objects in order to print or give them back to users when asked by label. There is no API that might allow removing those strong references, thus whichever resources for metrics we would allocate, stay there forever. This may seem fine for typical HTTP metrics which include global throughput, or per-path endpoints, but quickly explodes in size if metrics are high-density histograms (weighting potentially kilobytes), or many unique identities want to report metrics, but are known to disappear after a while.

Inspecting live system for more information "for a while"

Another case to highlight is a case of temporarily increasing "measure everything on this hot path for a few seconds", where we may have instrumented code, but the counters are not being hit in normal operation, however when debugging we may want to enable "measure everything" which would be too expensive to keep always on in production.

We may get a lot of counters and metrics from this, inspect them, and maybe find a bottleneck -- e.g. imagine a streaming application with many stages. And it is grinding to a halt -- you may want to measure "what is the throughput at each of the stages of this stream" to locate the bottleneck. You do not want those metrics to be reported forevermore from here on, since collecting them is too costly at this granularity, but during interactive debugging it can be invaluable to gain insights about the system.

In this example, the stream library or someone could has explicit information about when to stop measuring, and the metrics objects can be destroyed. This is only one of the examples where we might want to have explicit destroy(), calls but plenty other situations call for a similar mechanism. Hopefully this one is interesting enough to exemplify the issue.

Proposed Solution

Solution 1: library.destroy(Metric) or metric.destroy() 👍

This proposal aims to introduce an explicit function on metrics factory to "destroy" or "delete" (name to be decided),
metrics from the system; It is up to implementations to decide when exactly it will be destroyed, however they should aim
to do so as early as possible after such call:

public protocol MetricsFactory {
    // ...

    /// Signals the `MetricsFactory` that the passed in `MetricHandler` will no longer be updated.
    /// Implementing this functionality is _optional_, and depends on the semantics of the concrete `MetricsFactory`.
    ///
    /// In response to this call, the factory _may_ release resources associated with this metric,
    /// e.g. in case the metric contains references to "heavy" resources, such as file handles, connections,
    /// or large in-memory data structures.
    ///
    /// # Intended usage
    ///
    /// Metrics library implementations are _not_ required to act on this signal immediately (or at all).
    /// However, the presence of this API allows middle-ware libraries wanting to emit metrics for resources with
    /// well-defined life-cycles to behave pro-actively, and signal when a given metric is known to not be used anymore,
    /// which can make an positive impact with regards to resource utilization in case of metrics libraries which keep
    /// references to "heavy" resources.
    ///
    /// It is expected that some metrics libraries, may choose to omit implementing this functionality.
    /// One such example may be a library which directly emits recorded values to some underlying shared storage engine,
    /// which means that the `MetricHandler` objects themselves are light-weight by nature, and thus no lifecycle
    /// management and releasing of such metrics handlers is necessary.
    func destroy<M: Metric>(metric: M)

    // alternatively, delete the Metric type and provide overloads:
    // func destroy<C: Counter>(_ counter: C)
    // func destroy<R: Recorder>(_ recorder: R)
    // func destroy<T: Timer>(_ timer: T)
}

Open question 1, naming bikeshed:

  • destroy
  • delete
  • release was ruled out since it is confusing with ARC release terminology
  • better ideas?

It should be "dual" to the make, I'm open to ideas here.

Open question 2, should we add the Metric type to allow writing the single destroy<M: Metric>(metric) function, or rather ask implementations to implement 3 overloads. I am on the side of introducing the Metric type, but am happy to be convinced otherwise if there are good reasons.

Open question 3, shall the destroy() method live:

  • on the MetricsFactory (only),
  • also be added to MetricsSystem (and delegate to the factory), or
  • be moved to the metric types themselves so one could write counter.destroy()

I am strongly in favor of the current design and/or adding it to the system itself as well, as adding the function to the types themselves will lead us to:

  • grow storage size per metric by another pointer to the metric system which can be bothersome for light objects like counters (esp if an object would contain many conters).
  • this reference to the factory inside of the metric likely cause a reference cycle between the factory (if it has a "registry", which many advanced systems will have -- e.g. all current implementations in Swift), which we'd need to break and complicate the design even more.
  • the majority of use cases where destroy calls will be needed a library or framework will be doing these calls, and not end users (!).
    • if end users do want to destroy metrics explicitly, they likely know they want to do so.

Note that this design addition is strictly adding functionality of removing metrics, to implementations which otherwise would never release metrics at all. In that sense, this proposal does not make it "worse" or "more difficult" than existing API and implementations -- they did not release, and would continue not to release (unless they care to), and would be as well-off as they have been already.

Detailed design

A full implementation is provided in #3

Implementations are simple, and are the dual of making a metric.

public func destroy<M: Metric>(metric: M) {
    switch metric {
    case let counter as Counter:
        self.counters.removeValue(forKey: counter.label)
    case let recorder as Recorder:
        self.recorders.removeValue(forKey: recorder.label)
    case let timer as Timer:
        self.timers.removeValue(forKey: timer.label)
    default:
        return // nothing to do, not a metric that we created/stored
    }
}

Alternatively, 3 overloads of this could be provided, and the now-added for this implementation Metric type could be destroyed. Note that the removing of metrics could be implemented using various ways, but most of the time will boil down to removing by label, or some other identifier (which does NOT necessarily have to be the printable name -- e.g. if a library supported some form of nesting etc).

This, again, ONLY matters if a metrics factory as a form of "registry." If your implementation is completely stateless e.g. "print out whenever a metric is updated", this change does not impact your library, so you can implement it as:

    public func destroy<M: Metric>(metric: M) {
        // no-op
    }

It has been considered to provide a default empty implementation however this seems to cause confusion for reviewers as well as for people using the library -- they can not easily check if the implementation is empty or not just by opening the metrics library source. It is nicer and more explicit for understandability of a library to explicitly opt out.

As for user API impact:

  • for metrics which "stay for the lifetime of the app"
    • users nor library authors need to change anything.
  • for metrics which are managed by a framework or library and have fixed "short" lifecycle
    • library developers are likely aware of these lifecycles and know when to invoke destroy
    • if they have lifecycles, but do not destroy the metrics, it is not worse than status quo, and it is simpler to put destroy calls in the right lifecycle callbacks which such framework likely has (even if internally)
  • for metrics which are created and managed by end users:
    • most metrics are "for app lifetime" metrics, and not releasing is exactly same as status quo and same as many other metrics ecosystems
    • some metrics, which users know they want to destroy since interactively debugging or similar, they are now enabled to do so.

Alternatives considered

"Just™ Ride on ARC" 👎

"Just™" riding on ARC for correctness of a program and always releasing whenever the users "let go" of a metric sounds very tempting, however it yields very unexpected side effects, and blocks from easily implementing various useful patterns.

This seems very tempting, however we argue that it is abusing ARC and actively complicates the designs and understandability of lifecycle of metrics. Most notably, it makes the following style not possible to implement correctly:

// "Exhibit A"
// CAN'T implement this nicely with ARC based destroy()
class RarelyCreatedResource {
    static func make() {
        // we want easily measure how many times this is invoked
        // so we add the following line:
        Counter(label: "RarelyCreatedResource.creations").increment()
        // we assumed our metrics library is a "caching one" (see above)
        // ...
        // but if the implementation is ARC based, the counter is destroyed() right away after this line
    }
}

// Would have to be this:
class RarelyCreatedResource {
    static let creationsCounter = Counter(label: "RarelyCreatedResource.creations")

    static func make() {
        creations.increment() // always works, _regardless_ if ARC based or `destroy()`-API based.
    }
}

Another scary implication of such design is that it MAY lure implementations into performing more cleanup actions in deinit, which is best to be avoided and could/would most likely cause issues in multi-threaded environments (esp. since there are no guarantees about which thread the deinitializer is run on). Alternatively implementations would have to implement "periodically clean up all metrics which may have been released" -- finding those metrics again requires synchronization and potentially taking locks around the registry cleanup operation -- which may be contended by metrics user threads wanting to obtain metrics instances.

Another example where ARC complicates things in "weird ways", because we are actually abusing it somewhat to express lifetimes which do not necessarily match the lifecycle of the metric itself (!), is when we would like to express the following:

  • A metrics backend which collects all metrics, and prints them out every 3 seconds (or reports totals to somewhere).

Now imagine that we have a counter that comes in and out of reachability a few times during this period. This means it's storage would also be released as it appears and disappears via ARC lifecycle management. This would cause us to report wrongly to the user that the metric has a count of x while in reality it had some x+y+z value.

This is possible to work around of course, by separating the metric storage from the weakly referenced by the registry metric itself, e.g. like this:

// "Exhibit B"
// "work" but is abusing the ARC mechanism and making it difficult to reason about lifecycle
// NOT proposing we adopt this.

class AtLeastOnceCounterHandler: CounterHandler {
     let container: AtLeastOnceCounterContainer

      init(container: AtLeastOnceCounterContainer) {
         self.container = container
     }

      func increment(by: Int64) {
         self.container.value += by
     }
     func reset() {
         self.container.value = 0
     }

  }
 class AtLeastOnceCounterContainer {
     weak var handler: AtLeastOnceCounterHandler? = nil
     var value: Int64 = 0
 }

class PrintAtLeastOnceMetrics: MetricsFactory {
    var counters: [String: AtLeastOnceCounterContainer] = [:]

    public func makeCounter(label: String, dimensions: [(String, String)]) -> CounterHandler {
        let container = AtLeastOnceCounterContainer()
        let counter = AtLeastOnceCounterHandler(container: container)
        container.handler = counter
        counters[label] = container
        return counter
    }

This implementation makes it non trivial to understand when a metric has reached end of its life, and also makes it a requirement to start implementing "gc-like" structures in the registry -- it would have to scan the registry for "was possibly something destroyed?", which would need to take write locks around the datastructure perhaps etc.

All in all this implementation would make it:

  • hard to reason about
  • very complex, multiple cyclic references that need to be broken
  • needs to invent GC-like processes to clean up the registry; since doing so from deinitializers is highly discouraged.

Summary:

  • While it is very tempting to "ride on ARC" for the releasing of metrics objects, it is the semantically wrong thing to do.
  • "Ride on ARC" may yield surprising problems, like Exhibit A, which people could expect to work fine, until later on, on production, they discover they made a mistake; I would not be surprised if someone was then tempted to implement a strong registry themselves -- likely on strong references, and cause the entire ARC dance to be useless.
  • Since lifecycle in this alternate solution is strongly fixed by ARC, users would have to "keep passing around" the same exact instance of a counter, potentially causing much boilerplate.
  • It may seem that riding on ARC is less error prone, in the sense that "people won't have a destroy() call they can forget about", however in reality the majority of metrics are either: a) "forever" or b) "managed for end-users" by middle ware / frameworks / libraries which explicitly control lifecycle of some resources.
  • We can not infer "will never be used again" safely from just ARC alone, as the Exhibit A shows.

Prior Art (for destroy/release API)

Go

Rust

Java

Erlang

Summary:

  • Almost all libraries / APIs opt for explicit release/destroy APIs
    • technically runtimes like the JVM (or Rust, even quite the same with RC/ARC) could do "ARC style", but it is considered an anti pattern, so they do not.
    • it is worth noting that metrics lifecycle simply is not the same as object lifecycle itself for many of the API patterns we may want to offer, thus an explicit API more natural.
ktoso commented

I updated the writeup to be swift-evolution inspired, so it hopefully explains the tradeoffs and challenges of destroying metrics somewhat more.

There are some minor things to flesh out (like names) in the main proposal.
And I also followed up on the request by @MrLotU in tomerd/swift-server-metrics-api-proposal#11 (comment) to make saying "this metrics library does not do anything on remove/destroy" explicit -- which i agree is a good idea, and low effort for library developers.

More Prior Art

Micrometer ("SLF4J, but for metrics" sic) has also added support for removing meters in version 1.1.0 (end of last year).

Actual Use Case

I came across a use case for this recently where a long running JVM us used for running jobs that emit metrics. Not being able to clean up these metrics post execution was an issue.

Big fan! Great write @ktoso!

Open question 1, naming bikeshed:

destroy
delete
release was ruled out since it is confusing with ARC release terminology
better ideas?

Personal preference: destroy

Open question 2, should we add the Metric type to allow writing the single destroy<M: Metric>(metric) function, or rather ask implementations to implement 3 overloads. I am on the side of introducing the Metric type, but am happy to be convinced otherwise if there are good reasons.

I'm for 1 generic solution since in my specific case, all metrics inherit from the same protocol and the logic to remove one would be the exact same for all three. Though I can see why separate ones would be preferred.

A possible workaround to this would be to have 3 separate functions on MetricsSystem and 1 generic one on MetricsFactory though that might become too confusing 😄

Open question 3, shall the destroy() method live:

on the MetricsFactory (only),
also be added to MetricsSystem (and delegate to the factory), or
be moved to the metric types themselves so one could write counter.destroy()

I'm okay with either of the 2 first options. I'm agains the third for the reasons mentioned in the proposal. I think for ease of use it'd be nice to go with option 2, so in theory users only have to ever interact with one thing, the MetricsSystem.


I left implementation specific comments in #3

Open question 1, naming bikeshed:

markForDeletion, markForDestruction, markToDelete, or some other permutation of mark{}

For two reasons:

  1. It more aptly describes from the API of what this does: It just tells the system the metric object is no longer intended for use. Even the code comments say that it's possible this is a no-op, so the name should give some indication to users at the call site of the behavior.
  2. Code completion will have mark and make pretty close to each other - signaling the beginning and end of a lifecycle of a metric object.

Open question 2, should we add the Metric type to allow writing the single destroy<M: Metric>(metric) function, or rather ask implementations to implement 3 overloads. I am on the side of introducing the Metric type, but am happy to be convinced otherwise if there are good reasons.

I'm a little on the opinion of having 3 overloads and waiting to see how cumbersome it is in practice - especially if we have no other use case for a common Metric protocol. Should another use case appear at this level (meaning the metric API), I would have no problem of deprecating the overloads in favor of a generic.

Open question 3, shall the destroy() method live:

  1. on the MetricsFactory (only),
  2. also be added to MetricsSystem (and delegate to the factory), or
  3. be moved to the metric types themselves so one could write counter.destroy()

I think option 1 is the route to go - as the MetricsSystem right now is mostly focused as a container and entrypoint to starting up the system. Users work with initializers or factories to create metric object instances, so we shouldn't start mixing responsibilities of MetricsSystem.

ktoso commented

Thanks for the additional datapoint @ddossot! I added it to the list.

Thanks for chiming in about the open questions everyone, let's give the thread some time so others can also have a look and I'll sum up and adjust PRs accordingly soon then.

ktoso commented

Thank you everyone for chiming in, I worked on this some more to incorporate all feedback and the result is in: #17 (separate PR since the other one keeps causing github to crash with 500 for me... long story, seems it does not believe that repo is a fork).

Please have a look, I think that version addresses all concerns, and with regards to reaching to global state is no-worse-than what we already do in inits (since we have to reach to the system). The alternative would be to make metrics more heavy.