awslabs/soci-snapshotter

[Bug] Fix synchronous fetch count metric

turan18 opened this issue · 0 comments

Description

We increment a metric (`SynchronousReadRegistryFetchCount`) every time we need to fetch bytes of content synchronously. This isn't really indicative of registry fetches, as the content could have been served from our local cache. What we really need to know is the number of synchronous fetch operations that missed the cache, where we actually had to go over the network to get the requested bytes.

In a perfect scenario, this number would always be 0, meaning the content is always fetched and cached before the container needs it. In practice, this is almost never the case, so the metric should give us an idea of how well the background fetcher keeps up with the content the container needs, and an indication of how much optimizations like prioritized pre-fetching (e.g. Load Order Documents, #749) could benefit certain workloads.

Note: This isn't a perfect heuristic for determining the benefit of LODs, since there could be several reasons, unrelated to the ordering of span fetches, why the background fetcher couldn't fetch and cache the requested content beforehand (e.g. the container needed many sequential spans and the background fetcher couldn't get to them in time, or a transient failure occurred when trying to fetch/cache the content).

Steps to reproduce the bug

No response

Describe the results you expected

Ideally, we should only increment this metric when we have to synchronously fetch the content over the network. We can move the metric increment to the end of `getSpanContent`. The metric is emitted at the layer level and so requires the layer digest dimension, so we'll have to figure out a way to pass the layer digest into the span manager, or omit it entirely, though I think it's helpful to be able to aggregate at the layer level.

Host information

  1. OS: N/A
  2. Snapshotter Version: v0.4.1
  3. Containerd Version: N/A

Any additional context or information about the bug

Related to: #749 and #225