open-telemetry/opamp-spec

RemoteConfigStatus cannot return status per configuration passed in AgentRemoteConfig AgentConfigMap

andykellr opened this issue · 5 comments

It was mentioned in this discussion that it would be useful to be able to associate the status in RemoteConfigStatus with a particular configuration supplied in AgentRemoteConfig. Currently there is only one top-level status and error_message.

One solution would be to add a corresponding map to RemoteConfigStatus in which each individual status could be reported.
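
For illustration, here is a rough protobuf sketch of what that could look like. None of this is in the spec: the ConfigFileStatus message, the new map field, and its field number are hypothetical, and the keys are assumed to mirror the keys of AgentConfigMap.config_map.

```protobuf
// Hypothetical sketch only, not a spec change. The new map is keyed by the
// same file names used in AgentRemoteConfig.config.config_map, so the server
// can tell which config entry each status refers to.
message RemoteConfigStatus {
  // ... existing fields (last_remote_config_hash, status, error_message)
  //     keep their current field numbers ...

  // NEW (illustrative): one entry per config file received in
  // AgentConfigMap, assuming the next field number is free.
  map<string, ConfigFileStatus> config_file_status_map = 4;
}

// Hypothetical per-entry status, mirroring the existing top-level
// status/error_message pair.
message ConfigFileStatus {
  RemoteConfigStatuses status = 1;  // reuse the existing status enum
  string error_message = 2;
}
```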

The original discussion was about agent health, and the request was about being able to report more detail about that health.

This issue seems to address a different problem: the ability to report more fine-grained status/errors about the configs received via the AgentRemoteConfig message.

I think I am confused about why we think this is the same problem.

Discussed this in the workgroup meeting today and decided that we would like to see more specific examples and use cases for this capability.

We will likely be able to add this additional per-config status before or after the 1.0 release in a backward-compatible way, as an optional field. But before we do that, we need to understand the use cases better.

Agreed that the problem that was originally discussed isn't necessarily coupled to AgentConfigMap, though, depending on the agent, its health statuses may correlate well with separate config files. We should probably decouple the initial conversation from this.

In Elastic Agent, we currently report health at a few different levels of granularity. Regardless of the level of granularity, we report the same two fields:

  • A status enum value (healthy, unhealthy, updating, upgrading, etc.)
  • A human readable message string to provide additional information

This is pretty similar to RemoteConfigStatus today, though the enum values are a bit different.

Here are the levels of granularity we report these on:

  • The overall agent health
  • Individual health statuses for each "component"
    • A component represents either a specific receiver or a specific exporter
    • Each component has an associated health status and message
    • Each component is made up of one or two "units", each also with a health status and message:
      • The receiver itself (if the component is for a receiver)
      • The output queue of the receiver to the exporter
    • Generally, each component maps to a specific block of the agent's configuration: either the receiver's own configuration block or the exporter configuration (which is usually shared across most/all receivers). A rough sketch of this hierarchy follows the list.
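
To make that concrete, here is a very rough sketch of how the hierarchy could be expressed as a status structure. Every message, field, and enum name below is made up for illustration; nothing here is in the spec or proposed as-is.

```protobuf
// Illustrative only: one possible shape for per-component health reporting.
message AgentHealthSketch {
  HealthStatusSketch status = 1;   // overall agent health
  string message = 2;              // human-readable detail
  // Keyed by component name, e.g. a specific receiver or exporter.
  map<string, ComponentHealthSketch> components = 3;
}

message ComponentHealthSketch {
  HealthStatusSketch status = 1;
  string message = 2;
  // The one or two "units" that make up the component, e.g. the receiver
  // itself and its output queue to the exporter.
  map<string, UnitHealthSketch> units = 3;
}

message UnitHealthSketch {
  HealthStatusSketch status = 1;
  string message = 2;
}

enum HealthStatusSketch {
  HEALTH_STATUS_UNSET = 0;
  HEALTH_STATUS_HEALTHY = 1;
  HEALTH_STATUS_UNHEALTHY = 2;
  HEALTH_STATUS_UPDATING = 3;
  HEALTH_STATUS_UPGRADING = 4;
}
```

The point is only that the same status + message pair repeats at each level, so the backend can aggregate at whichever level it cares about.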

Reporting health at these levels of granularity allows us to answer questions like:

  • Which / how many agents are having health issues with receiver X with health status Y?
  • Which / how many agents are having health issues with exporter X with health status Y?
  • If an agent is experiencing queueing, which receiver is saturating the exporter?
  • Alert me when X% of agents running exporter Y exceeds Z

We prefer to include this information in the agent management protocol, instead of shipping it directly to the telemetry backend, for three related reasons:

  1. Querying capabilities: it's much easier to efficiently query a single "table" or "index" in the management datastore than to first query the telemetry datastore and then pass the results (potentially 100k+ agent IDs) to the management datastore in order to answer a specific question. This is especially difficult to do at scale if you need any counts/aggregations on data that crosses the two datastores.
  2. Simpler implementation: it could be possible to denormalize the data from the telemetry datastore to the management datastore to achieve (1), but it requires a significantly more complex implementation to "tap" into the incoming data. Not only is this more complex to implement, but it's harder to deploy and manage.
  3. Scalability: O11y telemetry is generally reported on a periodic basis (e.g. every 10s). If we were to denormalize this data into the management datastore, we would increase its write throughput significantly. However, if the agent only reports this data when it changes, we can minimize the write overhead on the management datastore significantly. This could also be solved by making the denormalization process stateful to avoid these writes, but that again results in a more complex implementation and requires sharding and/or vertical scaling of the denormalization process.

All of that said, there does need to be some "line in the sand" here and I think we should try to discuss where that should be. For instance, there are all kinds of metrics that a user may want to filter agents based on, such as write throughput ("give me all agents writing more than 10k EPS"). I don't think it's feasible to report all of these on the management protocol.

From my perspective, we could draw the line at health status, since I think this is the most useful thing to aggregate on in the UI / management layer. Aggregations are less useful for metrics, and filtering may be good enough there. As I mentioned above, aggregating across two different datastores is more difficult, and health status is likely to be a common aggregation target.

Note: I translated most of the terminology here to the OTel Collector equivalents. Elastic Agent experts will notice that we don't have "receivers" and "exporters"; instead we have "inputs" and "outputs".

We are preparing to declare stability of the spec soon. I would prefer not to make big changes to the spec before stability is declared. Additive changes will, of course, be allowed after the spec is declared stable.

What I think is important to do now is to understand whether the extra details about per-component status can be added in a non-breaking manner, as an additive change. From my cursory reading and incomplete understanding of the use case, it appears to be possible. If anyone thinks that this is not the case, please speak up; otherwise we will postpone the resolution of this issue until OpAMP spec 1.0 is released.
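
For concreteness, "additive" here would mean something along these lines (the field name and number are placeholders, the existing fields are elided, and the message name is only an example; this is not a proposal, just an illustration of why such a change is non-breaking):

```protobuf
// Illustrative only: adding a field with a previously unused number to an
// existing message is wire-compatible in protobuf. Old agents simply never
// set the field, and old servers skip it as an unknown field when decoding.
message AgentHealth {  // or whatever top-level health message the spec settles on
  // ... existing fields keep their current numbers ...

  // Hypothetical new optional detail, e.g. the per-component sketch above.
  map<string, ComponentHealthSketch> component_health_map = 100;
}
```

The same reasoning applies to the per-config status map on RemoteConfigStatus sketched earlier in this thread.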

Discussed in the workgroup meeting today and decided to postpone this unless we hear new arguments for why this is needed before the 1.0 release.