beam-telemetry/telemetry

Instrumentation hooks for trace context propagation


I'm interested in seeing telemetry extended with an API to enable an additional type of instrumentation - the mutation of an outbound request / query with Distributed Trace context information.

The general API could be implemented as an instrumentation hook that a library builds into its code, enabling telemetry handlers to mutate specific data structures.

Taking HTTP requests as an example, we could imagine an HTTP client library adding telemetry hooks:

defmodule HttpClient do
  def request(url, body, headers) do
    # Ask any attached telemetry handlers to contribute extra request headers
    # (e.g. distributed tracing headers) via the proposed :telemetry.decorate.
    extra_headers =
      :telemetry.decorate(
        [:httpoison, :request, :headers],
        %{url: url, body: body, headers: headers}
      )

    # The library stays in control: it can validate whatever comes back
    # before merging it into the outgoing request.
    validate(extra_headers)
    do_request(url, body, headers ++ extra_headers)
  end
end

Then a handler could attach to [:httpoison, :request, :headers], and do some work in the handler...

defmodule MyTelemetryHandler do
  # Attached to [:httpoison, :request, :headers]; receives the request metadata
  # and returns the extra headers the library should append to the request.
  def handle([:httpoison, :request, :headers], request) do
    context = get_trace_context(request)
    generate_distributed_tracing_headers(context)
  end
end

This would provide a general mechanism for outbound instrumentation, without hard-wiring to any vendor-specific instrumentation details.

Very open to discussions about strategy and naming!

Issues to sort out:

  • Clarify naming for the various parts of the puzzle
  • Account for collecting annotations from multiple handlers (see the sketch after this list)
  • Have a mechanism or pattern for validating the data returned from handlers
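
To picture the multiple-handler case, one option is for decorate to behave as a fold over the attached handlers and accumulate whatever each one returns. A minimal sketch, assuming handlers return lists of annotations and that a handlers_for/1 lookup exists (both are assumptions, not an API commitment):

defmodule DecorateCollectSketch do
  # Hypothetical decorate/2: each attached handler returns a list of annotations
  # (for HTTP, extra headers) and decorate concatenates them in attachment order.
  def decorate(event_name, metadata) do
    Enum.reduce(handlers_for(event_name), [], fn handler_fun, acc ->
      acc ++ handler_fun.(event_name, metadata)
    end)
  end

  # Stand-in for however telemetry would look up the attached handlers.
  defp handlers_for(_event_name), do: []
end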

For context, this issue has been discussed before in #24 and beam-telemetry/telemetry_metrics#46

In my mind, telemetry is a great place for this. It's the library that we're asking all other libraries to leverage to instrument themselves, and I see it as the most neutral thing possible that could accomplish this. Really, it's just a thin event dispatch mechanism that doesn't try to do much beyond being the connection point between the instrumented library and the consumer of that instrumentation.

I think it's great that OpenTelemetry is trying to standardize more in this space, but I also think the reality of the situation is that those standards are not yet stabilized, and who knows what they'll look like in 5 years. Given that, I think it's even more powerful for telemetry to be fairly agnostic about things and leave most of the work to the handlers themselves.

This gives us the best chance that the instrumentation itself is stable and can live on (since it's the thing that will be most interwoven in numerous codebases across the ecosystem), while the handlers can evolve to match the latest standards and best practices for observability.

For example - imagine the trace-context header changes format in 2 years. I'd much prefer if the open-telemetry handler was the only thing that needed to change to support that, rather than every single HTTP library having to be updated.

This opentelemetry RFC might be useful when designing this open-telemetry/oteps#42

I'm not sure it belongs in telemetry, but also don't have a suggested alternative that isn't a new context library which I'm also not sold on.

Why wouldn't it belong in telemetry?

I don't think there's much to this API, really - it's not trying to solve all the problems of context propagation. It simply provides instrumentation hooks that can be widely adopted, so that it's easy for people to build and evolve solutions to context propagation without requiring code changes throughout the ecosystem.

So, from a more technical point of view: the difference between such a :telemetry.decorate and :telemetry.execute would be that the former returns some modified state?

one returns some modified state

Exactly

Assuming that such a :telemetry.decorate behaves like a regular fold function:

  1. When one of the handlers fails, a normal :telemetry.execute call simply logs the error and detaches the buggy handler.
    For :telemetry.decorate, what should I expect?
    a) the whole decorate call fails
    b) the whole decorate returns the untouched state and logs the error
    c) the decoration is applied for all handlers except the buggy one, and the error is logged

I think option c) fits the tracing use case best, because it does not break the critical path of the call and still applies the other handlers (a sketch of this behavior follows the list).

  2. Telemetry would not guarantee the order of the handlers, right?
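
Per handler call, option c) could look roughly like the sketch below, assuming decorate folds over the attached handlers as described above; apply_handler/4 and the list accumulator are made-up names:

defmodule DecorateSafetySketch do
  require Logger

  # Option c): a buggy handler contributes nothing, the error is logged (and the
  # handler could be detached, as execute/3 does today), and the fold carries on
  # with the annotations gathered so far.
  def apply_handler(handler_fun, event_name, metadata, acc) do
    acc ++ handler_fun.(event_name, metadata)
  rescue
    error ->
      Logger.error("decorate handler failed: #{inspect(error)}")
      acc
  end
end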

decorate is "context" functionality. Maybe telemetry should be an abstraction for context that projects can share, but if so I don't think it can just be the addition of decorate. If it is going to become the "context" library for Erlang/Elixir we should discuss how that would look.

I misunderstood this at first. Based on the Observability WG meeting discussion, I now understand this to be as if an HTTP library had an add_headers callback that you could register with to have it update the headers with trace context, instead of first calling a function that updates the headers and then passing those to the HTTP client. Correct?

My reservation about this proposal is that it would not be a good idea for libraries, such as an HTTP client, to assume that we want to pass context forward. So I think that it has to be explicitly passed in. There may still be a need for standardization, but I think it would be a bit harmful if libraries start assuming all context is shared.

I'm inclined to agree with José on this. How would the client be able to discern at the library level when you intend to propagate context within your systems versus external calls where you may not want to leak that? It would seem that explicitly declaring behavior at the client implementation level gives the most control. I used the separate clients strategy in our systems and it worked well.

I agree that I would not want context injected in all HTTP calls. I would want to have to do it explicitly, or perhaps by using a special client module where I clearly know that it has this behavior. One pretty common pattern is to build a module in your app code that’s specifically for talking to another external service. That gives a pretty easy point at which you can decide whether you want to inject context for that particular call. For example: https://github.com/spandex-project/spandex_example/blob/master/plug_gateway/lib/plug_gateway/backend_client.ex#L8
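
A minimal sketch of that pattern, with a hypothetical app-level client module (HTTPoison is just an example client, and fetch_trace_headers/0 stands in for whatever the chosen tracing library provides):

defmodule MyApp.BackendClient do
  # The one module in the app that talks to this particular external service,
  # and therefore the one explicit place that decides to inject trace context.
  def get(path) do
    headers = [{"accept", "application/json"} | fetch_trace_headers()]
    HTTPoison.get("https://backend.internal" <> path, headers)
  end

  # Placeholder for the tracing library's header-injection call.
  defp fetch_trace_headers, do: []
end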

The approach the last few of you point to is definitely workable, but it forgoes any solution that works "out of the box".

New Relic agents try to wire these types of things up for the user to make it such that achieving observability is as easy as possible.

We take care of any security concerns by:

  • Not passing anything sensitive in the context header (i.e. just a few IDs that are inconsequential to pass to any arbitrary server)
  • Validating that the source of the request is "trusted" at each hop before connecting a Distributed Trace

Given the push-back, I'm assuming that with other systems, people are passing sensitive information in the trace context itself? That itself feels dangerous to me but perhaps it's a pattern out there that people use?

Leaving it all up to the user presents challenges as soon as we move outside of HTTP. Take for example database calls. There's no concept of "headers" so what some tracing packages do is add a comment to the SQL query itself that serializes the context (ex: https://docs.vividcortex.com/general-reference/query-tags/). I'm not clear how that would be achieved by the end user...
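
For illustration, the query-tag idea amounts to something like the following sketch (the comment format and keys are whatever the tracing backend expects; these are made up):

defmodule QueryTagSketch do
  # Appends the trace context to a SQL statement as a comment, the way some
  # tracing tools "tag" queries when there is no header-like channel available.
  def tag_query(sql, trace_id, span_id) do
    sql <> " /* trace_id='#{trace_id}', span_id='#{span_id}' */"
  end
end

# QueryTagSketch.tag_query("SELECT * FROM users", "abc123", "def456")
# => "SELECT * FROM users /* trace_id='abc123', span_id='def456' */"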

I think most of the time, there's nothing particularly sensitive in the trace context (just trace/span IDs). It's possible to include arbitrary "baggage" data which would probably be more sensitive, so there would need to be a way to only return the non-sensitive information at least, which feels like it adds some complexity.

I think the main issue I have is that I typically don't want magic things to happen that I didn't explicitly ask for and can't easily tell where they're coming from (this is one of the main reasons I like Elixir), so I prefer to have to manually inject trace context where I want it. That said, I think I'm getting off-topic on this thread because that's only one possible use for this proposed API.

My main concern about the proposed API itself (as opposed to the intended use case) is that it feels weird, and potentially hard to troubleshoot, when you're making a blind call to some unknown number of functions that have registered themselves to listen to that hook and each have an opportunity to transform the data somehow, in an undefined order. By comparison, it would be a lot more clear what's going on if the HTTP client did something like how Tesla supports middleware for injecting trace context in request headers, or parsing some kind of context or metadata from the response.
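
For reference, a Tesla-style middleware for this looks roughly like the sketch below (fetch_trace_headers/0 is a placeholder for the actual tracing library call):

defmodule Tracing.TeslaMiddleware do
  # Injects trace headers into every request made by a client that explicitly
  # plugs this middleware, so nothing happens by magic.
  @behaviour Tesla.Middleware

  @impl true
  def call(env, next, _opts) do
    env
    |> Tesla.put_headers(fetch_trace_headers())
    |> Tesla.run(next)
  end

  defp fetch_trace_headers, do: []
end

# In a client module built with use Tesla:
#   plug Tracing.TeslaMiddleware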

it would be a lot more clear what's going on if the HTTP client did something like how Tesla supports middleware

That's true but it doesn't seem within the scope of telemetry to try and enforce a middleware pattern on any library that wants to add this kind of instrumentation...

I see that there's a variety of concerns with this, I'm going to ponder this for a while and see if I can come up with another approach

Yeah the thing I'm struggling with is that I don't want to build a simpler-but-worse version of OpenTelemetry, because maybe then the answer is that library authors should just use the OpenTelemetry API dependency, since it's only a lightweight API and doesn't actually do anything unless the user chooses to include the SDK/implementation. The downside with that is that then there's some confusion about how to handle some libraries directly using OTel and others using Telemetry - would we need an OTel exporter that injects things back into Telemetry? 😬

At least for the cases of New Relic / AppSignal / Scout / etc. who want to have an easy integration for their customers, it seems like the right path is a deeper OTel integration, and it's "just" a matter of getting library authors to call :ot_propagation.http_inject/1 for example, so that they can automatically inject headers if their users configure it to do so.

It seems like what you're proposing here is a more-generic way for libraries to register that they would like to be involved whenever another library has an opportunity to inject HTTP headers (you're specifically suggesting :httpoison above, but I think it would be better to make it work across libraries the same way if possible). This would require that the HTTP library knows to pass its headers into this system and use the transformed result that comes back. Maybe it makes sense to simply have a :telemetry.http_inject/1 (and probably :telemetry.http_extract/1 that web servers could similarly use to automatically extract context from inbound requests), and then a way to register an injector and extractor so that OpenTelemetry (or something else) could be plugged into the "BEAM standard" Telemetry context propagation system. If nothing is plugged in, then it just passes through the headers unmodified.

If we're concerned about being HTTP-specific, then maybe it could be something like :telemetry.inject_context(:http, headers), and you'd register yourself like :telemetry.register_injector(:http, MyHttpInjector) (which could be done from wherever makes sense, either directly by a user or in some kind of VendorXYZ.setup_integration() function on app start). Then we don't need to change anything to support context injection for some other protocol, as long as the injector knows how to properly/safely transform whatever input term it supports or expects.
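
A rough sketch of that shape, where every name (and the use of :persistent_term as the registry) is an assumption rather than a proposal detail:

defmodule ContextRegistrySketch do
  # One injector module per context kind (:http, :sql, ...). If nothing is
  # registered for a kind, inject_context/2 passes the carrier through untouched.
  def register_injector(kind, module) do
    :persistent_term.put({__MODULE__, kind}, module)
  end

  def inject_context(kind, carrier) do
    case :persistent_term.get({__MODULE__, kind}, nil) do
      nil -> carrier
      module -> module.inject(carrier)
    end
  end
end

# Registration (e.g. from VendorXYZ.setup_integration/0 or user code at app start):
#   ContextRegistrySketch.register_injector(:http, MyHttpInjector)
# Library call site:
#   headers = ContextRegistrySketch.inject_context(:http, headers)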

It seems like what you're proposing here is a more-generic way for libraries to register that they would like to be involved whenever another library has an opportunity to inject xyz

Yeah, this is exactly what I'm trying to accomplish.

I'm thinking a bit about this again...

It seemed like one major concern was that some systems might not want to "assume" that every outgoing call should pass the context around.

I'd argue that this kind of decision should be the responsibility of the telemetry handler (i.e. the APM / observability library) that is attached to the decorate hook, not the responsibility of the telemetry instrumentation API itself. A handler could easily provide a way for the user to configure which outgoing calls have context propagated and which don't.
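
For instance, a handler could consult user configuration before deciding whether to decorate a given call. A sketch, where the :propagate_hosts config key and the host check are made up for illustration:

defmodule ConfigurableTraceHandler do
  # Only returns trace headers for hosts the user has explicitly allowed, so the
  # propagation decision lives in the handler, not in the HTTP library.
  def handle([:httpoison, :request, :headers], %{url: url}) do
    allowed_hosts = Application.get_env(:my_apm, :propagate_hosts, [])

    if URI.parse(url).host in allowed_hosts do
      generate_distributed_tracing_headers()
    else
      []
    end
  end

  # Placeholder for the APM library's actual header generation.
  defp generate_distributed_tracing_headers, do: []
end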

I think it's important to make a clear distinction between the instrumentation hooks that will be built into libraries & the functionality of observability & monitoring... Instrumentation should be simple and stable - agnostic to most concerns so that it can survive intact as things around it change. That gives the observability solution the most flexibility & leaves it responsible for the job it's good at.

Other concerns - the structure of the data that would be returned, handling error cases, etc. - all seem solvable once we decide this is a path we want to take. We can structure it so that it provides enough safety to be relied upon, similarly to how existing handlers are managed & detached upon failure...