beam-telemetry/telemetry_metrics

Introduce trace/computed_trace metrics

Closed this issue · 55 comments

@arkgil @bryannaegele @GregMefford @tsloughter

Tracing has been a hot topic lately, especially regarding what the conventions should be in telemetry. Although tracing would not be used by some reporters, it will likely be used by OpenCensus, APMs, Spandex, and others. This is also a good opportunity to standardize on tracing conventions (which we would have to document on trace).

Therefore I propose two new functions to be added to Telemetry.Metrics: trace and computed_trace.

trace

Trace is used when there are two telemetry events: foo.bar.start and foo.bar.stop. The metric will be defined with the event prefix:

trace("foo.bar")

computed_trace

A computed trace is a single event with a duration that will be retrofitted as a trace. This would be useful for projects like Ecto. The metric will receive the name of a full event + measurement with the duration:

computed_trace("my_app.repo.query.duration")

I am not a big fan of computed trace, because the name almost makes it sound like a good thing, but computed traces are almost always a hack, so it would be nice to have a name that better reflects that. retrofit_trace sounds way too horrible though.

Comments

Thoughts?

Thanks for getting the conversation started. ❤️

Background

To hopefully save 1000 words, I've put together a picture so we're on the same page about the data model of what we're talking about:

[Figure: "Telemetry Trace Figures 001", a diagram of spans nested within a trace across services]

Each box is a "span" that has a size defined by the time at which it starts and stops. The vertical direction illustrates what the "parent" of each span is (both synchronous and asynchronous). The collection of all the related spans is called a "trace," so each span will know what its parent span is (unless it's the root span in the trace) and what trace it belongs to (among other metadata). The colors of the boxes illustrate what "service" the spans came from, which could be different libraries like Ecto and Phoenix, or it could be various HTTP services making synchronous calls to each other over the network. It's largely up to the application developer to decide how they want to model these things, but it's common for libraries to expose themselves as sub-services within the actual application that's using it.

trace

To clarify what we're trying to solve here, it sounds like the proposal is that this new Telemetry.Metrics.trace/1 function would simply return a struct like the other functions, which declares what the start and stop metrics will be, based on the prefix specified. Libraries would use this function to make it easy for users of that library to discover what metrics it exposes (e.g. by having a MyLibrary.telemetry_metrics/0 function that returns the results of calling these functions in a list). These lists of metric definitions from various libraries could be filtered or merged by the application using them, to customize which metrics should be presented to each Reporter.
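
For concreteness, here is a rough sketch of what such a library-provided list could look like, assuming the proposed trace/1 helper exists alongside the current Telemetry.Metrics helpers (the module and event names are made up):

# Hypothetical library module: "MyLibrary" and its events are illustrative, and
# trace/1 is the helper proposed in this issue rather than an existing function.
defmodule MyLibrary.Telemetry do
  import Telemetry.Metrics

  def telemetry_metrics do
    [
      trace("my_library.process"),
      counter("my_library.process.start.system_time"),
      summary("my_library.process.stop.duration")
    ]
  end
end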

Perhaps the details are out of scope for this particular issue, but OpenTelemetry and other APM solutions would implement their own reporters that know how to accept these metric lists, potentially filtering out the ones they don't understand and handling the ones they do (e.g. maybe some APM solutions can translate a distribution into StatsD metrics and also a trace into their internal trace/span structure).

Does that all seem right so far? I think that should work, given that :telemetry.execute/3 fires synchronously from the same process where the actual work is being done, so it essentially gives the APM library a hook where it could do whatever it needs in the process dictionary or :seq_trace label to keep track of the trace and span metadata.
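
As a rough illustration of that hook (not any existing APM integration), a handler attached to hypothetical start/stop events could keep a span stack in the process dictionary of the instrumented process; the handler id, event names, and pdict key below are all made up:

defmodule PdictTracer do
  def attach do
    :telemetry.attach_many(
      "pdict-tracer",
      [[:my_app, :work, :start], [:my_app, :work, :stop]],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Runs in the instrumented process, so per-process span state is available.
  def handle_event([:my_app, :work, :start], _measurements, metadata, _config) do
    stack = Process.get(:span_stack, [])
    Process.put(:span_stack, [{metadata, System.monotonic_time()} | stack])
  end

  def handle_event([:my_app, :work, :stop], _measurements, _metadata, _config) do
    case Process.get(:span_stack, []) do
      [{metadata, started_at} | rest] ->
        Process.put(:span_stack, rest)
        # a real tracer would finish the current span here and record the duration
        IO.inspect({metadata, System.monotonic_time() - started_at}, label: "finished span")

      [] ->
        :ok
    end
  end
end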

Given the description in the Background above, I'm not sure whether it makes sense to provide both trace and span functions, or just span. The problem is that this is typically up to the application developer to decide what they want to model as a trace vs. a span. For example, when a request comes into your Endpoint, Phoenix would consider that a trace because that's its main entry-point, but maybe we also want to trace the request through Cowboy, at which point, the Phoenix Endpoint is just a sub-span of the overall trace. Maybe it makes sense for libraries to simply describe what spans they expose, and the APM reporter would need to be configured such that certain top-level spans are handled by starting/finishing the overall trace instead of a span. It could look something like this:

# Grab all the metrics defined by all the libraries we're using.
# We could filter them if we know there are some we don't want,
# or we could inject our own custom application metrics/spans.
metrics = Enum.concat([
  Phoenix.telemetry_metrics(),
  Ecto.telemetry_metrics(),
  :brod.telemetry_metrics()
])

# These span names will be treated as root spans in an overall trace
# because they represent primary entry-points to our application
traces = ["phoenix.endpoint", "brod.consume"]

Supervisor.start_link(
  [{:ot_telemetry_reporter, metrics: metrics, traces: traces}],
  strategy: :one_for_one
)

computed_trace

I don't like this concept much either. I'd like to understand better why Ecto, for example, can't expose a useful start and stop for some meaningful internal operations. I understand that there are lots of complicated cases where things are being done in parallel, but the beauty of distributed tracing is that we can model the parallel processes using spans (if it makes sense) instead of settling for one metric at the end. For example, when I call Repo.all/2, it seems like it should be possible to fire a start right at the beginning of that function call, and fire a stop just before returning. Going a layer deeper, I imagine there are functions somewhere in the adapter layer that are translating the query into SQL, and start/stop events could fire inside those functions, similarly with functions that are getting a connection to the DB, sending the request, and decoding the response.

I would really like to avoid supporting this at all if we can help it. It's problematic to back-calculate spans because if you never have a start event to "open" the span, then any spans created by other code during that time will not be properly nested underneath, but will appear to happen concurrently, since we don't know they exist until the sub-spans are already closed, making it impossible to establish the parent-child relationships the way they should have been.

If there's some fundamental reason that libraries like Ecto can't expose start and stop events (I've also heard that the way Absinthe works makes tracing very difficult as well), I think it's OK to have a way to declare that a trace is defined by only an endpoint and a duration, but I'd rather avoid it if at all possible.

Thanks @GregMefford for the follow up and the fantastic image! Replies below.

trace

Does that all seem right so far?

Yes! Note, though, that I am not yet convinced about something like this:

metrics = Enum.concat([
  Phoenix.telemetry_metrics(),
  Ecto.telemetry_metrics(),
  :brod.telemetry_metrics()
])

I understand the rationale (which is to give users something that works with zero configuration) but I am struggling to balance this with giving users control over exactly which data and metadata they want to send with what precision. We are planning to solve this for now with docs and by making future Phoenix apps generate a pre-built structure out of the box.

Regarding the function name, I am fine with either span or trace.

computed_trace

I would really like to avoid supporting this at all if we can help it.

I understand it. I can't speak for Absinthe, but maybe it is hard to trace simply because the traces were not added?

Regarding Ecto though, the query event is always a "sink", there is no inner span that can help in there because it comes from a low level library called DBConnection. I am also not sure if we want to leak the implementation details: if in the next version we decide to decode the query in a separate process because of some implementation detail, should my spans actually change? I was hoping that users do not have to care. In any case, we could still have spans for when the repository function is actually called.

I can't speak for Absinthe, but maybe it is hard to trace simply because the traces were not added?

I think that is probably the case. The difficulty that I've heard people mention is that the declarative DSL they use does a lot of work at compile-time, so it's difficult for a user to insert meaningful hooks to add tracing and see what is using up the CPU time across the query process. I only bring it up because maybe it's a similar situation to Ecto's.

Regarding Ecto though, the query event is always a "sink", there is no inner span that can help in there because it comes from a low level library called DBConnection.

I think Ecto should not worry about leaking DBConnection details, but if DBConnection (or Postgrex, etc.) is the library doing the meaningful work, it should also trace itself and Ecto would only trace things at the level it is responsible for. Here is an example using made-up module/function names because I'm not familiar with how it actually works. My point is that what I want these traces to tell me is how it really works (at some useful level of abstraction) and how my CPU is spending its time during a request with as few un-accounted-for gaps as possible.

[Screenshot: example trace showing nested Ecto/DBConnection spans with made-up module/function names]

If you want to change the internal implementation to parallelize things, that's fine and I probably won't mind if I see the change in my APM tool when I update the library. If I have monitoring dashboards that are looking at details like DBConnection checkout latency, then I will have to make sure I'm paying attention to changes at that level anyway, because I have no other way to make sure my mental model of the system is accurate in order to monitor and alert on useful metrics.

That clarifies a lot, thanks @GregMefford. Regarding the proposal, should we just go ahead with span/1 then? Is it good enough? Or should we wait a bit?

I still need to think about how we would make the above become a practice in Ecto (and how to make it in a backwards compatible way, which is the hardest part). Adding outer spans (the Ecto.Repo.all above) is relatively easy but breaking up the inner parts will be a bit harder.

Not sure if this thread is primary to the discussion, but I'm curious why people think Absinthe is hard to trace. I've added Telemetry hooks into the library itself, and we fire off Telemetry events at the start & end of various actions... This code is in 1.5 which is currently released as beta so most folks haven't upgraded I suspect.

[:absinthe, :resolve, :field, :start],
[:absinthe, :resolve, :field],
[:absinthe, :execute, :operation, :start],
[:absinthe, :execute, :operation]

One thing I've observed is that some people assume that their tracing data should wind up looking like their intuition of what's happening (in Absinthe, without knowing how it's implemented, this looks very much like the "tree" of the GraphQL query itself). However, what we are tracing is the code. Execution of an Absinthe GraphQL query doesn't resemble a tree - it's basically a struct that is mapped over many times in Phases...

We should embrace that we are tracing the code execution - it's the whole point, we need to see exactly how it is behaving to know how to improve it.

Hi @binaryseed, this is great news, thanks for your work! Maybe that will fully address the problem. My only concern is that we are planning to standardize on start+stop, so maybe we should change Absinthe while it is still in beta? So it would be this:

[:absinthe, :resolve, :field, :start],
[:absinthe, :resolve, :field, :stop],
[:absinthe, :execute, :operation, :start],
[:absinthe, :execute, :operation, :stop]

Wdyt?

We should embrace that we are tracing the code execution - it's the whole point, we need to see exactly how it is behaving to know how to improve it.

Duly noted!

Oh that's good news @binaryseed! I think the issue is that people were trying to do it from outside Absinthe itself, which IMO makes less sense than doing it from inside the library, but people don't generally feel like it's in their purview to change the library in order to trace their app code. 😅

Regarding start/stop, would it make sense to support span/1 as proposed, but also a span/2 that simply specifies both the start event and the stop event by full name instead of just prefix? Then, it would be flexible enough to piggy-back on whatever Telemetry events a library is already sending today, even if they were unaware of this span concept. If they happen to have events being sent primarily for metrics purposes, then a user or an integration library (like New Relic) could repurpose them for tracing if they find that there are hooks in useful places.

We should embrace that we are tracing the code execution

Yes, that's the point I was also trying to get across - it's sort of different from "official metrics," the purpose is to show what is really happening, not a model of what is happening.

Absinthe PR sent here: absinthe-graphql/absinthe#782

Regarding supporting span/2, I am not sure. Because if we add special cases, then spandex will have to support them, open_telemetry as well, etc, etc.

@josevalim if the returned data structure defining the span is the same (i.e. in both cases it contains a complete name of a "start" and "stop" events), then I think it's okay to support this scenario.

But the first argument is still gonna be there, which might be a bit confusing:

span([:library, :call], stop_event: [:library, :call, :done])

I have a feeling that all metrics definitions in the library are a bit too complex because of many options we have, and I'd like to avoid that in this case if possible. But I understand that if we want to make start/stop a standard way of naming events then not having a two-argument version might help.

On the other hand... if the API is just to always specify the full name of the start/stop using these Telemetry.Metrics structs, then there’s no need for the start/stop convention, right? So in that case, we would simply have a span/2 that takes a start event and a stop event name, and a span/1 convenience helper that takes a prefix and applies the default suffixes.
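
To make that concrete, here is a sketch of how the prefix-to-event-name resolution could work (this is not the actual Telemetry.Metrics implementation, and the returned map is just a stand-in for whatever struct would be defined):

defmodule SpanSketch do
  # Illustrative only: resolve default :start/:stop event names from a prefix,
  # allowing overrides via options.
  def span(prefix, opts \\ []) when is_binary(prefix) do
    parts = prefix |> String.split(".") |> Enum.map(&String.to_atom/1)

    %{
      start_event: Keyword.get(opts, :start_event, parts ++ [:start]),
      stop_event: Keyword.get(opts, :stop_event, parts ++ [:stop])
    }
  end
end

SpanSketch.span("my_app.repo.query")
#=> %{start_event: [:my_app, :repo, :query, :start],
#    stop_event: [:my_app, :repo, :query, :stop]}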

We are assuming that you will have to frequently specify both start and stop but I think we should make it so that's rarely the case. In other words, it is ok for span([:library, :call], stop_event: [:library, :call, :done]) to be ugly, because we want to discourage it.

One of the goals of span/1 is to impose a convention. span/2 defeats this purpose. If we are going to have span/2, then I am not sure what it brings as benefit compared to passing something like {[:library, :call], [:library, :call, :done]} to libraries like spandex?

The benefit of having a standardized Telemetry.Metrics.span/1 and/or /2 API is that then a library can define a list of span-compatible events it claims to expose, without needing to change the names it’s already using for other metrics or fire multiple events for tracing vs. metrics. The goal would be that then OpenTelemetry/Spandex/NewRelic don’t need to have specific integrations with each library - they would just consume a list of span definitions. If libraries don’t expose an easy way to get that list, then an integration library could do it for them, or an application developer could list them out manually.

The value is that it’s a simple and standard API that isn’t vendor- or use-case-specific, and could provide a zero-configuration way for users to just “measure everything” if that’s what they want. It sounds like you’re not sold on that idea, but I’m not clear on what the alternative docs-based approach looks like - would app developers need to copy/paste all the metrics definitions into each app they write, with the default being no instrumentation? To me, that sounds like a lot of maintenance work (e.g. when new metrics are added) when I expect that most people would end up just copying the whole list from the docs anyway.

Note that NewRelic is all in on opentelemetry https://blog.newrelic.com/product-news/observability-open-instrumentation-opentelemetry/

I know the chances are low that there won't be other tracing implementations for Elixir, but at least vendors are gathering behind OTel.

I'm not sure what this actually means for Telemetry.Metrics and telemetry but, counter to what I had thought even yesterday, it might be better to standardize on the OpenTelemetry API as the "Elixir tracing api".

Not sure what that would look like. I think it is beneficial to be able to output events and have the user only optionally include OTel, turning telemetry events into spans. But any "API" I'm now thinking should be the OTel API.

@GregMefford good point. Good news is that what you are expecting and what I am expecting are orthogonal. In this case we can definitely have our cake and eat it too. We can standardize as much as possible and still allow libraries to return their own spans. So I would still advocate for span/1 and hide the more complex API behind options.

@tsloughter my concern with depending on specific tools directly is because they will undoubtedly change. The whole purpose of telemetry is exactly to be small so we can isolate ourselves from external factors. I know where telemetry will be in 10 years, because it is just 3 modules that do not have to change; I can’t say the same about OT (and it is not that I am afraid it will disappear, it is just that it will change). Also it seems the current telemetry events can already be used for tracing, so I would propose to improve what we can in this area.

Another thought: I realized the use of events like [:absinthe, :execute, :operation, :start] is problematic.

The way spans work in opentelemetry/census-erlang is there is only 1 current span per process. When you finish a span you are finishing the current span. Otherwise you would need to keep the span context in a variable to pass to finish manually.

I bring it up because I see likely confusion if it appears you can, for example, stop the last started Ecto span with [:absinthe, :execute, :operation, :stop] but since the absinthe span is the last started you are actually finishing it.

I understand the desire for the simple telemetry library for emitting events and handling them. But I am starting to think again about the overlap and duplication of work that is likely to occur, or already has in some cases -- right now mainly in the area of reporting metrics, which will have one for Telemetry.Metrics and one for OpenTelemetry.

@tsloughter well, could this be solved by a telemetry_metrics backend for OpenTelemetry? This way instead of telling people to implement a bunch of backends... we just tell them to use OpenTelemetry? We can even settle on OpenTelemetry for reporting altogether (without metrics) as long as collecting and reporting are two separate things. I just want the libraries to be able to rely on a small, almost immutable dependency as a building block.

The way spans work in opentelemetry/census-erlang is there is only 1 current span per process. When you finish a span you are finishing the current span.

@tsloughter Is this a rule of OpenTelemetry as a whole, or just a census-erlang specific rule? It strikes me as a bit opposed to the idea of Spans, since by definition they can be nested in a tree structure...

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#span

For example, in my New Relic agent, a process itself is a span, and any number of spans can be created in that process, and they will nest themselves according to how they were called.

This nesting structure is what gives Spans their descriptive power, since their nesting mirrors the call stack structure of the code. It doesn't seem like we'd want to elevate a restriction like that into telemetry itself.

@binaryseed it may have just been confusing how I wrote it. By "current span" I mean top of the stack, the one that will be popped and finished when finish is called. Its parent then becomes the current span.

I forget if this is a requirement. It is consistent with how most of the implementations work, since most create a "scope" of some sort, execute a body of code within that scope, and when that scope ends it finishes the span. You can create as many other spans within the body of that scope but they get their own scope as well.

For example in Java:

// 4. Create a scoped span; a scoped span will automatically end when closed.
// It implements AutoCloseable, so it'll be closed when the try block ends.
try (Scope scope = tracer.spanBuilder("main").startScopedSpan()) {
  System.out.println("About to do some busy work...");
  for (int i = 0; i < 10; i++) {
    doWork(i);
  }
}

Or what we did in opencensus-elixir:

with_child_span "traced" do
  :YOUR_CODE_HERE
end

For opentelemetry-elixir this will change to be consistent with the opentelemetry API:

start_span "traced" do
  :YOUR_CODE_HERE
end

There's a crucial piece of Tracing as a whole that I haven't seen mentioned - the trace context needs a mechanism for being passed outside the app. The most common case is to pass the trace context when making an external HTTP request, which enables connecting together a Distributed Trace.

Perhaps it belongs somewhere else (probably in telemetry), but similarly to how Absinthe or Ecto instruments itself, I'd love to see a way for HTTPoison or Mint to instrument itself in a vendor-agnostic way... not just timing metrics and starting & stopping spans, but crucially - accepting an additional set of HTTP headers.

Each "type" of external might have a different mechanism. HTTP has headers, but message queues, database queries, etc all have different mechanisms for passing metadata alongside the main message.

I imagine something like this, which could be as simple as a telemetry callback that returns a value:

defmodule HttpClient do
  def request(url, body, headers) do
    extra_headers =
      :telemetry.decorate([:httpoison, :request, :headers], %{url: url, body: body, headers: headers})

    validate(extra_headers)
    do_request(url, body, headers ++ extra_headers)
  end
end

In the New Relic agent for example, I'd attach to that and inside the function I'd determine if there's a current trace and if there is, I'd generate and return a header that encodes the trace context. (Right now this is manual instrumentation that the user must do at every HTTP call site.)

That's a very rough sketch, very open to other ideas, but I'm curious what folks think

@binaryseed I agree, this needs to be addressed. I'm going through this pain at work right now.

Are y'all planning to adopt the coming standard trace header? https://w3c.github.io/trace-context. This would make things very simple for everyone if we could just have the clients adopt the standard versus having to support vendor-specific sets of headers.
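
For reference, the header that spec defines is straightforward to generate once a tracer already has the ids. A minimal sketch (the module name is made up; the example ids are the ones from the W3C document):

defmodule TraceContext do
  # traceparent format: version "00", 16-byte trace id, 8-byte parent span id,
  # and a 1-byte flags field, all hex-encoded and dash-separated.
  def traceparent(trace_id, span_id, sampled? \\ true) do
    flags = if sampled?, do: "01", else: "00"

    "00-" <>
      Base.encode16(<<trace_id::128>>, case: :lower) <>
      "-" <>
      Base.encode16(<<span_id::64>>, case: :lower) <>
      "-" <> flags
  end
end

TraceContext.traceparent(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)
#=> "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"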

@binaryseed that's interesting. Let's open up another discussion for that on the telemetry repo. But that's basically a way for us to ask for more information, right? We need to figure out the API for this, what we are going to do if there are multiple decorators attached, etc. Does OpenTelemetry have anything similar to this? I guess it is the context object; do they have a way to export it?

I agree that context propagation is a critical part of tracing and it would be wonderful to have hooks in the clients for it instead of having to wrap around them in client code or in an integration library. I haven’t used it myself but I’ve heard from co-workers that Tesla’s middleware makes this very simple to do, for example. I think the goal of Mint is to be as simple as possible, so maybe this framework would belong in some higher-level abstraction like Mojito or something, or just an extension API that you could easily plug into OpenTelemetry via configuration instead of wrapping it manually. It’s not clear to me how this would make sense to include in Telemetry. To me, that’s getting closer to the point where Telemetry should just be replaced with OpenTelemetry.

@josevalim my understanding is that OpenTelemetry will standardize on the trace-context spec linked above, which is the result of converging the various vendor-specific protocols that all basically worked the same way but weren’t compatible due to naming differences, etc.

Something else to keep in mind with context propagation is that you may not always want to propagate trace context from callers or to downstream calls (maybe you consider it proprietary and don’t want to expose your trace IDs, don’t want to risk visibility into your internal infrastructure, etc.). So any integrations should make it possible to opt-in or opt-out of tracing on particular calls as needed.

@GregMefford very good point. In that case it is probably best to have convenient APIs for specifying those in Mint, Hackney, etc.

Yes, opentelemetry is standardizing on trace-context.

Clients having the ability to set functions that are called and return headers to include in requests would be useful.

Another possibility is to standardize on an internal context propagation which clients can then be configured to propagate based on the type of client they are or implement their own, of course.

So mint, hackney, etc would have a configuration boolean for propagation (likely need more complicated options, like based on path/host/whatever), and when true they call the context's serialization function for the "http" format.

I opened an issue for discussion of context propagation (aka mutable events) beam-telemetry/telemetry#48

After the meeting today, my proposal is to go ahead with span/1 that assumes :start and :stop but we will allow options if you want to specify those by hand, in case a library is not following telemetry documented standards. Sounds good to you @GregMefford?

I'm thinking about the goal of having a single library to point to for users' needs and I don't think this is going to work out.

I like the idea of telemetry being the "blessed" / "use this" library that everyone can instrument with and the user is then free to use whatever they want to actually deal with the events. But then reality sets in with more complex libraries that, if limited to telemetry, will make it harder on the user or cause telemetry/Telemetry.Metrics to end up growing to encompass OpenTelemetry.

Context propagation is part of that, but not the only issue. One example is OpenTelemetry Resources and "named tracers". These are sort of attributes that belong to all spans and metrics within a certain "scope", and it is recommended that an instrumented library always use them to "tag" spans with where they came from and give users control over, for example, whether they want any spans or metrics to be produced by library A.

Resources are similar but usually not something that would matter in the context of whether to use telemetry or OpenTelemetry directly because it is usually just things that are defined globally in the sys.config for the project or automatically collected, like the current kubernetes container name.

But Resources don't have to be static like that. We started a new working group this week to hash out the issues with the Resources spec and my main reason for getting involved with that is wanting to optionally move the service resource to the gen_server level. In that case the service will not be as simple as defining a name in your sys.config (for anyone familiar with Zipkin you might have defined a serviceName in your config that gets sent automatically by the client at the start of a connection and then is used in the UI and what not for searching and displaying spans from that service).

Basically I want to be able to have UIs like this from LightStep:

[LightStep UI screenshot]

Where the circles aren't necessarily nodes but processes in nodes and the latency is the time calls are taking from the other service that called it.

It is all kind of hand wavey right now and I don't know that it'll actually end up being useful, but it is what was most recent in my mind for shit an instrumented library might want to do (and that we'll likely encourage) that they'll have to use OpenTelemetry API directly for.

The reasons against OpenTelemetry as the "blessed" way that we point everyone to are completely right, so I think we must instead simply not expect to be able to provide users with a suggestion of a single library they can use for everything.

As for context propagation, I saw a great demo of it in .NET at kubecon which was also covered in this blog post https://devblogs.microsoft.com/aspnet/improvements-in-net-core-3-0-for-troubleshooting-and-monitoring-distributed-apps/

Part of the post is explicitly about OpenTelemetry, but if you look at the section "Passing additional context", the context, including baggage which is propagated over HTTP, is part of their core libraries and not dependent on OpenTelemetry.

Breaking out context and propagation to its own library is an option.

@tsloughter maybe I am too naive... but it seems like everything you described above could be implemented on top of telemetry?

The user could say:

:open_telemetry.set_context(...)
# call a library that calls :telemetry

and the open telemetry handler for that library would be able to access whatever was given in set_context. Or am I missing an important requirement?

Do you mean "user" as in the end user who is developing the final release that is deployed?

If so, then, no. I'm saying there are cases that the library developer needs to do it. Like for setting the library's name for the tracer used within it or setting the name of the service in a gen_server's init.

In an effort to wrap up this thread/issue, let's try to agree on the next steps:

  1. Context propagation is a complicated and important issue, but orthogonal to the decision about whether to add these specific Telemetry.Metrics helpers. We should move all further discussion about it to beam-telemetry/telemetry#48 since @binaryseed has already tried to fork the discussion over there.
  2. It looks like we agree that it makes sense to have a Telemetry.Metrics.trace/1 and/or Telemetry.Metrics.span/1 helper, to enforce some naming conventions and drive library authors and also application developers to use them more consistently.
  3. I think we should try to match the OpenTelemetry naming conventions as closely as possible, since it's likely to be the leading standard for the foreseeable future, and we want people to intuit the actual behavior based on experience working with OpenTelemetry elsewhere.
  4. I propose that we only add a Telemetry.Metrics.span/1 helper (not trace or computed_trace), to keep the API lean. This allows for a library to easily mark the beginning or end of an operation, but leaves it up to someone else (a calling library, application developer, etc.) to decide whether that span is a top-level span in a trace or not.
  5. Rather than :start and :stop as proposed, per item 3, I think we should match the OpenTelemetry naming conventions of :start and :end.

I agree with @tsloughter that this simple API is unlikely to be the only library that we recommend folks use in their apps and libraries (or else OTel likely wouldn't be as complicated as it is), but I think it should be something simple that they could use as a minimal baseline recommendation. If they find that these simple APIs don't meet the needs of their users, then they can reach for OTel instead or in addition.

If you agree with all these points, please react with 👍 so that we know you have no further concerns. If you want to clarify something, please quote-reply and we can iron out the remaining details.

Rather than :start and :stop as proposed, per item 3, I think we should match the OpenTelemetry naming conventions of :start and :end.

We should look at which libraries have adopted the start and stop convention already...

  • Absinthe has, but the upcoming release is still in beta so we could change, but we'd want to do so very soon
  • Plug has, and it's been released for many months now

I propose that we only add a Telemetry.Metrics.span/1 helper (not trace or computed_trace), to keep the API lean.

Agreed.

Rather than :start and :stop as proposed, per item 3, I think we should match the OpenTelemetry naming conventions of :start and :end.

This has two major issues:

  1. Backwards compatibility

  2. We actually chose end originally but end is a keyword in Erlang, which requires quoting when used in an event, making the UI a bit more annoying for Erlang users

Given those two, I would actually vote to keep the start/stop convention.

Ah right!
I agree to only Telemetry.Metrics.span/1 and :start/:stop. 👍

I asked this in the last meeting minutes but I should raise it here for more public comment.

How do we want to handle exception events? Is the previous OT convention of :error with the reason and stacktrace being used in OTel? Phoenix has :error_rendered now which isn't generic enough. I proposed an :exception event in a PR to Tesla elixir-tesla/tesla/pull/324, but it feels like that should possibly be merged into an :error event.

Should :stop always be the event to close the span and :error used to add span context? This feels good but I could be wildly off.

That is actually a pretty tricky question, because you need to decide whether to send an error telemetry event, raise an actual BEAM exception, or both. In the case of an actual exception, does the library have to guarantee that it has already fired a :stop event for all the spans it is responsible for? That seems unlikely to work out, but the alternative is that if you rescue the exception, you won’t have any way to know how deep in the span stack you are, so you’ll always be out of sync unless you somehow keep track of which ones were opened where and pop the stack until you’re back in sync.

I think an :error event could be useful as a standard thing though, which would work like a log event, but could be used to provide some metadata about a failure that was handled but would be useful to diagnose in the current span. For example, maybe Phoenix would emit this kind of thing when it catches an exception and renders a 500 response that’s outside of the application code a user has written. It isn’t reraising that exception and failing to return an HTTP response, but it’s very likely to be something that the user of the Phoenix library would consider an error worth capturing in the span.

The exception aspect is what I was trying to figure out. The original thought (possibly flawed) is that we could listen for an :exception event and treat it as a :stop event. That does require a pattern on the implementer of rescuing exceptions and then either handle them or reraise if you want it to crash. Phoenix takes the former approach with the :error_rendered event.

So touching back on the previous comment I made, my proposal would be a standard convention of:

  • Start event - fired at the start of what is being measured
    • Event: [:your_app, :thing_being_measured, :start]
    • Measure: %{time: System.monotonic_time()}
    • Metadata: Relevant metadata. Use the name of the argument or data structure as key in the map, e.g. %{conn: %Conn{}} in a Plug.
  • Stop event - fired at the end of what is being measured.
    • Event: [:your_app, :thing_being_measured, :stop]
    • Measure: %{duration: non_neg_integer(), ...any other measurements}
    • Metadata: Relevant metadata. Use the name of the argument or data structure as key in the map, e.g. %{conn: %Conn{}} in a Plug.
  • Error event - fired in the case of errors or exceptions which have been rescued to catch info. Exceptions should be reraised where surfacing the exception is the desired behavior
    • Event: [:your_app, :thing_being_measured, :error]
    • Measure: %{duration: non_neg_integer() | time: System.monotonic_time()} - Duration in the case of timed functions, same as the :stop event.
    • Metadata: %{kind: Exception.kind() | atom(), reason: String.t() | atom(), stacktrace: Exception.stacktrace() | nil, ... any other relevant metadata} - Stacktrace is only used in the case of an exception

Unfortunately, neither stop nor error is guaranteed to be emitted, because the process can always terminate due to a broken link. The only way to catch those would be by having a separate process as a monitor, but that may be too much overhead.

So given the lack of guarantees, I would do this:

  1. Clearly document :stop may not be called

  2. Leave the guarantees around :stop to the person adding the hooks. In some libs, you may want to have a :stop even if there is an exception, in others you do not

It may be a bit on the loose end but those are the minimum guarantees that we can provide. Thoughts?

I think if we want this to work without having to manually pass around lots of tracing context, we have to have a guarantee that every :start will have a matching :stop. If we have unmatched :start and :stop events due to something raising and rescuing in an uncontrolled way, then we will have no way to get back in sync again.

For example:

  1. Phoenix sends a :start when it begins processing the Controller action.
  2. Ecto sends a :start when it starts to work on a database call.
  3. Something crashes while working on the DB call, so Ecto never sends a :stop
  4. The user has a rescue inside their controller code, so they catch the error, but they can't really know how many nested spans have been :started but not yet :stopped. The only thing they can do at that point is tag the currently-open span with the stacktrace and mark it as an error and potentially move on if that makes sense for their application.
  5. At this point, all additional spans will appear as children of the Ecto span(s) that was never :stopped.
  6. When the Controller code is finished, Phoenix sends the :stop for the Controller trace, but it actually just closes the lowest-level Ecto span.
  7. When the entire trace lifecycle ends, our tracing library would probably decide that all remaining spans should just be closed so that the trace can be completed and sent on to the collector.

The way that we deal with this problem in Spandex (and I imagine most of the other APM tracing integrations do the same) is to encourage developers to always wrap the traced code in a try...rescue...after construct with a reraise, so that you can guarantee that spans don't get out of sync. Spandex makes this really easy with macros you can call manually, and function decorators you can add to transparently wrap a span around a whole function call. If libraries were expected to follow the same conventions in order for things to work, that would make things much simpler. It only takes one library doing the wrong thing, and the whole system falls apart. 😅

The alternative I see is that you have to hang on to the context created when a library fires a :start event, so that when it fires a :stop event, it would need to return the same reference so that the receiver could keep popping the stack of child traces until it gets back in sync. I don't like that solution much, because then we need to have a way to keep track of the references in each place, and I think that if we expect library authors to do that, we'd need to give them an easy wrapper macro to do it for them. At that point, it feels to me like the wrapper could just as easily do the try...rescue...reraise...after construct so that the receiver still sees exceptions exactly like they normally would, but they don't need to worry about matching up spans because they'll get a guaranteed :stop event (which would be sent from the after block), and potentially a separate :error event (sent from the rescue block), if we choose to have that as a standard thing.

One thing with exceptions is that with the OTP 21+ logger you can execute your handler in-process at the exception site.

That gets you access to the process & whatever data is inside the exception. Helpfully in plug the conn is in there, but in other situations you might want access to more...

I think if we want this to work without having to manually pass around lots of tracing context, we have to have a guarantee that every :start will have a matching :stop. If we have unmatched :start and :stop events due to something raising and rescuing in an uncontrolled way, then we will have no way to get back in sync again.

Good call. So if we want to enforce stop, then we have two options to recommend to users:

  1. try/after

     try do
       :telemetry.execute(start_event, start_measurements, metadata)
       # code being measured
     after
       :telemetry.execute(stop_event, stop_measurements, metadata)
     end
    
  2. try/catch/after

     try do
       :telemetry.execute(start_event, start_measurements, metadata)
       # code being measured
     catch
       kind, reason ->
         :telemetry.execute(error_event, error_measurements, error_metadata)
         :erlang.raise(kind, reason, __STACKTRACE__)
     after
       :telemetry.execute(stop_event, stop_measurements, metadata)
     end
    

I would prefer to recommend 1 due to simplicity. But can the tooling make any use of the error event? For example, would it be able to say that a span failed or similar?

Yes, the idea would be that the error event would always come between a particular :start and :stop because of the pattern used above. This allows a receiver of the events to know that it was the current span that should use the error information (we’d want to send in the kind, reason, and stack as metadata).

I was asking myself whether the library author would still need to think about where they want to send an :error event vs. not. I think it’s a safe default to always use the pattern that sends the :error event, because if it’s a meaningful operation worth tracing, then it’s also a meaningful place to know whether there was an error or not, even if it’s several spans deep into the library. For example, if the library code catches the error before it propagates to user code, that’s fine and the higher-level spans will simply not be marked as being errors. A user could still see that there was an error thrown and caught internally, though.

Should Telemetry include a wrapper function that does the try/rescue/after and fires the standard events? This seems like boiler-plate that would be unfortunate for users to have to sprinkle all over the place. I’m not familiar with how to do that in Erlang, but in Elixir, it’s a pretty easy macro like the ones I linked from Spandex above.
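
As a sketch of what such a wrapper could look like (this is not an existing telemetry API; the event suffixes and measurement shapes follow the convention discussed above):

defmodule SpanWrapper do
  def span(event_prefix, metadata, fun) do
    start_time = System.monotonic_time()
    :telemetry.execute(event_prefix ++ [:start], %{time: start_time}, metadata)

    try do
      fun.()
    catch
      kind, reason ->
        :telemetry.execute(
          event_prefix ++ [:error],
          %{duration: System.monotonic_time() - start_time},
          Map.merge(metadata, %{kind: kind, reason: reason, stacktrace: __STACKTRACE__})
        )

        :erlang.raise(kind, reason, __STACKTRACE__)
    after
      # always runs, so every :start gets a matching :stop
      :telemetry.execute(
        event_prefix ++ [:stop],
        %{duration: System.monotonic_time() - start_time},
        metadata
      )
    end
  end
end

A library would then call something like SpanWrapper.span([:my_app, :worker], %{id: id}, fn -> do_work() end) and get a balanced :start/:stop pair plus an :error event when something is thrown.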

This is where macros and anonymous functions are nice. In OTel for Erlang we offer a function version of start_span which wraps in a try/after, and in Elixir the macro:

start_span "span-1" do
   ... stuff...
end

Still think this is getting out of scope for what telemetry should be -- a simple events firer and subscriber.

Yeah, my opinion is that if we want to recommend that people use Telemetry instead of using OTel directly, we should provide a convenience function/macro to make it much easier to do the right thing (i.e. library authors don’t have to learn the hard way that they’re not guaranteeing the :stop event).

If Telemetry is only to be a simple metrics/events-firing system, then I think it won’t be suited for tracing support, and we’d need to recommend that people use OTel directly (at which point, they should probably use it for metrics as well).

Should Telemetry include a wrapper function that does the try/rescue/after and fires the standard events?

IMO this is hard because the measurements in start are different from stop. You have at least duration and different time values. We could standardize those but it wouldn't be enough. For example, in Plug/Phoenix, you want to include the updated connection, status code and what not on stop.

So I agree with the later conclusions. I think the best for now is to add span and then document the two ways you can emit the span with raw telemetry: either using try/after or try/catch/after.

Ah right, I forgot about various metadata being attached to the events. I’m 👍 on just defining the conventions then, and having docs about how libraries should ensure that spans are balanced via try/after with optional catch if you want to have the optional error telemetry event.

I have created a new issue here that reduces the scope of the discussion so we can finally add span/2: #53