beam-telemetry/telemetry

span needs to pass context to the stop event

tsloughter opened this issue · 16 comments

Opening this really quick, can't expand on it right this second. But the span function needs to be able to take some result from the start event to pass to the stop event so that the right span can be ended.

That or span needs to be able to work another way that isn't based on events.

Trying to figure out how to integrate with OpenTelemetry 1.0 without a lot of overhead for the user and so that it doesn't have the issue of not knowing what span you are end'ing when end is called in the stop event.

So basically we need to be able to pass some kind of "token" to span/3 that would be included in both start and stop metadata? Or that span/3 would generate that "token" itself and included it in both events?

Generate/return and then pass to stop. No need for it there to be any token passed to start. The point is to be able to end the span that was started in start.

Cant you correlate them based on the stack? Or the idea is exactly to simplify things?

The stack isn't reliable. OpenTelemetry doesn't keep a stack because of this. In Otel now you have to either manually call a function that sets the span in the pdict or use with_span which takes a function (or in the case of Elixir is a macro with a body) and will set the span to current in the pdict because it can also end the correct one when the function ends.

A stack means a call to end is just a hope that you are popping the stack in the same order and that you don't miss any.

Sounds good to me, it should be an easy change. Is make_ref() fine? Or do we need it to be an integer?

In a perfect world it would be a result of another call so we could return the actual span context.

I suppose since span acts like our Otel with_span we don't have to worry about stop running in a different process, that was a big reason I wanted it to be the actual span context, but now that I think about it I guess that isn't required.

I guess in a real perfect world we could just set what the span function does entirely so it doesn't have to trigger events but instead could just call OpenTelemetry's with_span with the function passed to span and then nothing has to be done to metadata.

Pretty sure I brought that up at one point but I can't remember with who. Instead of having span result in multiple events have it only trigger one event, an event that includes the function to run.

So yea, a ref should be fine, @bryannaegele might have some thoughts on that first though, but I'd like to have the idea of a telemetry span function that I guess triggers 1 event and includes the function so that the integration can do all the surrounding span logic and run the user's function itself.

Also, a side note, my understanding is Plug and Phoenix (probably Ecto too?) don't use span currently. To have consistent and reliable traces that needs to change.

Sounds like we're on the same page. I was thinking of just tracking a counter in the pdict to keep memory down. Refs are a little heavy and we just need something unique within a process. Folks can build up from there if they need to track at the node or global level since they'd have access to the counter, event name, pid, etc during the callback.

I'll throw a PR up for comment. I'll still need to do a stack "best effort" implementation as a fallback in the otel-telemetry bridge library while these things propagate, but this should just add a further level of guarantee.

Looks good.

I still think adding handle_span would be nice instead :). Is it because that is an API change or ppl just not a fan of the idea? Instead of changing how current span works it could be a new function with_span. But I also see that would be confusing to have both...

A new function will cause further fragmentation. People won't migrate to it quickly, which means less useful information in OTel. I would just go ahead and change span. Adding a new key to the map should be backwards compatible as long as the key is considerably namespaced.

Pretty sure I brought that up at one point but I can't remember with who. Instead of having span result in multiple events have it only trigger one event, an event that includes the function to run.

I think one of the reasons for including the context is just so we know it was emitted by telemetry and that it has certain guarantees. Otherwise people can send things that look like span but aren't really (i.e. they my forget to emit the close on error or similar).

Yea, agreed about the fragmentation and adoption :(

I'm not sure what you mean about things that look like spans but really aren't in response to my proposal about span being able to be like handle_span.

To be clear I was suggesting a call to span like it exists now:

telemetry:span(Name, Metadata, Fun)

Instead of the current logic it runs would call handle_span that the user can supply, and in OpenTelemetry's case it would be something like:

handle_span(Name, Metadata, Fun) ->
    ?with_span(Name, #{attributes => Metadata}, Fun).

I'm not sure what you mean about things that look like spans but really aren't in response to my proposal about span being able to be like handle_span.

@tsloughter it is my bad, I quoted the wrong thing. It was meant to be a reply to this:

Also, a side note, my understanding is Plug and Phoenix (probably Ecto too?) don't use span currently. To have consistent and reliable traces that needs to change.

Adding the context allows us to differentiate between telemetry span and a hand-rolled one. :)


In any case, I don't know how OTel would tell telemetry to use a particular with span function (and how those would compose if multiple people want to use span differnetly).

Aah, ok. But then not sure how that relates to plug, phoenix, ecto not using span :). Is the plan to update them to doing so so this context information is available?

As for with_span I was suggesting it act just like handle_event but instead of calling handle_event it looks up in the ets table (the same or a different table) for any attachments and calls the handle_span function. So only difference is the name of the handle function and the fact that it's arguments include a function to run.

The execution model in Plug and Phoenix prevent any macro wrapping and Ecto is still a single point "after-the-fact" event. I don't see a path to a world where everything fits into a single function call or macro. The other thing to consider is there is no mechanism for further adding attributes in the handle_span scenario which you have in a tracing library, so what you have access to via telemetry is different between the start and stop events.

Hm? You can still add attributes with handle_span. The users code can use Otel directly to do things like add attributes to the currently active span, the handle_span is so a library like Plug can be made to create a span with Otel if the user wants.

I thought Plug or Phoenix now did something with try/after for sending the events and would thus be able to instead use span and pass the body as a function to it instead of the manual events it does before try and in after.

The handler being called can be named anything, it doesn't have to be handle_event, that's just a convention in some libraries. It can just be an anonymous function if you want. I guess that's why I'm getting confused as to what the difference is you're proposing. It's still a single handler that has different levels of detail executing at the start of the function and at the end.

Oh right, bleh. Difference is just that span takes a function to run, so that changes the arity/spec of the callback.