spandex-project/spandex

Pass Trace between process boundaries

Closed this issue ยท 17 comments

What's is the proper way to pass Trace between Elixir processes? I don't see public methods in Tracer interface how to obtain current Trace and Spandex module uses strategy.get_trace for start_span/update_span. Either I'm missing it or maybe you have some plans to implement it in future.

You can use current_trace and current_span to get the ongoing trace/span, and then you pass them manually across process boundaries. There, you can continue it with continue_trace_from_span or continue_trace.

The Spandex API around this is a bit rough (I hope we can make it smoother as we transition to a converged OpenTelemetry API), but I described an example of wrapping around Elixir's Task.async in a recent presentation I gave at Code BEAM STO

Here's the meat of the slide in question:

image

@zachdaniel
I don't see current_trace in the sources, but it's probably possible to implement it in a similar way as current_span

@GregMefford yeah, I thought about it, but in my case I am starting new process via Supervisor.start_child/2 and also wouldn't like to create new span, only pass Trace for later usage.

Also I see that there is no easy way to start/finish span in a some child process and get updated Trace struct in parent process. The only option is to change public API of those child processes and pass Trace struct in call/cast. In Tapper lib such kind of functionality is implemented via GenServer for every trace. Here I've seen talks regarding ETS strategy which could help with passing Trace struct between process boundaries, but GenServer implementation could be more efficient as an ETS strategy could become a bottleneck at some point. Any thoughts?

Ah, you are correct there is only current_span and current_trace_*id*. Are you using spandex_datadog? If so, the idea is that child process don't need to update the parent trace. They just send the spans that were generated on their end when they are done. Datadog stitches them together for you later.

So it is only problematic if you need to update some metadata about the parent span based on something happening in the spawned process.

I see that sending only occurs on finish_trace. For finish_span there is only update Trace according configured strategy, pdict by default

Ah, yeah this is a flaw in our documentation. You should use finish_trace in the child process. It won't technically finish the entire trace, just the version of the trace that exists in the child process. When the parent process finishes its trace with finish_trace, it will be sent to Datadog and Datadog will use the parent span id from the child trace to tie the two together.

Reopening because this, at the very least, highlights some flaws in our documentation. I've been so insanely busy recently, so I really haven't had much chance to work on this stuff.

I'm trying to understand either this is a problem with documentation or finish_span should have slightly different implementation?

What would we change about finish_span? We can't send the trace from a call to finish_span because they could be intending to start a new span. Starting a trace starts a top level span (a Datadog requirement), so it is not possible to know that the trace is complete without a call to finish_trace

To clarify an assumption: we send all the spans in a given trace from a particular Erlang process when finish_trace is called. So when you cross a process boundary, you need to start a new trace that has the same ID as the parent trace (which is what continue_trace does). When the child process finishes its work, it calls finish_trace, which sends up all the spans from that process, but the parent process could have already finished or will finish after the child process, independently. When the parent finishes, it calls its own finish_trace to send up the spans that its process was responsible for.

Datadog stitches things all back together, as if it had been a trace across several micro-services.

Thanks for this explanation. After many hours of trying to get new spans created inside a thread I was now understand why your examples use continue_trace and finish_trace instead of the span functions.

Ah, glad it was helpful and sorry for the confusion and frustration you had to go through! ๐Ÿ˜…

We are experimenting with some more convenient options for automatic propagation in OpenTelemetry using seq_trace, which will hopefully eliminates some of the confusion around this in the future. ๐Ÿš€

I was able to get it working using code that looks like this

  def traced_async(module, function_name, args) do
    caseTracer.current_context() do
      {:ok, tracer_ctx} ->
        Task.async(fn ->
          Tracer.continue_trace("#{module}.#{function_name}", tracer_ctx)

          try do
            apply(module, function_name, args)
          after
           Tracer.finish_trace()
          end
        end)

      {:error, _reason} ->
        Task.async(fn ->
          apply(module, function_name, args)
        end)
    end
  end

which generates beautiful flame graphs that look like this:
Screen Shot 2019-12-19 at 9 50 21 AM

Thanks so much for all your hard work on Spandex. It has made the argument for adopting Elixir at my company so much easier because of the observability it provides!

That looks like the right call :)