openzipkin/zipkin4net

Client/server tracing won't work if server calls another endpoint of its own

s-stoffel opened this issue · 3 comments

When moving from Medidata.ZipkinTracerModule to zipkin4net, I observed the following problem.

One of our endpoints using zipkin tracing is calling another endpoint of the same server/application (i.e. it's the same process). Tracing is enabled for incoming requests (ZipkinMiddleware for Owin) and also for outgoing HTTP requests (HttpClient with TracingHandler). We would have expected that in the Zipkin UI the call to the second ("inner") endpoint would be properly listed as a sub-span of the first ("outer") endpoint. But although the sub-span was somehow displayed, it was shown as "unknownservice", and the "client send" annotation was missing in the details.

The easiest way to reproduce this situation with the examples in the zipkin4net repository is to replace the callServiceUrl (pointing to the backend) in the App.config of example.owin.frontend with an URL pointing to itself. For example http://localhost:8081/foo - this way the frontend, when called with http://localhost:8081, will call itself again once and return 404 (the suffix in the path preventing an infinite loop). Start the frontend service and call http://localhost:8081. This will return the 404 response (as expected). Looking at Zipkin UI it should look similar to this:
grafik
grafik

After intensive debugging I could see that the spans that are sent by the ZipkinTracerReporter/HttpZipkinSender were kind of mixed up: one span containing sr, ss and cr. And another span containing only cs. The third span (root span, containing sr + ss from the "outer" call) was the only span that looked correct. See the attached spans.txt file.

After further debugging I could find out that the problem is rooted in the ZipkinTracer.Record method: a SpanMap is used to get an already started span (if any) with the same context as the current record's. The key of the underlying dictionary is the trace context (ITraceContext / SpanState). The problem is that the client side span and the server side span will have an "equal" context (in the sense of Equals/GetHashCode), thus having the same place in the dictionary. The flow of what happens here is basically:

  • ClientSend is recorded. Dictionary has no span for that context --> it is created + added.
  • ServerReceive is recorded. Dictionary finds span for that context --> record is added to span.
  • ServerSend is recorded. Dictionary finds span for that context --> record is added to span. Span is removed from dictionary and reported (triggered because of ServerSend). So a span with cs, sr and ss is reported.
  • ClientReceive is recorded. Dictionary has no span for that context (because it was removed before) --> it is created + added. Span is removed from dictionary and reported (triggered because of ClientReceive). So a span with only cr is reported.

This can only happen if all is done in the same process, of course. I don't know if that's an invalid scenario, but with the Medidata.ZipkinTracerModule it had worked, so I hope we can get it working here, too.

in brave, we fixed this by adding local root to the dict key. local root is the first time this span hit process https://github.com/openzipkin/brave/blob/master/brave/src/main/java/brave/internal/recorder/PendingSpans.java#L167

I have tried to add a test reproducing the problem and a solution attempt in an own fork of the repo: s-stoffel#1

keep in mind same process means loopback RPC. for normal intermediate I think here there is concept of local operation