Running multiple tracing implementations simultaneously
Continuing the discussion from opentracing/opentracing.io#66
Higher Order Tracers (HOT) & Migrating Between Vendors
I think there's a case for Higher Order Tracers (HOTs), i.e. Tracers that combine other tracers (at construction time), e.g.:

```python
import opentracing
from ot import hot
...
zipkinTracer = ZipkinTracer()
logTracer = LogTracer()
proprietaryTracer = ProprietaryTracer()
opentracing.tracer = hot.All(zipkinTracer, logTracer, proprietaryTracer)
```
Any in-band keys should be namespaced in order to enable more than one tracer working in parallel. Impressively, I think this might be the only extra constraint that has to hold true to enable HOTs. There will be a need to add some "magic" for handling error cases (which are now vectorized), but it shouldn't be a huge deal. Obviously there could be other combinators like `hot.Any`, or router-like ones that depend on implementation-specific run-time flags, span attributes, etc.
Something like this would allow one to switch vendors without downtime, since it allows a transition period (possibly days) while both the old and the new tracing systems are running. As far as I can see, it should be close to impossible to switch with zero downtime without such a feature.
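To make the shape of this concrete, here is a minimal sketch of what an `All` combinator could look like. Everything here is hypothetical - the `hot` module, `MultiplexingSpan`, and the composite span context are assumptions for illustration, not an existing API:

```python
class MultiplexingSpan(object):
    """Hypothetical span holding one delegate span per child tracer."""

    def __init__(self, spans):
        self.spans = spans

    def finish(self):
        for span in self.spans:
            span.finish()


class All(object):
    """Hypothetical HOT that fans every call out to all delegate tracers."""

    def __init__(self, *tracers):
        self.tracers = tracers

    def start_span(self, operation_name=None, **kwargs):
        # One underlying span per delegate, wrapped in a single handle.
        return MultiplexingSpan(
            [t.start_span(operation_name=operation_name, **kwargs)
             for t in self.tracers])

    def inject(self, span_context, format, carrier):
        # Each delegate writes its own (ideally prefixed) keys into the
        # same carrier, so the in-band contexts can coexist.
        for tracer, ctx in zip(self.tracers, span_context.contexts):
            tracer.inject(ctx, format, carrier)
```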
Hopefully different tracers would be using different header keys anyway, so I would punt on the namespacing, especially since introducing namespacing is also incompatible with the existing deployment.
@yurishkuro - I agree... no "formal" namespacing. Just that people should be aware of aliases when they design their protocols. One interesting corner case is baggage, where I can imagine implementations that just normalize the key and push it directly to the carrier. If all implementations happen to naively leave the value unchanged, this might be OK, but generally speaking bad things could happen, especially in the context of C++ where it could cause a crash.
From the previous discussion...
> yes, the attributes are the only read operation in the whole API, so I think the composite tracer will have to pick one of the delegates, i.e. always the first, and the end users can configure delegate tracers in the proper order.
I think the only mathematically correct (i.e. functional) way to do this is to not discard any information, i.e. something like `span.get_baggage_item(key)` should return `{'value1': ['ZipkinTracer', 'LogTracer'], 'value2': ['ProprietaryTracer']}` and let the app do whatever it sees fit. Personally I would be more than happy with that, but I'm sure most people would freak out. Could it return the first (non-null) majority and call an error callback mentioning "discarded value XYZ from that tracer"? Of course it could return the first non-null value, but "first" doesn't sound very well defined :) Another option is to pass an (optional) `lambda iterable: ...` to `get_baggage_item()` that lets the app hand-craft its policy (this makes more sense with functors in C++). The same thinking applies to the return/error codes.
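As a rough sketch of the "don't discard information" variant, with the optional policy callable - all names here are illustrative, not a proposed API - a method on the hypothetical `MultiplexingSpan` could look like:

```python
def get_baggage_item(self, key, resolve=None):
    # Collect every delegate's answer instead of discarding information:
    # value -> tracer names that reported it, e.g.
    # {'value1': ['ZipkinTracer', 'LogTracer'], 'value2': ['ProprietaryTracer']}
    values = {}
    for span in self.spans:
        value = span.get_baggage_item(key)
        if value is not None:
            values.setdefault(value, []).append(type(span.tracer).__name__)
    if resolve is not None:
        # The app hand-crafts its policy, e.g. a majority vote:
        #   resolve=lambda vs: max(vs, key=lambda v: len(vs[v]))
        return resolve(values)
    return values
```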
if you assume that the tracers already use prefixed keys, then encoding of the baggage is no different from encoding of the span context.
Errors are harder. I would go with a configuration option for the multiplexing tracer to either fail fast or to do best effort and aggregate all errors into one.
Let's assume that I upgrade from a tracer that doesn't support baggage propagation to one that does (or the reverse). I think this use case could easily hint towards a good solution.
I may have misread your point about baggage - so no issues with encoding/decoding, just what to return from `MultiplexingSpan.getBaggage()`. I would go with a simple solution and return whatever the first tracer returns; a more complicated solution is, again, an internally configurable strategy in the tracer on what to do in case of conflicts (I wouldn't worry until there's a real need).
No - you are correct. There were two issues with baggage. One is the one you said above, about the prefixed keys: i.e. if someone sets `"hello": 12` as baggage, then e.g. `X-B3-hello: 12` should go to the `Carrier`. The second issue is what `MultiplexingSpan.getBaggage()` returns (which should be the same as `Span.getBaggage()` in terms of type, otherwise we break the O(1) implementation-replacement rule). There are also subtleties like what happens if we have 3 mux'ed Spans and the second throws `KeyError`... I think the Multiplexer should mark it but continue and try to set on the third mux'ed Span. Similar interesting cases exist for `Tracer.extract()` and `Tracer.inject()`. We might be able to restore a `MultiplexingSpan` with mux'ed Spans for tracers 1 and 3 but not for 2 (e.g. due to `UnsupportedFormatException`). The `MultiplexingSpan` should optimistically survive, but the partial error shouldn't go completely unnoticed. I think this discussion is quite crucial to the essence of OT, and no matter how much pain, there should be a good answer. If OT is about switching between implementations, it should be possible to do so without downtime... otherwise only relatively small companies can afford it.
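A sketch of the "mark it but continue" behavior for `set_baggage_item`, assuming the hypothetical `MultiplexingSpan` above and a made-up `_report_partial_failure` error sink:

```python
def set_baggage_item(self, key, value):
    # Try every mux'ed span; a KeyError from the second delegate must
    # not prevent the attempt on the third.
    errors = []
    for span in self.spans:
        try:
            span.set_baggage_item(key, value)
        except Exception as exc:
            errors.append((span, exc))
    if errors:
        # Survive optimistically, but don't let the partial error go
        # completely unnoticed.
        self._report_partial_failure('set_baggage_item', errors)
    return self
```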
If the mux tracer can be configured with error handling strategies like FAIL_FAST and ACCUMULATE, I think it mostly solves the issue with extract(). I don't think there's a way to define the "correct" behavior of the mux, so give users a choice to pick the behavior and do best effort.
There's probably a need for a 3rd error strategy - to succeed if at least one tracer succeeds, because the user code might be checking the error and not using the returned span context as parent in case of an error.
What are your thoughts on `ACCUMULATE`?
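As a sketch, the three strategies could shape `extract()` roughly as follows; the strategy names come from this thread, while `MultiplexError` and `MultiplexingSpanContext` are hypothetical:

```python
class MultiplexError(Exception):
    """Hypothetical aggregate of per-tracer errors."""

    def __init__(self, errors):
        super(MultiplexError, self).__init__(errors)
        self.errors = errors


class MultiplexingTracer(object):
    FAIL_FAST = 'fail_fast'      # propagate the first delegate error
    ACCUMULATE = 'accumulate'    # run all delegates, raise one combined error
    ANY_SUCCESS = 'any_success'  # succeed if at least one delegate succeeds

    def __init__(self, tracers, strategy=ACCUMULATE):
        self.tracers = tracers
        self.strategy = strategy

    def extract(self, format, carrier):
        contexts, errors = [], []
        for tracer in self.tracers:
            try:
                contexts.append(tracer.extract(format, carrier))
            except Exception as exc:
                if self.strategy == self.FAIL_FAST:
                    raise
                contexts.append(None)
                errors.append(exc)
        all_failed = all(c is None for c in contexts)
        if errors and (self.strategy == self.ACCUMULATE or all_failed):
            raise MultiplexError(errors)
        return MultiplexingSpanContext(contexts)  # hypothetical composite
```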
> There's probably a need for a 3rd error strategy - to succeed if at least one tracer succeeds, because the user code might be checking the error and not using the returned span context as parent in case of an error.
I agree with this a lot. Overall, you want to be able to upgrade/downgrade gracefully in terms of Tracer features. If baggage or a specific type of carrier wasn't supported and is supported by the new tracing system (or the reverse), I guess you want to print a `DEBUG`/`INFO` log and continue using as many Tracer features as possible. If no tracer supports a feature, then an `ERROR` should be logged and someone should look at it ASAP... because e.g. you are using a baggage value that is now `null`.
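In sketch form, that graded logging could look like this for `inject()` (the surrounding class is assumed; `opentracing.UnsupportedFormatException` is the OT exception for unknown carrier formats):

```python
import logging

import opentracing

log = logging.getLogger('hot')


def inject(self, span_context, format, carrier):
    unsupported = []
    for tracer, ctx in zip(self.tracers, span_context.contexts):
        try:
            tracer.inject(ctx, format, carrier)
        except opentracing.UnsupportedFormatException:
            unsupported.append(tracer)
    if unsupported and len(unsupported) < len(self.tracers):
        # Some tracers lack the feature: degrade gracefully and keep going.
        log.debug('%d of %d tracers do not support format %r',
                  len(unsupported), len(self.tracers), format)
    elif unsupported:
        # No tracer supports the feature: someone should look at it ASAP.
        log.error('no tracer supports format %r', format)
```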
Bringing #28 (comment) here...
> I think this approach will help upgrade or switch systems, though in practice it might require complete deployment across the entire architecture before you switch off the former.
Yes exactly. There should be some kind of metric, and you can't switch off the old system till you see no hits.
> Key concerns will be around aggregate (in and out of process) propagation and trace context provisioning I think.
I think the first concerns mostly what we talked about above. I'm not quite clear on the second one...
Error handling strategies are the most important thing in multi-tracer implementations, and we should choose them carefully. With the wrong strategy, one tracer implementation can fail because of another.

For example, if extraction runs as a chain, `tracerA.extract() -> tracerB.extract() -> tracerC.extract()`, and `tracerB.extract()` fails, the strategy should make sure `tracerC.extract()` is still invoked. Otherwise tracerC's context will be missing, and tracer C's trace will be broken. This kind of bug is hard to find from either the tracer-user side or the tracer-developer side.
> enable >1 tracers working in parallel

I think "parallel" may not be the accurate word here. Parallel implies execution can be truly simultaneous, which almost always means running in multiple threads - a dangerous choice.
> Such as: extract in `tracerA.extract() -> tracerB.extract() -> tracerC.extract()`

Most likely, multiple tracers execute as a chain, as I said above - of course, with some error handling strategies.
@yurishkuro, I would prefer strategies that isolate the tracer implementations; we could provide a `listener` to notify the user when there are errors or exceptions in some of the multiplexed tracers, as sketched below.
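A sketch of such a listener together with the chained extraction that uses it; the listener interface is made up for illustration, not an existing OT type:

```python
import logging

log = logging.getLogger('hot')


class TracerErrorListener(object):
    """Hypothetical hook keeping delegate failures isolated but visible."""

    def on_error(self, tracer, operation, exc):
        # Never re-raise: one failing implementation must not break the
        # others in the chain. Log or bump a metric instead.
        log.warning('%s failed during %s: %s',
                    type(tracer).__name__, operation, exc)


def extract(self, format, carrier):
    # tracerA -> tracerB -> tracerC: a failure in B must not stop C,
    # or tracer C's trace silently breaks downstream.
    contexts = []
    for tracer in self.tracers:
        try:
            contexts.append(tracer.extract(format, carrier))
        except Exception as exc:
            contexts.append(None)
            self.listener.on_error(tracer, 'extract', exc)
    return MultiplexingSpanContext(contexts)  # hypothetical composite
```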
@lookfwd sorry for the delay, having a crazy week.
The primary value prop around interoperation of Tracers is not actually so that an org can necessarily switch from one tracing vendor to another with zero hiccups, although that's another nice possible benefit :)
The main advantage I see is actually for 3rd-party OSS/etc projects that are used by many organizations and need to provide tracing instrumentation without making assumptions about the environments they're integrated into. E.g., take DropWizard: there are companies using DropWizard and Jaeger; companies using DropWizard and Zipkin; and companies using DropWizard and LightStep. By instrumenting DropWizard with OpenTracing, vendor-neutrality is preserved at the DW level, and the vendors don't all need to independently instrument DW.
As for the idea of informally namespacing headers: sure. Though IMO this is redundant to discuss at the OT level since that's true for any possible HTTP header (given that HTTP headers are a shared namespace to begin with). I don't strongly object to making the docs longer / more complex to accommodate this advice, but I am skeptical that it's needed to achieve your desired result :)
> The main advantage I see is actually for 3rd-party OSS/etc projects...
@bensigelman Awesome - that wasn't clear to me.
> without making assumptions about the environments they're integrated into. E.g., take DropWizard: there are companies using DropWizard and Jaeger
Aren't they making the assumption there's only a single tracing framework? How can one use DropWizard with both Jaeger and Zipkin?
Frankly, adding instrumentation means adding quite a bit of non-trivial code to a potentially large (e.g. N >> 1000) number of applications, and hopefully that instrumentation code can stay in place (without the need for any changes) for potentially 5+ years. I believe an application developer wants to make the most out of the added instrumentation code, which means it's quite likely one would like to run solutions from 2-3 different vendors, plus a few other possibly unrelated, smaller tracing-like applications that implement a limited part of the OT API.
> Aren't they making the assumption there's only a single tracing framework?
It's a pretty reasonable assumption, but it's not a concern of Dropwizard instrumentation. If someone manages to make a Multiplexing tracer, as discussed on this thread, then a Dropwizard user can use two tracers simultaneously, and DW doesn't need to change anything in its instrumentation.
> How can one use DropWizard with both Jaeger and Zipkin?
I don't think Ben was suggesting that. Unless OpenTracing 2.0 comes up with a common in-band format for context propagation, even if you use multiple tracers each one of them will still see only the slice of architecture that directly runs with that tracer.
> If someone manages to make a Multiplexing tracer
Exactly - the question is: does OT impose constraints that prevent such a tracer from being possible? My guess is that we wouldn't want it to.
For example, as we mentioned previously, `span.get_baggage_item(key)` is assumed to either succeed or fail, but not partially fail. I agree that in general this is mainly the Multiplexing tracer's concern, but it's possible that e.g. DW would be interested in the distinction between `span.get_baggage_item(key, mode=strict)`, which fails if two tracers disagree on an "important" baggage item, and the default `span.get_baggage_item(key)`, which returns the first non-`null` baggage value from the child tracers. Similarly for partial success/failure of `inject()` etc.
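A sketch of that distinction, where `mode`, `BaggageConflictError`, and the first-non-null default are all hypothetical:

```python
def get_baggage_item(self, key, mode='first'):
    values = [s.get_baggage_item(key) for s in self.spans]
    present = [v for v in values if v is not None]
    if mode == 'strict' and len(set(present)) > 1:
        # Two tracers disagree on an "important" baggage item.
        raise BaggageConflictError(key, present)  # hypothetical exception
    # Default: the first non-null value from the child tracers, if any.
    return present[0] if present else None
```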
P.S. I realize that it's mostly talking and thinking here. I think I will try to write/plug a mixing tracer next time I have time to play with Zipkin-Python-OT