understand and address concerns with ActiveSpan refcounting
bhs opened this issue ยท 42 comments
Per comments like this one and notes like these from @adriancole's recent distributed tracing workshop, there are justifiable questions about the refcounting happening in OT-Java's ActiveSpan
and its (non-merged) OT-Python equivalent.
If I understand correctly (and set me straight me if I do not), nobody wonders whether there's a need to "activate" (and deactivate) an ActiveSpan in some sort of thread-local, most likely using a scoped language construct like try-with-resources in Java or with
in Python. That said, there are significant concerns about the coupling of the activation/deactivation and decisions about when Spans are actually finished.
OT-Java's ActiveSpan automatically finishes the underlying Span as soon as the last reference is decremented/deactivated. This makes common-case application code shorter and less error-prone since the programmer need not remember to both finish and deactivate ActiveSpans, especially when deferred work (via closures/callbacks) is involved.
The current OpenTracing approach was also designed to allow for other approaches to this problem by factoring out the ActiveSpanSource interface: e.g., one could imagine an ActiveSpanSource adapter that allowed for ref-count-free activate/deactivate semantics like the ones @bogdandrutu described here.
Changing the current API would be painful (and would be done carefully if at all), but getting this right is extremely important (hence the volume of discussion about PRs like OT-py #52, and OT-java #111, #115 (which was merged), #116, and #117).
My hope in creating this issue is that @adriancole @bogdandrutu @palazzem @jakerobb @LotharSee @pavolloffay @objectiser @carlosalberto @clutchski @yurishkuro or others will chime in with documentation of current concerns / missed requirements / etc and we can go from there. If there are other PRs/issues/docs where ActiveSpan troubles have come up, please drop them in here as well.
(fixed title to more accurately reflect intent here)
Main concern summed up is that the current approach is the last thing we'd want to use for anything deemed common-case, as it is flawed and hard to spot how it is broken. Doing so amplifies the support implication of figuring out which thing was called incorrectly then shifted the whole thing out of alignment.
The alternative is being explicit, which in the workshop we found to not notably increase the size of the code, nor its complexity. For example, it replaced awkward calls with explicit ones. Since the explicit ones are safer, it seems a no brainer to start with safe instead of putting in something likely to make implementation unreliable.
My concrete suggestion: rip out the the magic finisher part and keep the ability to implement scoping. Make the scoping thing like others where idiomatic in language (census, grpc, finagle, brave, etc. etc.) Standards usually follow practice, so there's no good reason to not do that.
The magic finisher could be a "use at your own risk" decorating utility, which hopefully wouldn't be used by any serious instrumentation as those writing RPC should do things properly.
@adriancole re:
... the current approach is the last thing we'd want to use for anything deemed common-case, as it is flawed and hard to spot how it is broken
The current ActiveSpan concepts are a translation of the concepts in Dapper's C++ and Java integration libraries. @bogdandrutu accurately pointed out that there are some tradeoffs/problems therein (esp with daemon threads, repeated callbacks, etc), but there were also meaningful benefits when confronted with async control flows and routine callback/closure functions. E.g., the basic refcounting approach here "just worked" for 10K-span traces of web search, etc, even with a programming idiom that was highly async.
I also wanted to point out that OT-Java intentionally supports both "manual" and "active" Span management modalities, and not just for compatibility reasons: the former does no refcounting at all (nor does it register with the ActiveSpanSource / TLS) and is designed for situations where explicit start/finish is preferable.
I think what you're voting for is something in between, where there's an explicit start/finish on the Span as well as a scoped activate/deactivate mechanism that never finishes the Span, correct? That could be as simple as making the refcounting opt-in via an autoFinish
param, a la SpanBuilder#startActive(bool autoFinish)
or ActiveSpanSource#makeActive(Span span, bool autoFinish)
.
@bhs I had ping you and OT team at an sky-walking issue and gitter, but, sadly, I got no response yet. In sw 3.2 iteration, I want to push to support OT-java 0.30, here is the bridge code:
With the NeedSnifferActivation
annotation, means what sw will do in the agent.
Here are two parts of How sky-walking did the supporting for OT-java 0.30.0**.
- For
Continuation
, sw can't support the ref counter, because we don't care about it.
In sw, each trace has multi trace segments, each one of them belongs to one thread only, So when cross-thread happened, we capture a snapshot( something like Continuation
), and continued it in sub thread, even in batch mode.
In the process, we just build the references between two segments (multi thread). That is why sw didn't need counter mechanism.
- For ActiveSpan and Span, sw just consider them as a shell for sw core/real span. The active depends on the context stack, the top one is active, otherwise, it isn't.
I was looking around on github to try and grok anyone's real world experience with ActiveSpan as it is defined today. What I found was that few tracers support it, which makes me wonder how it was tested in any practical sense. This makes it an odd choice to make as a primary api.
https://github.com/search?l=Java&q=ActiveSpanSource&type=Code&utf8=%E2%9C%93
For example, of the maintained tracers, I can only find noop impls ex in lightstep and instana (cc @CodingFabian). Is anyone using this or can't live without it? If this is a niche thing, maybe we can save time discussing by chopping it.
@bhs I think we should consider desired practice at google if we are citing practice at google. census is being developed without this style, by folks at google.
Specifically, I do not want this fake reference counting (counting close is not reference counting) as a part of the core library. IOTW, remove the ActiveSpan api, or move it such that implementing a tracer, spanbuilder etc doesn't require exposing this errorprone type.
concretely, in census, there's a convenience type to close a span and a scope simultaneously. This is a lot different than counting calls to close and assuming a certain count implies finishing the span.
Here's the example from census (which I reviewed). Notably this is a lot less fancy, and therefore easier to support for routine synchronous tracing.
try (Scope ss = tracer.spanBuilder("MyChildSpan").startScopedSpan()) {
tracer.getCurrentSpan().addAnnotation("my annotation");
doSomeWork(); // Here the new span is in the current Context, so it can be used
// implicitly anywhere down the stack. Anytime in this closure the span
// can be accessed via tracer.getCurrentSpan().
}
Specifically, I do not want this fake reference counting (counting close is not reference counting) as a part of the core library
+1. And further advice, Continuation#activate
returns an ActiveSpan, but without any chance to set operationName for this span in initialization stage, I didn't like it. When cross-thread, the spanA in parent thread, don't mean spanA should be active in sub thread, so maybe the tracer implementation will build a new span(spanB) and set a reference (spanB child_of/follow_of spanA). At this point, no operationName is intractability.
Since OT is just an API layer, we didn't supervise Span is active or not, so IMO, the only one Span concept is better.
without any chance to set operationName for this span in initialization stage, I didn't like it
Not sure what you mean by this. BaseSpan (superclass of Span and ActiveSpan) has a setOperationName method on it. You can call it from a continuation or from the original active span.
@jakerobb The continuation#activate
returns an activeSpan, I mean where is from? In sky-walking, we didn't propagate the span in the parent thread, into the child/sub thread, so we have to create a new one. But in here, the user can't set the operationName.
I know the user can call the setOperationName
later, after the span created. But in Tracer
interface, OT asked user to create a spanBuilder by the given operationName, like this:
/**
* Return a new SpanBuilder for a Span with the given `operationName`.
*
* <p>You can override the operationName later via {@link Span#setOperationName(String)}.
*
* <p>A contrived example:
* <pre><code>
* Tracer tracer = ...
*
* // Note: if there is a `tracer.activeSpan()`, it will be used as the target of an implicit CHILD_OF
* // Reference for "workSpan" when `startActive()` is invoked.
* try (ActiveSpan workSpan = tracer.buildSpan("DoWork").startActive()) {
* workSpan.setTag("...", "...");
* // etc, etc
* }
*
* // It's also possible to create Spans manually, bypassing the ActiveSpanSource activation.
* Span http = tracer.buildSpan("HandleHTTPRequest")
* .asChildOf(rpcSpanContext) // an explicit parent
* .withTag("user_agent", req.UserAgent)
* .withTag("lucky_number", 42)
* .startManual();
* </code></pre>
*/
SpanBuilder buildSpan(String operationName);
We are providing two style to create span, I mean this.
continuation.activate doesn't create a new span, though. It gives you an instance of an already-existing span for which you've already provided an operation name.
That is another issue I mentioned, as a spec, we shouldn't ask the implementation to propagate span in parent thread, which is not the way sky-walking did for cross thread.
That is why I say so. Hope it is clear for you.๐
The reason our instana tracer looks like noop is because its actual inner workings are injected at runtime.
We do not use refcounting, and we do not want to use refcounting. We have a proprietary cross thread active span tracking method that is more efficient, because it does not allocate memory for new spans.
In our OT compatibility mode we currently do not use the refcounting, because in all current cases our own mechanism works. should a customer want to use refcounting semantics, we would need to change that, but this would come then with a performance hit.
skywalking did very similar works with instana. @CodingFabian. Agree with you in refcounting.
I can only find noop impls ex in lightstep and instana
LightStep's Tracer uses the ActiveSpanSource provided by the caller or the straightforward OT util
ActiveSpanSource by default.
I will respond to the more important points separately.
Does it make sense to have a bit of a side by side comparison of the proposed approaches? This might make it a bit easier for me to parse.
Re important points:
- Regarding Google/Dapper/Census: my point about Dapper's instrumentation library is simply that the refcounting approach worked well in a mature, complex codebase that involved substantial intrinsic complexity, asynchrony, callbacks, and so forth. Census != "all of Google", and of course neither does Dapper. This is a data point, period, and IMO refutes the idea that the refcounting approach is crazy/broken/indefensible. It is certainly debatable, but that's different than "broken".
- Regarding your specific suggestions, I (ironically given the bickering just above) like them, at least in spirit. I particularly like the idea that the core concepts around ActiveSpan and the reference counting pieces can be decoupled more than they are now.
- Regarding production usage: I am not allowed to specify companies for contractual reasons, but this stuff does run (and operate as intended) in production environments at many brand-name tech companies.
I'll let others chime in with their thoughts on this. I am on paternity leave right now but will try to find time to experiment with ^^^ decoupling and see if there's a path forward that's (a) not terribly disruptive and (b) provides more optionality for callers and Tracer implementors.
Does it make sense to have a bit of a side by side comparison of the proposed approaches?
(Definitely)
FWIW, on openzipkin-contrib/brave-opentracing#43, I have brave-opentracing working with opentracing-api-0.30.0 using the refcounting approach. We have 30+ services instrumented with it and working great.
provides more optionality for callers and Tracer implementors.
+1. I don't like the ref counter, I am not saying it's wrong or didn't work. @bhs . Just want the spec and OT-java lib let the implementation make their own choices.
@jakerobb for disclosure we should say that the implementation had a double-increment bug just fixed. Such a bug is possible with the fake ref-counting (again folks counting close is not ref counting!) and impossible without. All code has bugs, certainly mine does, but this was hard to figure out, right?
@adriancole you are sounding like @RealDonaldTrump
with this "fake ref-counting" FUD! There is, in fact, a threadsafe count of references to an ActiveSpan or its Continuation. When the count gets to zero, something happens. How the count is incremented or decremented is neither here nor there.
Regarding the next steps here, I would like to try to come up with some alternative proposals that will give all OT users (i.e., regardless of Tracer implementation) something like the current approach from a semantic standpoint, as I maintain that there are significant benefits which I can enumerate given more time; and I would also like to give people a way to activate/deactivate a Span without getting into any of the apparently controversial implicit-finish behavior.
@bhs I don't follow the user you mentioned, so not sure how much I sound like him. Can you point me to the code where what you said is true? and that in reality it isn't just counting the amount of times someone invokes deregister?
I see code like this, which is literally counting the times close is called? Or isn't it? FUD implies that I'm saying something that isn't true. Tell me what I don't understand about this code, and that you do understand about it?
public void deactivate() {
if (0 == refCount.decrementAndGet()) {
wrapped.finish();
scope.close();
source.deregisterSpan(wrapped.unwrap());
}
}
--snip--
public void close() {
deactivate();
}
@adriancole the reference count is incremented whenever a Continuation
is capture()
ed and decremented whenever an ActiveSpan
is deactivate()
d (or close()
d, which is a simply a synonym for deactivate()
that's provided for try-with-resources). As documented here, the last ref--
triggers finish()
. I.e., there is a reference count... it increments during capture()
and decrements during deactivate()
(or the synonym, close()
). There is nothing fancy about it, nor is there anything "fake" about it.
Next time I can really sit down with my laptop I will try to move forward with the meat of this issue as I do think there is good reason to make changes per your concerns.
I think part of the confusion is the fact that it isn't really clear how instrumentation is supposed to interact with the continuation model. My understanding of the intent is that the continuation keeps the existing tracer around while work is being done on different threads, allowing other threads to link new spans to it, but those new spans aren't created automatically and a span can be kept open by other threads without anything being added to it.
Anyway, this is a specific area where I think it would be beneficial to have better documentation and examples in addition to a cross implementation compliance test.
I think part of the confusion is the fact that it isn't really clear how instrumentation is supposed to interact with the continuation model.
+1. A tracer implementation can support the interop module for a trace crosses threads, but each implementation has their own way.
From our perspective (Kamon/Kamino teams) the main concern with the automatic closing via refcount is that it mixes two different concerns: in-process context propagation and span life-cycle management.
Having any sort of ref counting in Spans can break behavior without notice. I can totally see a typical Servlet application relying on ActiveSpan auto-finishing and for it to work nice, but as soon as any app/library code starts capturing continuations to be used after the response has been sent to the client or just capturing them and failing to use them, the main request Span might take longer to be automatically finished, if at all. This is my main concern with ref counting as it opens the door for things to break. @bhs mentioned that OT also supports the "manual" mode which doesn't do ref counting nor puts anything in TLS, but in most cases you definitely need to have the Span in TLS for it to properly propagate when RPC/Database/TracedWhatever happen, so, you end up needing a ActiveSpan that doesn't finish automatically. It has already been mentioned that SkyWalking and Instana do not support ref counting and we at Kamon are not supporting it neither because of the reasons exposed above. This might be a hint that something should definitely be done about it.
The ActiveSpanSource/Continuation/ActiveSpan abstractions are quite comfortable to use IMHO, they abstract away where the ActiveSpan actually is (even though most probably it is going to be in a ThreadLocal) and allows for swapping behavior when needed while still providing a simple and expressive API. In that sense, awesome.. for the sole purpose of in-process context propagation these abstractions are nice, but mixing them up with span life-cycle management is like signing up for unexpected behavior and debugging nightmares.
I already asked whether ref counting was a required behavior for tracers and the answer seemed pretty clear, although the description of this issue suggests that it is fine for a tracer not to do this. If it is fine for a tracer not to do it then it should be absolutely clear in the documentation. That being said, my suggestion is to completely remove any mention/implementation of reference counting from the opentracing-java codebase and let the tracers give this functionality to their users if/when it makes sense.
@ivantopo interesting input. One thing I've to ask you is about the difference between something that looks like a good abstraction and something that is a good abstraction. In past work, I've usually found 3 implementations to be the sweet spot where you know the abstraction is starting to hold water. Meanwhile, the ActiveSpanSource, regardless of its likability is lacking in this department. Somehow it is promoted to a core api: you cannot implement a tracer without requiring it. Do you feel it is a good idea to keep such an api as a core api considering the second open source implementation resulted in quite a lot of discussion including a surprise to many that it is in fact hinting at an impl (aka leaky abstraction)?
Do you feel it is a good idea to keep such an api as a core api considering the second open source implementation resulted in quite a lot of discussion including a surprise to many that it is in fact hinting at an impl (aka leaky abstraction)?
I do think that the ActiveSpanSource/ActiveSpan/Continuation API is worth staying in OT, but there have to be some adjustments to ensure these abstractions have a single, sound and well defined responsibility. IIRC the main reason why these abstractions were created was to provide an API for in-process context propagation, but at some point along the way that goal got intertwined with that of writing safer and less error-prone code by ensuring Spans are finished via ref counting and that (ironically) introduces the leakage.. if these abstraction could regard themselves only with in-context propagation then they definitely would deliver on their promise, at least from my point of view.
IMHO this concern gets addressed by limiting the scope of ASS/AS/C to just context propagation and removing any sort of life-cycle responsibilities from them (i.e. ref counting). If any tracer implementation wants to provide it that's fine I guess (although I totally recommend against it) but the specification shouldn't force you to do that. e.g. If the folks at Lightstep are having good results with it that's fine, they can still ship this behavior in their tracer if their users want it, but we shouldn't all be forced to support it nor should the OT specification advertise this sort of vendor-specific behavior in any way. Once again, 3 tracer implementations deliberately deciding not to support it might mean something.
Meanwhile, the ActiveSpanSource, regardless of its likability is lacking in this department. Somehow it is promoted to a core api: you cannot implement a tracer without requiring it.
Having the ActiveSpanSource to be pluggable might be a benefit, but if we think of this realistically, any tracer implementation wanting to be used for real world instrumentation needs to have a notion of ActiveSpan/CurrentContext/SpanInScope (or whatever name fits the purpose).. Not just that, sometimes instrumentation for specific frameworks will rely on semantics of a specific ActiveSpanSource implementation and allowing the user to change that might lead to a misbehaving instrumentation. So, as long as it is not prescribed that the Tracer's ActiveSpanSource should be the only ActiveSpanSource that can be used, I'm totally fine with this requirement since doing any meaningful work will very likely require a ActiveSpanSource anyways.
I'm curious... are there any particular reasons folks are implementing their own ActiveSpan
implementation instead of just using ThreadLocalActiveSpan
?
@tylerbenson in our case, to prevent ref counting from happening.
@tylerbenson, re:
are there any particular reasons folks are implementing their own ActiveSpan implementation instead of just using ThreadLocalActiveSpan
I'm aware of impls tied to MDC specifically, but not in OSS.
@ivantopo, re:
OT also supports the "manual" mode which doesn't do ref counting nor puts anything in TLS, but in most cases you definitely need to have the Span in TLS for it to properly propagate when RPC/Database/TracedWhatever happen, so, you end up needing a ActiveSpan that doesn't finish automatically.
... and also ...
IIRC the main reason why these abstractions were created was to provide an API for in-process context propagation, but at some point along the way that goal got intertwined with that of writing safer and less error-prone code by ensuring Spans are finished via ref counting and that (ironically) introduces the leakage.. if these abstraction could regard themselves only with in-context propagation then they definitely would deliver on their promise, at least from my point of view.
Having had this discussion, ^^^ is a good synopsis of the problem as I see it. At this point I agree wholeheartedly that OT-Java would be better off if a Span could be made active without getting involved with automatic finishing (which is what the refcounting is actually in service of from a feature standpoint). Earlier versions of the ActiveSpan code granted the user control in this area, by the way, and people are welcome to wade through the absolutely enormous PR thread to see why that control was removed; IIRC, it was basically a desire to keep the API surface area as small as possible and there was (and is) strong support for the refcounting idiom as it simplifies instrumentation in many (but not all) cases.
At this point I agree wholeheartedly that OT-Java would be better off if a Span could be made active without getting involved with automatic finishing (which is what the refcounting is actually in service of from a feature standpoint).
I'm quite glad to read that, @bhs. I think the concerns are clearly described in this issue (even though it has been open just for 3 days) and there were a few additional comments regarding this in the (quite) long discussion on opentracing/opentracing-java#115.
How can we help to actually address this issue? Is it plausible to think that reference counting could be removed from OT-java? This would be a breaking change, but one that is worth going through imho.
@ivantopo regarding breaking changes: OT is experiencing it's first major growing pains. In order to address important concepts and move the API forward, we have to get everyone to pay attention and exercise the new API. We didn't have that process down this time; and so this is getting hashed out through breaking changes. That's a bummer, but I think it's okay at this stage to say that we've learned the hard way, and will do better going forwards.
In the future, starting with the next iteration of the Java and Python APIs, we need to build up a process of ensuring the release candidate is properly exercised, and these conversations happen at the correct point in the release cycle. I will be paying a lot of attention to this.
(BTW compatibility/rollout is more related to #78, let's have that discussion there and keep this focused on information gathering and review of the current API).
Is it plausible to think that reference counting could be removed from OT-java? This would be a breaking change, but one that is worth going through imho.
The reference counting per se is not the feature, of course: the feature is automatic Span finish()
ing. How can I say this... I want the automatic Span finishing to be syntactically clean/unburdened and semantically possible for sure without tying the user to a particular Tracer implementation. I also think the people on this thread are all in agreement about the fact that there must be a way to mark a Span as "active" without getting involved with the refcounting (if it continues to exist (or be implied) as part of the core API).
Regarding breaking changes, my first priority is to make sure the API is a good one. Second priority is to make sure that any breakage is communicated, mitigated, etc better than it has been so far... this may involve OTSC literally making PRs to deal with any breakage in major OT impls (e.g., Zipkin/Brave) before the PR lands in OT master. See also #78. And a third priority is not to break the API, of course. But more important to get things right, and the tenor of this issue does not sound like an API that's "right".
I did some prototyping in the past hour and feel optimistic that there's a way to satisfy the various desires here elegantly... People like me would get an auto-finish()ing, ref-counted ActiveSpan adapter/wrapper that can be incorporated regardless of Tracer impl; and those who don't want such a thing can simply choose not to use it. None of the refcounting would fall on Tracer impls.
My "sense" (sic) is that the code is correct; but I need to flesh it out and test in a more meaningful environment than I have had the chance for so far. If it does indeed work, the migration path for existing code that depends on ActiveSpan would be straightforward (basically, when the ActiveSpan is initially created, wrap in this adapter). More later.
I'm curious... are there any particular reasons folks are implementing their own ActiveSpan implementation instead of just using ThreadLocalActiveSpan?
At least in the brave project, this came up to be able to interop with MDC (in various libraries and versions) as well alternate scoped storage (grpc and finagle context integration is requested semi-annually or so). In the future, I would love to have a more powerful option than TLS, similar to what some APMs have probably building on this. Plus, in java we are going to get agents so being able to even have a marker impl would be helpful to prevent agents from redundantly instrumenting code already instrumented.
PS really appreciate things turning around here
For my usage of OT (Java, Brave, and the brave-opentracing bridge, which I've [https://github.com/openzipkin-contrib/brave-opentracing/pull/43](updated to be compatible with OT 0.30)):
I started out by just making BraveTracer wrap an instance of ThreadLocalActiveSpan. That worked for my usage, but received two criticisms:
- It doesn't keep Brave informed about the current active span, nor does it learn of spans made active through Brave's native API.
- It introduces a dependency on an additional library (ot-util).
I reacted to that feedback by duplicating TLAS's behavior into my own BraveActiveSpanSource, wiring in Brave as needed.
I think a better approach might be to provide hooks in both directions so that the impl and OT can keep each other informed about changes to the active span.
I wholeheartedly support separating management of current trace context with support for auto closing. That said, I am using and would prefer to continue using both.
One thing not explicitly mentioned in OT spec is the implicit stack of current spans. Nearly every span has a parent, and that relationship is established specifically by the spec for creation -- if there is a current active span when you create a new span, the new one will become a child of the current. Upon finish of that child span, the spec does not mention that the parent should be made active again, but to do otherwise completely breaks the paradigm. As such, ActiveSpanSource impls need to maintain a (thread local) stack, and that stack has to be pushed and popped at all of the right times, but OT doesn't acknowledge that.
(Sorry if this is incoherent; I'm commenting from my phone on my couch with my three year old interrupting every eight seconds! ๐คฃ)
Alright, I put together opentracing/opentracing-java#166 as an illustration. Not 100% complete, but tests pass, etc, and it hopefully shows how the refcounting can be pulled out from the rest of the ActiveSpan machinery.