What should we be specifying?
I opened up a conversation in Gitter and was recommended by @clutchski to open an offline, trackable discussion here.
We have a ton of different systems, implemented in all manners and languages (go, c, c++, perl, python, java, scala, elixir, erlang). You name it, we have it.
We believe that the main value to opentracing has to do with creating an industry standard around the following:
- Inter-process propagation. This can be by any means (HTTP being the most common). Standard way to pass the payload so to speak. This includes all shapes and manners of flags in addition to baggage (or even if baggage goes inter-process)
- "Wire" format for the actual span data so it can be easily consumed. For example, in XML there would be an XSD. In JSON, we can provide a schema for guidance. The same goes for Avro.
- Sampling implications - there is the standard "shouldSample: true | false", but there may be the need for dynamic sampling algorithms. I would love to explore these use cases and that would help pound out some of the additional inter-process propagation stuff
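To make the first and third bullets concrete, here is a minimal sketch of what a shared propagation format could look like over HTTP. The header names (`x-trace-id`, `x-span-id`, `x-sampled`, `x-baggage-*`) and the `TraceContext` type are hypothetical, made up for this example; they are not taken from OpenTracing or any existing tracer.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, Mapping, Optional

# Hypothetical header names, chosen only for this sketch.
TRACE_ID = "x-trace-id"
SPAN_ID = "x-span-id"
SAMPLED = "x-sampled"
BAGGAGE_PREFIX = "x-baggage-"


@dataclass
class TraceContext:
    trace_id: str
    span_id: str
    sampled: bool = True
    baggage: Dict[str, str] = field(default_factory=dict)


def inject(ctx: TraceContext) -> Dict[str, str]:
    """Encode a context as outbound HTTP headers."""
    headers = {
        TRACE_ID: ctx.trace_id,
        SPAN_ID: ctx.span_id,
        SAMPLED: "1" if ctx.sampled else "0",
    }
    for key, value in ctx.baggage.items():
        headers[BAGGAGE_PREFIX + key] = value
    return headers


def extract(headers: Mapping[str, str]) -> Optional[TraceContext]:
    """Decode inbound headers; return None when no trace is present."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {k.lower(): v for k, v in headers.items()}
    if TRACE_ID not in lowered or SPAN_ID not in lowered:
        return None
    baggage = {
        k[len(BAGGAGE_PREFIX):]: v
        for k, v in lowered.items()
        if k.startswith(BAGGAGE_PREFIX)
    }
    return TraceContext(
        trace_id=lowered[TRACE_ID],
        span_id=lowered[SPAN_ID],
        sampled=lowered.get(SAMPLED, "0") == "1",
        baggage=baggage,
    )


def child_of(parent: TraceContext) -> TraceContext:
    """New span in the same trace; sampling decision and baggage carry over."""
    return TraceContext(
        parent.trace_id, secrets.token_hex(8), parent.sampled, dict(parent.baggage)
    )
```

In this sketch the sampling decision is a single propagated flag. A dynamic sampling algorithm would likely need more fields on the wire, which is exactly the kind of thing a propagation standard would have to pin down.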
Once you have a standard way to propagate through all of the things, and you have a standard way to "emit" spans so they can be analyzed / stored, then the world opens up for all kinds of systems to be built up from that standard.
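As a rough illustration of "a standard way to emit spans", here is a sketch of a JSON span record with a minimal required-field check. The field names (`traceId`, `spanId`, `name`, `startMicros`, `durationMicros`) are assumptions for this example only, not an agreed schema.

```python
import json
from typing import Any, Dict

# Hypothetical required fields and their JSON types; not an agreed schema.
REQUIRED_FIELDS = {
    "traceId": str,
    "spanId": str,
    "name": str,
    "startMicros": int,
    "durationMicros": int,
}


def validate_span(record: Dict[str, Any]) -> None:
    """Raise ValueError if a required field is missing or has the wrong type."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        if not isinstance(record[name], expected):
            raise ValueError(f"wrong type for field: {name}")


def emit_span(record: Dict[str, Any]) -> str:
    """Validate a span and serialize it as one line of JSON."""
    validate_span(record)
    return json.dumps(record, sort_keys=True)
```

Any collector that understands the same field set could consume these lines regardless of which language or library produced them. Extra fields pass through untouched, since only the required ones are checked.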
A little background, just conveying these thoughts wrt "we should standardize on in-process propagation", which imo is less valuable than the above.
For an instrumentation library to be widely supported, it would have to have "adapters" to each of the inbound/outbound propagation formats (as well as emission formats for sending actual span data). If we all talked the same way, then I could write my library once.
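To show what the "adapters" burden looks like today, here is a sketch with two made-up inbound formats and a library core that has to try each one. Both formats, their header names, and the `Context` type are hypothetical.

```python
from typing import Callable, List, Mapping, NamedTuple, Optional


class Context(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


Extractor = Callable[[Mapping[str, str]], Optional[Context]]


def extract_multi_header(headers: Mapping[str, str]) -> Optional[Context]:
    """Hypothetical format A: one header per field."""
    if "x-trace-id" not in headers or "x-span-id" not in headers:
        return None
    return Context(
        headers["x-trace-id"], headers["x-span-id"], headers.get("x-sampled") == "1"
    )


def extract_single_header(headers: Mapping[str, str]) -> Optional[Context]:
    """Hypothetical format B: 'traceid-spanid-flag' packed into one header."""
    value = headers.get("x-trace")
    if value is None:
        return None
    parts = value.split("-")
    if len(parts) != 3:
        return None
    trace_id, span_id, flag = parts
    return Context(trace_id, span_id, flag == "1")


# Every additional format in the wild means another entry in this list.
ADAPTERS: List[Extractor] = [extract_multi_header, extract_single_header]


def extract_any(headers: Mapping[str, str]) -> Optional[Context]:
    """Try each adapter in turn; the first one that recognizes the request wins."""
    for adapter in ADAPTERS:
        ctx = adapter(headers)
        if ctx is not None:
            return ctx
    return None
```

With a single agreed format, `ADAPTERS` would shrink to one entry and every library author would stop rewriting the same parsing code.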
Tracing is not all that difficult; there are just subtle rules to play by.
For example, it is great if I have a java web server and can use a jax-rs or servlet filter to parse the incoming trace context.
For a smaller system / microservice, I may have a few endpoints, talk to one backend and a database. It is very trivial in that instance to do my own tracing thing.
If I am writing an akka-http application, I might have a custom directive for parsing headers. For a finch application, similarly easy to have an Endpoint I can compose to pull out the trace.
My point is, I can imagine a thousand small libraries, many of which overlap with slightly different usage. Having a standard interchange format and output format (or rules) makes the development of those libraries easier while at the same time facilitating the rest of the ecosystem around collectors, aggregators, storage layers, visualization, and applications.
In our TracingPlane work, we developed a specification for general purpose metadata propagation. Specifically:
- a general-purpose serialization format / representation called atoms
- general-purpose logic for merging atoms in their serialized form (e.g., if you have two contexts you need to combine)
- a protocol called the 'baggage protocol' for serializing arbitrary data structures into atom form (including sets, maps, and tree-structured data)
What this gives you is the ability to propagate other peoples' baggage opaquely through your system without needing to interpret that baggage.
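A rough sketch of the idea, to show why opaque propagation is possible at all. It assumes a context is a sorted list of opaque byte strings and that merging is a sorted merge that drops duplicates. Both are simplifying assumptions for illustration and should not be read as the TracingPlane specification.

```python
from typing import List


def merge_atoms(left: List[bytes], right: List[bytes]) -> List[bytes]:
    """Merge two sorted atom lists, keeping one copy of atoms present in both.

    Atoms are compared as raw bytes and never decoded, so the caller does not
    need to know what any of them mean.
    """
    merged: List[bytes] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            merged.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side is exhausted; the rest of the other side is already in order.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

A service that joins two requests, for example a fan-in after parallel calls, could call `merge_atoms` on the two incoming contexts and forward the result without interpreting any of the bytes.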
Looking forward into the future, it might be prudent to go with a general purpose metadata format which can contain the OpenTracing-specific fields. It means future versions of the OT spec can add new fields and flags while remaining compatible with older versions at the context-level.
Alternatively, using a general-purpose metadata format like this would be a way to enable individual OT-compliant backends to implement different contexts and propagate different fields while still standardizing on the underlying format.
@paulcleary thanks for raising this issue. Will be fun+interesting to see the various opinions.
Historical context: OpenTracing began life (even before it had its current name) as an attempt to standardize parts of Span representation. Given feedback from users trying to integrate in various contexts, the priority shifted towards an API for literally describing the semantics of application behavior, rather than the representation of those semantics. There are more details about the background for that prioritization here.
That said, I should be clear that I would personally like to see standardization of, well, everything... who wouldn't, given the choice? Regarding in-band propagation formats, interested parties should absolutely have a look at https://docs.google.com/document/d/1Mrw7hxVAkj7h98hvgRixDt1RrEJ5fsqqddPofIHRVvI/edit#heading=h.eidw2m3e407w . Regarding out-of-band propagation formats, you probably already know about zipkin's format... Google has its own out-of-band span representation, and xray just published theirs the other day: https://docs.google.com/document/d/1Mrw7hxVAkj7h98hvgRixDt1RrEJ5fsqqddPofIHRVvI/edit#heading=h.eidw2m3e407w ... and of course there are out-of-band formats for the N other tracing systems out there. One observation I'd make is that, while the various formats are certainly similar, they are not 1:1 semantically, and often for fundamental reasons... i.e., the differences in out-of-band formats often reflect feature or use-case affinities.
@JonathanMace as far as standardization is concerned, I am a big fan of the level of abstraction in the tracingplane work. If a representation standard is actually going to work across tracing systems, it would certainly need to be quite general.
OpenTracing is some high-level concepts, with a collection of standards.
Language
Currently, the standards are facades for each language: Java, Go, JS, etc.
Inter-process propagation
We could say that we want to add standards for inter-process propagation. It would be AWESOME if I could create traces at my load balancer...that would make it very, very easy to add tracing to a system. But given that LBs are basically all inter-process, standardization could only happen if we agreed on headers, just like everyone agreed on X-Forwarded-For, etc.
Reporting
Consider other aspects of runtime monitoring:
For logging, I can use syslog protocol to send my data to rsyslog, Loggly, Splunk, etc.
For metrics, I can use StatsD protocol for Graphite, Datadog, Sysdig, etc.
There is a lot of value in being able to make these choices at an operational level.
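For a sense of how small such a protocol can be, here is a sketch that formats a metric in the plain-text StatsD line format (`name:value|type`) and sends it over UDP. The function names are made up for this example, and the host and port defaults are assumptions (8125 is the conventional StatsD port).

```python
import socket

# The three most common StatsD metric types: counter, gauge, timer.
METRIC_TYPES = {"c", "g", "ms"}


def statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format one metric as a StatsD line, e.g. 'checkout.latency:12|ms'."""
    if metric_type not in METRIC_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a single metric line over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Nothing in the sender depends on which backend is listening, which is what makes the backend an operational choice. A span reporting protocol could aim for the same property.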
Given that a lot of OT folks use SOAs, I think there's a lot of understanding that inter-process abstractions are very helpful, both for propagation and reporting.
OT could be an umbrella that defines both: language facades and inter-process. If you have custom protocol X the common facade is helpful. If you have custom runtime Y (say, a config-based reverse proxy), the common protocol is helpful.
I don't know if OT is the right place to define inter-process standards, but certainly there is a large need to do so, and we as a community of interested developers need to figure this out.
(cc @adriancole who is a big advocate for standardization of in- and out-of-band propagation formats)
I think we all could define both standards.
I also don't think there is anything fundamentally wrong with defining an inter-process standard and then backing into it via an evolution process.
There are plenty of us folks that can "start from scratch". I would say the majority of companies out there are green-field. Starting with a standard would be extremely valuable to them.
OpenTracing began life (even before it had its current name) as an attempt to standardize parts of Span representation.
Side note: OpenTracing has a leg up on any other efforts just because of the name.
@pauldraper much more specific than "money" - https://github.com/Comcast/money
This just reminded me to update the README and docs
We tried to add a bit more prescription on reporting format and propagation format in https://github.com/Nordstrom/ctrace. The idea for us was to decouple the client sdks reporting from the backend implementation collecting and providing visualizations. It's becoming even more clear that propagation format standardization is extremely important for getting complete end-to-end traces including all third parties, appliances, etc...