opentracing/specification

Clarify the usage of the `error` tag

yurishkuro opened this issue ยท 3 comments

Currently the error tag is defined as following:

true if and only if the application considers the operation represented by the Span to have failed.

I wonder if it makes sense to provide additional clarification what is meant by "failed". For example, if a span represents an HTTP endpoint that returned a 400 or 500 status code, did it "fail" in both cases or only in the latter? For instance, the definition of "availability" in the Google SRE book counts as failures only when the server fails to respond to a "good" client request, so a 400 response is not a server failure. On the other hand, the span that represents a client request should "fail" with both 400 and 500 responses (imo).

Another approach could be to replace the boolean tag with an enum/code indicating if the error was the caller fault or the server fault, but it seems by using the interpretation above it may not be necessary.

The reason this is important is because a lot of existing OT instrumentation is too eager to mark the span with error=true, which complicates the automated blame analysis.

@opentracing/otsc @opentracing/otiab

@yurishkuro Agree it would be good to be able to distinguish client and server errors, however I think client errors still need to be recorded in someway on the server span - e.g. to be able to detect a high percentage of invalid client requests that may indicate some form of attack?

The service may be the initial entry point, with no client span to record the failure - in which case the server span may be the only record of the fact that a failed attempt occur.

We could use the http.status_code and infer the client error - but may be better if there was a protocol/transport independent tag to enable analysis of such situations.

I agree that some clarification is in order. The convention says it should be based on whether the application considers the operation to have failed. Perhaps the clarification should be an admonition that the response from the server does not necessarily indicate whether the client application should consider the span to be in error. It is just one input to that decision.

The status code already says whether the responder blames itself (5xx), the client (4xx), or something else. Setting error=true based only on the HTTP response code is not adding information, and potentially adds noise. Whether a response is truly an error from the client perspective is subject to context. Certainly 401, 402, 403, 404, 409, 410, 423, 451, etc. can reflect normal and potentially frequent behavior in some systems, but errant behavior in others. Similarly, a client expecting 200 could consider it an error to receive 204 or 205, but that could be perfectly normal to the client of a REST service. The situation even becomes more fraught in the presence of something like linkerd producing response codes for requests that never made it to another server, or perhaps that were retried to multiple servers.

It could make sense for a client to tag error=true automatically in situations where either the reqest was known not to have been sent, or the client believes the request was sent but didn't receive a response, or perhaps the response was so broken or malformed that not even an HTTP code is available (failed SSL negotiation could fall into this category). In those cases, the client span is likely to be the only record of the failure.

@objectiser, attackers are (sometimes) intelligent, so although a spike in invalid client requests can indicate an attack, a spike in valid requests can too. A spike in 201 or 202 could indicate an attack with worse consequences than a spike in 404. Detection I think will be more successful if it is at the least based on the actual status code.