w3c/trace-context

Randomness flag bit

Closed this issue · 7 comments

Based on the OpenTelemetry work related to sampling traces using a consistent head sampling rate, @jmacd @bogdandrutu and @oertl would like to make a proposal:

In OpenTelemetry we need to know whether (and how many) bits of the TraceId are randomly generated. In open-telemetry/oteps#168 we propose an "r" value that encodes how many bits of the trace-id are random, but this can easily be encoded in the trace-flags instead, and other libraries/implementations can then use this information for "logs" sampling or any other context-specific sampling.

The proposal is to use 1 bit from the TraceFlags that encodes the following information:

  • If "0", the "randomness" of the trace-id is unknown. This is for backwards compatibility.
  • If "1", the rightmost 63 bits (this needs to be better specified) are randomly generated and can be used to calculate sampling probabilities. For example, for trace-id="123456788765432153ce929d0e0e4736", converting the rightmost 16 characters to binary gives 53ce929d0e0e4736 -> "101001111001110100100101001110100001110000011100100011100110110". All of the 64 bits are random except the leftmost one (the leading 0, which is not shown in the 63-digit string).
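A minimal sketch of how a consumer might read the proposed flag and extract the 63 bits, using the trace-id from the example above. The issue does not say which trace-flags bit would be used, so `RANDOM_FLAG = 0x02` is purely an assumption for illustration, and the function names are hypothetical.

```python
# Assumption: the proposal does not name the flag bit; 0x02 is illustrative only.
RANDOM_FLAG = 0x02
MASK_63 = (1 << 63) - 1


def has_random_flag(trace_flags_hex: str) -> bool:
    """Return True if the (assumed) randomness bit is set in the 2-hex-digit trace-flags."""
    return (int(trace_flags_hex, 16) & RANDOM_FLAG) != 0


def random_bits(trace_id_hex: str) -> int:
    """Return the rightmost 63 bits of a 32-hex-character trace-id as an integer."""
    if len(trace_id_hex) != 32:
        raise ValueError("trace-id must be 32 hex characters")
    low_64 = int(trace_id_hex[-16:], 16)  # rightmost 16 hex characters = 64 bits
    return low_64 & MASK_63               # drop the leftmost of those 64 bits


if __name__ == "__main__":
    bits = random_bits("123456788765432153ce929d0e0e4736")
    print(format(bits, "b"))
```

The result always fits in a non-negative signed 64-bit integer, which is the property the proposal relies on.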

Why 63 and not 64?

The main motivations are:

  • Languages like Go have a native function for generating 63-bit random numbers (see https://pkg.go.dev/math/rand#Int63)
  • Languages like Java do not support unsigned numbers, which makes the 64th bit very hard to use when converting to a number in order to calculate probabilities.

The specification could guarantee 64 bits, but that would be overkill and unnecessary.
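To illustrate the signed-integer concern, here is a small sketch that emulates a 64-bit signed type (such as Java's `long`) in Python, whose own integers are unbounded. The helper name and the sample value with the top bit set are hypothetical.

```python
import struct

MASK_63 = (1 << 63) - 1


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed 64-bit value."""
    return struct.unpack(">q", struct.pack(">Q", value))[0]


if __name__ == "__main__":
    with_top_bit = 0xD3CE929D0E0E4736            # hypothetical value, 64th bit set
    print(as_signed_64(with_top_bit))            # negative when read as signed
    print(as_signed_64(with_top_bit & MASK_63))  # non-negative once masked to 63 bits
```

With only 63 bits in play, comparisons against a sampling threshold behave the same in signed and unsigned arithmetic.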

Related/Simplification: This will resolve/help with a large part of #463

It would be absolutely fabulous if TraceContext began supporting some amount of guaranteed randomness. This is also a good time to do it, as many organizations which produce these IDs are still in the process of adopting this header.

Referencing a related issue, OTel is considering using the remaining bits as a timestamp: open-telemetry/opentelemetry-specification#1947

But as I mentioned in that issue, what is the possibility that we will want ~128 bits of randomness in the future? I would suggest that we don't assume we can easily bump up the amount of randomness in a later version of the spec.

I think it is worth mentioning that in order to use this flag on 64 bit systems as described in #349 essentially the whole trace ID would be required to be random. The spec requires systems like this to use the right-most bytes for their short trace ID, which are the same bits you are proposing to use. This requirement is currently non-normative so it may be ok https://www.w3.org/TR/trace-context/#handling-trace-id-for-compliant-platforms-with-shorter-internal-identifiers
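A small sketch of the overlap described here, assuming a system with 64-bit internal IDs that left-pads with zeros on the way out and keeps the rightmost bytes on the way in. The function names are hypothetical.

```python
def to_full_trace_id(short_id_hex: str) -> str:
    """Left-pad a 16-hex-character (64-bit) ID with zeros to a 32-character trace-id."""
    if len(short_id_hex) != 16:
        raise ValueError("short ID must be 16 hex characters")
    return short_id_hex.rjust(32, "0")


def to_short_trace_id(trace_id_hex: str) -> str:
    """Keep the rightmost 16 hex characters (64 bits) of a 32-character trace-id."""
    if len(trace_id_hex) != 32:
        raise ValueError("trace-id must be 32 hex characters")
    return trace_id_hex[-16:]


if __name__ == "__main__":
    full = to_full_trace_id("53ce929d0e0e4736")
    print(full)
    print(to_short_trace_id(full))
```

The 63 bits covered by the proposed flag sit entirely inside the short ID, so such a system could only set the flag if nearly its whole internal ID is random.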

The original requirement for 63 bits came from a specific sampling requirement: being able to represent probabilities as low as 2^-63 (1 in 9x10^18). Was 63 specifically chosen because 2^-31 (1 in 2x10^9) is not sufficiently small? I ask only as a counterpoint to @tedsuo's comment and to make sure the rationale behind choosing 63 specifically is sufficiently considered and documented.

oertl commented

@dyladan There was never a requirement to support sampling rates as low as 2^-63. It is rather a result of the number of bits that are used to encode the number of leading zeros (NLZ) of a uniform random number. If all sampling rates are powers of 1/2, it is sufficient for consistent sampling to propagate the NLZ only. If 5 bits are used for encoding the NLZ, the minimum supported sampling rate would be 2^-31 which might not be sufficiently small. Therefore, we proposed to use 6 bits for encoding the NLZ, which resulted in a minimum sampling rate of 2^-63. Hence, we could also think of having for example just 48 random bits supporting sampling rates >= 2^-47.
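The relationship between the NLZ and power-of-two sampling rates can be sketched as follows. This is an illustration under the assumption that the random value is treated as a fixed-width unsigned integer; the function names are hypothetical and not taken from either proposal.

```python
def nlz(value: int, width: int = 63) -> int:
    """Number of leading zeros of `value` when written with exactly `width` bits."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    return width - value.bit_length()


def is_sampled(value: int, k: int, width: int = 63) -> bool:
    """Consistent sampling decision at rate 2**-k: keep the trace if NLZ >= k."""
    return nlz(value, width) >= k


def largest_encodable_nlz(nlz_bits: int) -> int:
    """Largest NLZ that fits in `nlz_bits` bits, i.e. the smallest rate is 2**-(that value)."""
    return (1 << nlz_bits) - 1


if __name__ == "__main__":
    print(largest_encodable_nlz(5), largest_encodable_nlz(6))
    print(nlz(0x53CE929D0E0E4736))
```

Because a trace kept at rate 2^-k is also kept at every higher rate, all participants reach consistent decisions from the NLZ alone.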

@bogdandrutu, @jmacd Probably a minor issue, but depending on the implementation and the random number generator used, generating a certain number of high-quality random bits could be more costly than generating the NLZ directly as in the alternative proposal (#463), which needs only 2 random bits on average and potentially allows the bits of one 64-bit random value to be used for multiple traces. Also possibly interesting in this context: a quality comparison of fast common pseudo-random number generators https://github.com/lemire/testingRNG#visual-summary.
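One way to see the "2 random bits on average" claim is to draw bits one at a time until the first 1 appears. The sketch below is an illustration, not the algorithm from #463; `draw_nlz` and its parameters are hypothetical names.

```python
import random


def draw_nlz(next_bit, max_nlz: int = 63):
    """Count zero bits drawn before the first one bit, capped at `max_nlz`.

    `next_bit` is any callable returning 0 or 1.
    Returns a tuple (nlz, bits_consumed).
    """
    zeros = 0
    consumed = 0
    while zeros < max_nlz:
        consumed += 1
        if next_bit():
            break
        zeros += 1
    return zeros, consumed


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the run repeatable
    trials = 100_000
    total = sum(draw_nlz(lambda: rng.getrandbits(1))[1] for _ in range(trials))
    print(total / trials)  # average number of bits consumed per NLZ
```

The number of bits consumed follows a geometric distribution with mean 2, so a single 64-bit random value can usually serve many traces.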

@dyladan

> I think it is worth mentioning that in order to use this flag on 64 bit systems as described in #349 essentially the whole trace ID would be required to be random. The spec requires systems like this to use the right-most bytes for their short trace ID, which are the same bits you are proposing to use. This requirement is currently non-normative so it may be ok https://www.w3.org/TR/trace-context/#handling-trace-id-for-compliant-platforms-with-shorter-internal-identifiers

This is correct, but so far the main systems I know of that are/were using 64-bit trace-ids are Zipkin and Jaeger, which both use a "fully random" trace-id. That means they will be able to set the flag in both cases if they want to make use of it in their environments.

> The original requirement for 63 bits came from a specific sampling requirement: being able to represent probabilities as low as 2^-63 (1 in 9x10^18). Was 63 specifically chosen because 2^-31 (1 in 2x10^9) is not sufficiently small? I ask only as a counterpoint to @tedsuo's comment and to make sure the rationale behind choosing 63 specifically is sufficiently considered and documented.

This is based on my experience with systems I designed, wrote, or maintained. I saw systems like Google's using sampling probabilities as low as 2^-20, so they need at least 20 random bits, and I've seen other algorithms that may require more random bits. If the concern is that we are asking for "too many" random bits, I am happy to accept other proposals like 31 or 47.
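As a sketch of why a probability of 2^-20 needs at least 20 random bits, the following compares the random part of the trace-id against a threshold. It assumes the 63-bit value is uniform; the function names are hypothetical.

```python
def sampling_threshold(probability: float, width: int = 63) -> int:
    """Number of `width`-bit values that should be kept for the given probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return int(probability * (1 << width))


def keep_trace(random_value: int, probability: float, width: int = 63) -> bool:
    """Keep the trace when its random value falls below the threshold."""
    return random_value < sampling_threshold(probability, width)


if __name__ == "__main__":
    p = 2.0 ** -20
    print(sampling_threshold(p) == 1 << 43)  # only values whose top 20 bits are zero pass
    print(keep_trace((1 << 43) - 1, p), keep_trace(1 << 43, p))
```

With fewer than 20 random bits, no threshold could separate out a 2^-20 fraction of traces, which is the constraint described above.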

@oertl

> Probably a minor issue, but depending on the implementation and the random number generator used, generating a certain number of high-quality random bits could be more costly than generating the NLZ directly as in the alternative proposal (#463), which needs only 2 random bits on average and potentially allows the bits of one 64-bit random value to be used for multiple traces. Also possibly interesting in this context: a quality comparison of fast common pseudo-random number generators https://github.com/lemire/testingRNG#visual-summary.

Adding new bytes will cause a version update and will be backwards incompatible, so I think it is a huge disadvantage compared to the proposed solution which does not require a new version of the header.

@tedsuo

> But as I mentioned in that issue, what is the possibility that we will want ~128 bits of randomness in the future? I would suggest that we don't assume we can easily bump up the amount of randomness in a later version of the spec.

Nothing stops us from adding another bit in the future that says the first 65 bits (or however many we may need) are random, with the same behavior as the proposed bit. Personally I see no reason to have this and cannot find any use case. It would also definitely make the "64 bit systems" not work with this proposal, see https://github.com/w3c/trace-context/blob/main/spec/60-trace-id-format.md#interoperating-with-existing-systems-which-use-shorter-identifiers

During the WG meeting today it was brought up that some tracing systems use non-random trace ids. @SergeyKanzhelev mentioned Apache SkyWalking as one of those systems so I took a look at their javascript agent to see what they're doing.

Their model is slightly different because each component in the trace has a "segment" and each segment may have multiple spans.

In their documentation they have:

Their header is composed of multiple fields concatenated with -:

  1. Sample. 0 or 1. 0 means that the context exists, but it could (and most likely will) be ignored. 1 means this trace needs to be sampled and sent to the backend.
  2. Trace ID. String(BASE64 encoded). A literal string that is globally unique.
  3. Parent trace segment ID. String(BASE64 encoded). A literal string that is globally unique.
  4. Parent span ID. Must be an integer. It begins with 0. This span ID points to the parent span in parent trace segment.
  5. Parent service. String(BASE64 encoded). Its length should be no more than 50 UTF-8 characters.
  6. Parent service instance. String(BASE64 encoded). Its length should be no more than 50 UTF-8 characters.
  7. Parent endpoint. String(BASE64 encoded). The operation name of the first entry span in the parent segment. Its length should be less than 150 UTF-8 characters.
  8. Target address of this request used on the client end. String(BASE64 encoded). The network address (not necessarily IP + port) used on the client end to access this target service.
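The field list above could be parsed roughly as follows. This is a sketch based only on the eight fields quoted here, not SkyWalking's own code; the type and function names are hypothetical, and standard Base64 is assumed (its alphabet does not contain "-", so splitting on "-" is safe).

```python
import base64
from typing import NamedTuple


class PropagatedContext(NamedTuple):
    sampled: bool
    trace_id: str
    parent_segment_id: str
    parent_span_id: int
    parent_service: str
    parent_service_instance: str
    parent_endpoint: str
    target_address: str


def _decode(field: str) -> str:
    """Decode one Base64-encoded field to a UTF-8 string."""
    return base64.b64decode(field).decode("utf-8")


def parse_header(value: str) -> PropagatedContext:
    """Split the header on '-' and decode the eight fields in order."""
    parts = value.split("-")
    if len(parts) != 8:
        raise ValueError("expected 8 fields separated by '-'")
    if parts[0] not in ("0", "1"):
        raise ValueError("sample field must be 0 or 1")
    return PropagatedContext(
        sampled=parts[0] == "1",
        trace_id=_decode(parts[1]),
        parent_segment_id=_decode(parts[2]),
        parent_span_id=int(parts[3]),  # plain integer, not Base64
        parent_service=_decode(parts[4]),
        parent_service_instance=_decode(parts[5]),
        parent_endpoint=_decode(parts[6]),
        target_address=_decode(parts[7]),
    )
```

Since the trace ID is only required to be a globally unique string, nothing in this format guarantees that any of its bits are random.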

Note that this does not address #463, because it makes the operator choose between using externally generated IDs (sometimes needed for legacy/compatibility reasons) and having randomness in the context.