micrometer-metrics/tracing

The value of parentId depends on sampling

avnersin opened this issue · 1 comments

Describe the bug

We are using micrometer with otel in a spring boot application.
We have a test where we call various rest endpoints (without any input trace headers) and capture trace information on the server side.
The test has an assertion that the current parent span id (tracer.currentSpan().context().parentId()) is always null.
This assertion should be correct as we do not have any nested spans in this test.
However, intermittently we see a non null parent span id.
This appears to be affected by the sampling configuration. If we set management.tracing.sampling.probability to 0.0 then the issue is not reproducible and the parent span id is always null in this test.

Environment

  • Spring Boot 3.2.5
  • micrometer-tracing 1.2.5
  • OS: macOS
  • Java version: 17

To Reproduce

tracing-test.zip

Attached is a simplified test that reproduces the issue.
Run mvn clean install or run TracingTest from your IDE.
The test runs 100 iterations of calling a rest endpoint that returns the current traceId, spanId and parentSpanId on the server side.
The assertion that the parentSpanId is null fails on an arbitrary iteration.
If we uncomment line 18, that is - if we set management.tracing.sampling.probability to 0.0 then the test does not fail.

Expected behavior

It is counter intuitive that the result of tracer.currentSpan().context().parentId() will be different for different invocations of the exact same flow. We expect to get in this flow a null span parent id, regardless of sampling.

Additional context

In the attached simplified test we could only reproduce a "0000000000000000" parent span id. In our real test (that we cannot attach here), we see "real" parent span ids. So the test attached here does not exactly reproduce the issue we see in our real test. However, in our real test we also see that the issue is not reproduced if we set management.tracing.sampling.probability to 0.0. Therefore, it looks like the cause of the issue in both tests is the decision to record a particular span which is causing a different behaviour with respect to the parent span id value.

I was able to reproduce this using purely OTel, here is a much simpler reproducer:

// using io.opentelemetry.api.trace.Tracer

@Bean
CommandLineRunner commandLineRunner(Tracer tracer) {
	return args -> {
		Span span = tracer.spanBuilder("test").startSpan();
		try (Scope ignored = span.makeCurrent()) {
			if (span instanceof ReadableSpan readableSpan) { // if the span is sampled				System.out.println(readableSpan.toSpanData().getParentSpanId());
			}
			else { // if the span is not sampled
				System.out.println(span.getClass().getName());
			}
		}
		finally {
			span.end();
		}
	};
}

You were right the behavior depends on sampling. If you add the above to your app, and enable 100% sampling in application.properties:

management.tracing.sampling.probability=1.0

you will get an "invalid" parentId from OTel: 0000000000000000 (since there is no parent).
But if you disable sampling:

management.tracing.sampling.probability=1.0

The Span that OTel returns is a PropagatedSpan which is not a ReadableSpan ( and it is not public either).

I reached out to OTel maintainers, this might be a design flaw in OTel (I'm not sure if this is intentional or not at this point). If this cannot/will not be fixed in OTel (and for earlier versions of Micrometer Tracing), I think we might be able to return the invalid id in OtelTraceContext if OTel does not return a ReadableSpan.