StackOverflowError on Span end when zipkin is sampling
Pgaze opened this issue · 14 comments
Describe the bug
Using version 3.1.3 of spring-cloud-starter-sleuth-zipkin, a StackOverflowError sometimes happens when ending a Span.
Stacktrace
java.lang.StackOverflowError: null
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(Unknown Source)
at java.base/java.security.SecureClassLoader.defineClass(Unknown Source)
at java.base/java.net.URLClassLoader.defineClass(Unknown Source)
at java.base/java.net.URLClassLoader$1.run(Unknown Source)
at java.base/java.net.URLClassLoader$1.run(Unknown Source)
at java.base/java.security.AccessController.doPrivileged(Unknown Source)
at java.base/java.net.URLClassLoader.findClass(Unknown Source)
at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:151)
at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
at brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:59)
at brave.internal.recorder.PendingSpans.expungeStaleEntries(PendingSpans.java:133)
at brave.internal.collect.WeakConcurrentMap.getIfPresent(WeakConcurrentMap.java:50)
at brave.internal.recorder.PendingSpans.get(PendingSpans.java:59)
at brave.Tracer.swapForPendingContext(Tracer.java:197)
at brave.Tracer.toSpan(Tracer.java:381)
at brave.Tracer.toSpan(Tracer.java:376)
at brave.LazySpan.span(LazySpan.java:141)
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:49)
at org.springframework.cloud.sleuth.instrument.prometheus.SleuthSpanContextSupplier.getSpanId(SleuthSpanContextSupplier.java:52)
at org.springframework.cloud.sleuth.instrument.prometheus.LazySleuthSpanContextSupplier.getSpanId(LazySleuthSpanContextSupplier.java:48)
at io.prometheus.client.exemplars.DefaultExemplarSampler.doSample(DefaultExemplarSampler.java:46)
at io.prometheus.client.exemplars.DefaultExemplarSampler.sample(DefaultExemplarSampler.java:34)
at io.micrometer.prometheus.PrometheusCounter.updateExemplar(PrometheusCounter.java:79)
at io.micrometer.prometheus.PrometheusCounter.increment(PrometheusCounter.java:58)
at zipkin2.reporter.metrics.micrometer.MicrometerReporterMetrics.incrementSpans(MicrometerReporterMetrics.java:111)
at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:252)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:44)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:29)
at zipkin2.reporter.brave.ZipkinSpanHandler.end(ZipkinSpanHandler.java:148)
at brave.internal.handler.NoopAwareSpanHandler$CompositeSpanHandler.end(NoopAwareSpanHandler.java:107)
at brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:57)
at brave.internal.recorder.PendingSpans.expungeStaleEntries(PendingSpans.java:133)
at brave.internal.collect.WeakConcurrentMap.getIfPresent(WeakConcurrentMap.java:50)
at brave.internal.recorder.PendingSpans.get(PendingSpans.java:59)
at brave.Tracer.swapForPendingContext(Tracer.java:197)
at brave.Tracer.toSpan(Tracer.java:381)
at brave.Tracer.toSpan(Tracer.java:376)
at brave.LazySpan.span(LazySpan.java:141)
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:49)
at org.springframework.cloud.sleuth.instrument.prometheus.SleuthSpanContextSupplier.getSpanId(SleuthSpanContextSupplier.java:52)
at org.springframework.cloud.sleuth.instrument.prometheus.LazySleuthSpanContextSupplier.getSpanId(LazySleuthSpanContextSupplier.java:48)
at io.prometheus.client.exemplars.DefaultExemplarSampler.doSample(DefaultExemplarSampler.java:46)
at io.prometheus.client.exemplars.DefaultExemplarSampler.sample(DefaultExemplarSampler.java:34)
at io.micrometer.prometheus.PrometheusCounter.updateExemplar(PrometheusCounter.java:79)
at io.micrometer.prometheus.PrometheusCounter.increment(PrometheusCounter.java:58)
at zipkin2.reporter.metrics.micrometer.MicrometerReporterMetrics.incrementSpans(MicrometerReporterMetrics.java:111)
at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:252)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:44)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:29)
at zipkin2.reporter.brave.ZipkinSpanHandler.end(ZipkinSpanHandler.java:148)
at brave.internal.handler.NoopAwareSpanHandler$CompositeSpanHandler.end(NoopAwareSpanHandler.java:107)
(...)
This only happens when the trace is sampled by Zipkin; the error has not been reproduced with spring.sleuth.sampler.probability set to 0.
It looks like the Span being ended is also one of the stale entries to be expunged by brave.internal.recorder.PendingSpans.
For context, this happens at the end of processing a Kafka consumer record, which leads to other messages being produced with the same traceId, if that helps.
That's interesting! Do you have a reproducer by any chance?
Not at the moment, I will try to make one
Unfortunately, I have not been able to reproduce it with a simple application. We will try to set the sampler probability to 1 later in our sprint to verify whether it happens randomly or every time Zipkin is sampling. I'll keep you updated with our results.
We managed to run the test today on our preprod. Setting spring.sleuth.sampler.probability to 1, we reproduced the error.
Strangely, the error does not occur on every trace, but it still appears more often than with the probability of 0.05 that we had on the first event (6 times in 4 hours vs. once a day).
After about 5 minutes, when the Kafka lag starts to grow, we get the error. That causes the consumer to shut down with "Stopping the container due to an error" and the stacktrace from the issue.
After restarting the consumer, it consumes fine for about 5 minutes, then the error occurs again.
We had the errors on different microservices that were under the same relatively high workload, but always on a Kafka consumer.
We don't find the trace in Zipkin since it fails to send it, but looking at other similar tests we can assume the number of spans is about 10 000 per trace.
10 000 spans for a trace sounds like a lot. Do you find this number of spans useful instead of overwhelming when debugging issues?
Yes, this is rather overwhelming: the code flow goes through multiple services, mainly using bus messages with Spring Cloud Stream, and there are indeed a lot of spans we just don't need but that are there nonetheless. For example: handle/streambridge/send/handle/tx. These spans are useless for us.
But as you suspected, this is useless for now; that's why we're on the verge of completely disabling Zipkin and just keeping Sleuth for log correlation.
If you're using Sleuth with Brave, you can create a SpanHandler
where you can decide which spans to filter out. The question is whether the final result will still render correctly in Zipkin.
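A minimal sketch of such a SpanHandler, assuming Sleuth's Brave bridge picks up SpanHandler beans and that this handler runs before the Zipkin reporter handler; the class name and the span names to drop are only placeholders:

import brave.handler.MutableSpan;
import brave.handler.SpanHandler;
import brave.propagation.TraceContext;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SpanFilteringConfiguration {

    // Illustrative span names only; adjust to the spans you want to drop.
    private static final Set<String> NAMES_TO_DROP =
            new HashSet<>(Arrays.asList("handle", "send", "tx"));

    @Bean
    SpanHandler filteringSpanHandler() {
        return new SpanHandler() {
            @Override
            public boolean end(TraceContext context, MutableSpan span, Cause cause) {
                // Returning false drops the span so later handlers
                // (including the Zipkin reporter) never see it.
                return span.name() == null || !NAMES_TO_DROP.contains(span.name());
            }
        };
    }
}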
It seems to be a problem with recursive closure overflow. We've got a similar problem in one of our systems, but it shouldn't produce traces that big in our case. We've got a service that handles a KafkaReceiver with concatMap, and according to metrics it's around 5 records per second on average with occasional spikes. Each of those records should generate around 15-20 spans, so that would be roughly 5 * 20 = 100 spans per second, which isn't that much.
The problem actually seems to start with exemplars. When it wants to add an exemplar to the Zipkin metric for the spans.total counter, it goes to PrometheusCounter, which asks the exemplar sampler for an exemplar. That goes to the Prometheus DefaultExemplarSampler, which calls upon Sleuth to provide the current span context, going to the Brave tracer, which goes to the pending spans, ending a span, which again increments the spans.total counter for Zipkin, and so on, blocking the thread until there is nothing left in the PendingSpans queue or a StackOverflowError is thrown (whichever comes first, but occasionally we get that error, which then completely breaks our Kafka consumer).
Our trace looks like this:
java.lang.StackOverflowError: null
at brave.internal.collect.WeakConcurrentMap.getIfPresent(WeakConcurrentMap.java:50)
at brave.internal.recorder.PendingSpans.get(PendingSpans.java:59)
at brave.Tracer.swapForPendingContext(Tracer.java:197)
at brave.Tracer.toSpan(Tracer.java:381)
at brave.Tracer.toSpan(Tracer.java:376)
at brave.LazySpan.span(LazySpan.java:141)
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:49)
at org.springframework.cloud.sleuth.instrument.prometheus.SleuthSpanContextSupplier.getSpanId(SleuthSpanContextSupplier.java:52)
at org.springframework.cloud.sleuth.instrument.prometheus.LazySleuthSpanContextSupplier.getSpanId(LazySleuthSpanContextSupplier.java:48)
at io.prometheus.client.exemplars.DefaultExemplarSampler.doSample(DefaultExemplarSampler.java:46)
at io.prometheus.client.exemplars.DefaultExemplarSampler.sample(DefaultExemplarSampler.java:34)
at io.micrometer.prometheus.PrometheusCounter.updateExemplar(PrometheusCounter.java:79)
at io.micrometer.prometheus.PrometheusCounter.increment(PrometheusCounter.java:58)
at zipkin2.reporter.metrics.micrometer.MicrometerReporterMetrics.incrementSpans(MicrometerReporterMetrics.java:111)
at org.springframework.cloud.sleuth.autoconfig.zipkin2.ZipkinAutoConfiguration$TraceMetricsMicrometerConfiguration$NoReporterMetricsBeanConfiguration$LazyMicrometerReporterMetrics.incrementSpans(ZipkinAutoConfiguration.java:293)
#### The lines below are repeated X times and are basically a copy of the lines above, up to a point ####
at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:252)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:44)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:29)
at zipkin2.reporter.brave.ZipkinSpanHandler.end(ZipkinSpanHandler.java:148)
at brave.internal.handler.NoopAwareSpanHandler$CompositeSpanHandler.end(NoopAwareSpanHandler.java:107)
at brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:57)
at brave.internal.recorder.PendingSpans.expungeStaleEntries(PendingSpans.java:133)
at brave.internal.collect.WeakConcurrentMap.getIfPresent(WeakConcurrentMap.java:50)
at brave.internal.recorder.PendingSpans.get(PendingSpans.java:59)
at brave.Tracer.swapForPendingContext(Tracer.java:197)
at brave.Tracer.toSpan(Tracer.java:381)
at brave.Tracer.toSpan(Tracer.java:376)
at brave.LazySpan.span(LazySpan.java:141)
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:49)
at org.springframework.cloud.sleuth.instrument.prometheus.SleuthSpanContextSupplier.getSpanId(SleuthSpanContextSupplier.java:52)
at org.springframework.cloud.sleuth.instrument.prometheus.LazySleuthSpanContextSupplier.getSpanId(LazySleuthSpanContextSupplier.java:48)
at io.prometheus.client.exemplars.DefaultExemplarSampler.doSample(DefaultExemplarSampler.java:46)
at io.prometheus.client.exemplars.DefaultExemplarSampler.sample(DefaultExemplarSampler.java:34)
at io.micrometer.prometheus.PrometheusCounter.updateExemplar(PrometheusCounter.java:79)
at io.micrometer.prometheus.PrometheusCounter.increment(PrometheusCounter.java:58)
at zipkin2.reporter.metrics.micrometer.MicrometerReporterMetrics.incrementSpans(MicrometerReporterMetrics.java:111)
at org.springframework.cloud.sleuth.autoconfig.zipkin2.ZipkinAutoConfiguration$TraceMetricsMicrometerConfiguration$NoReporterMetricsBeanConfiguration$LazyMicrometerReporterMetrics.incrementSpans(ZipkinAutoConfiguration.java:293)
My guess is that it would be possible to avoid this error for now simply by disabling exemplars in metrics; otherwise the pending spans handling needs to be revisited.
Found the reason why it might happen for us. For some records the number of DB queries executed is bigger than normal (business reasons), so while most records produce around 20 spans, some can produce 100 or more. But not thousands :) at most a hundred and a few.
Could it be related to #2244 as well? That way we would have a lot of pending spans executing the above scenario when there are many of them.
Any updates on this issue? I think we just diagnosed this same root cause with some of our applications.
Hi, any updates on this issue? We are facing the exact same problem.
Rough code to turn off exemplars (not sure it resolves the StackOverflowError, but based on the stacktrace it appears it may):
import io.prometheus.client.exemplars.Exemplar;
import io.prometheus.client.exemplars.ExemplarSampler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class NoOpExemplarSamplerConfiguration {

    // Replaces the DefaultExemplarSampler(spanContextSupplier) normally created by
    // PrometheusMetricsExportAutoConfiguration and used by io.micrometer.prometheus.PrometheusCounter,
    // so incrementing a counter never asks Sleuth for the current span context.
    @Bean
    public ExemplarSampler exemplarSampler() {
        return new ExemplarSampler() {
            @Override
            public Exemplar sample(double increment, Exemplar previous) {
                return null; // no exemplar for counters
            }

            @Override
            public Exemplar sample(double value, double bucketFrom, double bucketTo, Exemplar previous) {
                return null; // no exemplar for histograms
            }
        };
    }
}
Please upgrade to Micrometer Tracing. Spring Cloud Sleuth is feature complete and out of OSS support.