StackOverflowError on Span end when zipkin is sampling
Pgaze opened this issue · 14 comments
Describe the bug
Using version 3.1.3 of spring-cloud-starter-sleuth-zipkin, a StackOverflowError sometimes happens when ending a Span.
Stacktrace
java.lang.StackOverflowError: null
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(Unknown Source)
at java.base/java.security.SecureClassLoader.defineClass(Unknown Source)
at java.base/java.net.URLClassLoader.defineClass(Unknown Source)
at java.base/java.net.URLClassLoader$1.run(Unknown Source)
at java.base/java.net.URLClassLoader$1.run(Unknown Source)
at java.base/java.security.AccessController.doPrivileged(Unknown Source)
at java.base/java.net.URLClassLoader.findClass(Unknown Source)
at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:151)
at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
at brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:59)
at brave.internal.recorder.PendingSpans.expungeStaleEntries(PendingSpans.java:133)
at brave.internal.collect.WeakConcurrentMap.getIfPresent(WeakConcurrentMap.java:50)
at brave.internal.recorder.PendingSpans.get(PendingSpans.java:59)
at brave.Tracer.swapForPendingContext(Tracer.java:197)
at brave.Tracer.toSpan(Tracer.java:381)
at brave.Tracer.toSpan(Tracer.java:376)
at brave.LazySpan.span(LazySpan.java:141)
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:49)
at org.springframework.cloud.sleuth.instrument.prometheus.SleuthSpanContextSupplier.getSpanId(SleuthSpanContextSupplier.java:52)
at org.springframework.cloud.sleuth.instrument.prometheus.LazySleuthSpanContextSupplier.getSpanId(LazySleuthSpanContextSupplier.java:48)
at io.prometheus.client.exemplars.DefaultExemplarSampler.doSample(DefaultExemplarSampler.java:46)
at io.prometheus.client.exemplars.DefaultExemplarSampler.sample(DefaultExemplarSampler.java:34)
at io.micrometer.prometheus.PrometheusCounter.updateExemplar(PrometheusCounter.java:79)
at io.micrometer.prometheus.PrometheusCounter.increment(PrometheusCounter.java:58)
at zipkin2.reporter.metrics.micrometer.MicrometerReporterMetrics.incrementSpans(MicrometerReporterMetrics.java:111)
at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:252)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:44)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:29)
at zipkin2.reporter.brave.ZipkinSpanHandler.end(ZipkinSpanHandler.java:148)
at brave.internal.handler.NoopAwareSpanHandler$CompositeSpanHandler.end(NoopAwareSpanHandler.java:107)
at brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:57)
at brave.internal.recorder.PendingSpans.expungeStaleEntries(PendingSpans.java:133)
at brave.internal.collect.WeakConcurrentMap.getIfPresent(WeakConcurrentMap.java:50)
at brave.internal.recorder.PendingSpans.get(PendingSpans.java:59)
at brave.Tracer.swapForPendingContext(Tracer.java:197)
at brave.Tracer.toSpan(Tracer.java:381)
at brave.Tracer.toSpan(Tracer.java:376)
at brave.LazySpan.span(LazySpan.java:141)
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:49)
at org.springframework.cloud.sleuth.instrument.prometheus.SleuthSpanContextSupplier.getSpanId(SleuthSpanContextSupplier.java:52)
at org.springframework.cloud.sleuth.instrument.prometheus.LazySleuthSpanContextSupplier.getSpanId(LazySleuthSpanContextSupplier.java:48)
at io.prometheus.client.exemplars.DefaultExemplarSampler.doSample(DefaultExemplarSampler.java:46)
at io.prometheus.client.exemplars.DefaultExemplarSampler.sample(DefaultExemplarSampler.java:34)
at io.micrometer.prometheus.PrometheusCounter.updateExemplar(PrometheusCounter.java:79)
at io.micrometer.prometheus.PrometheusCounter.increment(PrometheusCounter.java:58)
at zipkin2.reporter.metrics.micrometer.MicrometerReporterMetrics.incrementSpans(MicrometerReporterMetrics.java:111)
at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:252)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:44)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:29)
at zipkin2.reporter.brave.ZipkinSpanHandler.end(ZipkinSpanHandler.java:148)
at brave.internal.handler.NoopAwareSpanHandler$CompositeSpanHandler.end(NoopAwareSpanHandler.java:107)
(...)
This only happens when the trace is sampled by Zipkin; the error has not been reproduced with spring.sleuth.sampler.probability set to 0.
It looks like the Span being ended is also one of the stale entries to be expunged by brave.internal.recorder.PendingSpans.
For context, this happens at the end of processing a Kafka consumer record, which leads to other messages being produced with the same traceId, if that helps.
That's interesting! Do you have a reproducer by any chance?
Not at the moment, I will try to make one
Unfortunately, I have not been able to reproduce it with a simple application. We will try to set the sampler probability to 1 later in our sprint to verify whether it happens randomly or every time Zipkin is sampling. I'll keep you updated with our results.
We managed to run the test today on our preprod. Setting spring.sleuth.sampler.probability to 1, we reproduced the error.
Strangely, the error does not occur on every trace, but it still appears more often than with the probability of 0.05 that we had on the first event (6 times in 4 hours vs. once a day).
After about 5 minutes, when the Kafka lag starts to grow, we get the error. That causes the consumer to shut down with "Stopping the container due to an error" and the stacktrace from the issue.
After restarting the consumer, it consumes fine for about 5 minutes, then the error occurs again.
We had the errors on different microservices that were under the same relatively high workload, but always on a Kafka consumer.
We don't find the trace in Zipkin since it fails to send it, but looking at other similar tests we can assume the number of spans is about 10 000 per trace.
10 000 spans for a trace sounds like a lot. Do you find this number of spans useful instead of overwhelming when debugging issues?
Yes, this is rather overwhelming: the code flow goes through multiple services, mainly using bus messages with Spring Cloud Stream, and there are indeed a lot of spans we just don't need but that are there nonetheless. For example: handle/streambridge/send/handle/tx. These spans are useless for us.
But as you suspected, this is useless for now; that's why we're on the verge of completely disabling Zipkin and just keeping Sleuth for log correlation.
If you're using Sleuth with Brave, you can create a SpanHandler
where you can decide which spans to filter out. The question is whether the final result will still render correctly in Zipkin.
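A minimal sketch of such a SpanHandler, assuming Sleuth's Brave bridge picks up SpanHandler beans and that this handler runs before the Zipkin reporter handler; the class name and the span names to drop are only placeholders:

import brave.handler.MutableSpan;
import brave.handler.SpanHandler;
import brave.propagation.TraceContext;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SpanFilteringConfiguration {

    // Illustrative span names only; adjust to the spans you want to drop.
    private static final Set<String> NAMES_TO_DROP =
            new HashSet<>(Arrays.asList("handle", "send", "tx"));

    @Bean
    SpanHandler filteringSpanHandler() {
        return new SpanHandler() {
            @Override
            public boolean end(TraceContext context, MutableSpan span, Cause cause) {
                // Returning false drops the span so later handlers
                // (including the Zipkin reporter) never see it.
                return span.name() == null || !NAMES_TO_DROP.contains(span.name());
            }
        };
    }
}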
It seems to be a problem with recursive closure overflow. We've got a similar problem in one of our systems, but it shouldn't produce traces that big in our case. We've got a service that handles a KafkaReceiver with concatMap, and according to metrics it's around 5 records per second on average with occasional spikes. Each of those records should generate around 15-20 spans, so that would be roughly 5 * 20 = 100 spans per second, which isn't that much.
The problem actually seems to start with exemplars. When it wants to add an exemplar to the Zipkin metric for the spans.total counter, it goes to PrometheusCounter, which asks the exemplar sampler for an exemplar. That goes to the Prometheus DefaultExemplarSampler, which calls upon Sleuth to provide the current span context, going to the Brave tracer, which goes to the pending spans, ending a span, which again increments the spans.total counter for Zipkin, and so on, blocking the thread until there is nothing left in the PendingSpans queue or a StackOverflowError is thrown (whichever comes first, but occasionally we get that error, which then completely breaks our Kafka consumer).
Our trace looks like this:
java.lang.StackOverflowError: null
at brave.internal.collect.WeakConcurrentMap.getIfPresent(WeakConcurrentMap.java:50)
at brave.internal.recorder.PendingSpans.get(PendingSpans.java:59)
at brave.Tracer.swapForPendingContext(Tracer.java:197)
at brave.Tracer.toSpan(Tracer.java:381)
at brave.Tracer.toSpan(Tracer.java:376)
at brave.LazySpan.span(LazySpan.java:141)
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:49)
at org.springframework.cloud.sleuth.instrument.prometheus.SleuthSpanContextSupplier.getSpanId(SleuthSpanContextSupplier.java:52)
at org.springframework.cloud.sleuth.instrument.prometheus.LazySleuthSpanContextSupplier.getSpanId(LazySleuthSpanContextSupplier.java:48)
at io.prometheus.client.exemplars.DefaultExemplarSampler.doSample(DefaultExemplarSampler.java:46)
at io.prometheus.client.exemplars.DefaultExemplarSampler.sample(DefaultExemplarSampler.java:34)
at io.micrometer.prometheus.PrometheusCounter.updateExemplar(PrometheusCounter.java:79)
at io.micrometer.prometheus.PrometheusCounter.increment(PrometheusCounter.java:58)
at zipkin2.reporter.metrics.micrometer.MicrometerReporterMetrics.incrementSpans(MicrometerReporterMetrics.java:111)
at org.springframework.cloud.sleuth.autoconfig.zipkin2.ZipkinAutoConfiguration$TraceMetricsMicrometerConfiguration$NoReporterMetricsBeanConfiguration$LazyMicrometerReporterMetrics.incrementSpans(ZipkinAutoConfiguration.java:293)
#### The lines below are repeated X times and are basically a copy of the lines above, up to a point ####
at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:252)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:44)
at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:29)
at zipkin2.reporter.brave.ZipkinSpanHandler.end(ZipkinSpanHandler.java:148)
at brave.internal.handler.NoopAwareSpanHandler$CompositeSpanHandler.end(NoopAwareSpanHandler.java:107)
at brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:57)
at brave.internal.recorder.PendingSpans.expungeStaleEntries(PendingSpans.java:133)
at brave.internal.collect.WeakConcurrentMap.getIfPresent(WeakConcurrentMap.java:50)
at brave.internal.recorder.PendingSpans.get(PendingSpans.java:59)
at brave.Tracer.swapForPendingContext(Tracer.java:197)
at brave.Tracer.toSpan(Tracer.java:381)
at brave.Tracer.toSpan(Tracer.java:376)
at brave.LazySpan.span(LazySpan.java:141)
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:49)
at org.springframework.cloud.sleuth.instrument.prometheus.SleuthSpanContextSupplier.getSpanId(SleuthSpanContextSupplier.java:52)
at org.springframework.cloud.sleuth.instrument.prometheus.LazySleuthSpanContextSupplier.getSpanId(LazySleuthSpanContextSupplier.java:48)
at io.prometheus.client.exemplars.DefaultExemplarSampler.doSample(DefaultExemplarSampler.java:46)
at io.prometheus.client.exemplars.DefaultExemplarSampler.sample(DefaultExemplarSampler.java:34)
at io.micrometer.prometheus.PrometheusCounter.updateExemplar(PrometheusCounter.java:79)
at io.micrometer.prometheus.PrometheusCounter.increment(PrometheusCounter.java:58)
at zipkin2.reporter.metrics.micrometer.MicrometerReporterMetrics.incrementSpans(MicrometerReporterMetrics.java:111)
at org.springframework.cloud.sleuth.autoconfig.zipkin2.ZipkinAutoConfiguration$TraceMetricsMicrometerConfiguration$NoReporterMetricsBeanConfiguration$LazyMicrometerReporterMetrics.incrementSpans(ZipkinAutoConfiguration.java:293)
My guess is that it would be possible to avoid this error for now simply by disabling exemplars in metrics; otherwise the pending spans handling needs to be revisited.
Found the reason why it might happen for us. For some records the number of DB queries executed is bigger than normal (business reasons), so while most records produce around 20 spans, some can produce 100 or more. But not thousands :) at most a hundred and a few.
Could it be related to #2244 as well? That way we would have a lot of pending spans executing the above scenario when there are many of them.
Any updates on this issue? I think we just diagnosed this same root cause with some of our applications.
Hi, any updates on this issue? We are facing the exact same problem.
Rough code to turn off exemplars (not sure it resolves the StackOverflowError, but based on the stacktrace it appears it may):
import io.prometheus.client.exemplars.Exemplar;
import io.prometheus.client.exemplars.ExemplarSampler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class NoOpExemplarSamplerConfiguration {

    // Replaces the DefaultExemplarSampler(spanContextSupplier) normally created by
    // PrometheusMetricsExportAutoConfiguration and used by io.micrometer.prometheus.PrometheusCounter,
    // so incrementing a counter never asks Sleuth for the current span context.
    @Bean
    public ExemplarSampler exemplarSampler() {
        return new ExemplarSampler() {
            @Override
            public Exemplar sample(double increment, Exemplar previous) {
                return null; // no exemplar for counters
            }

            @Override
            public Exemplar sample(double value, double bucketFrom, double bucketTo, Exemplar previous) {
                return null; // no exemplar for histograms
            }
        };
    }
}
Please upgrade to Micrometer Tracing. Spring Cloud Sleuth is feature complete and out of OSS support.