open-telemetry/opentelemetry-java-instrumentation

java.lang.OutOfMemoryError: Java heap space occurred in api-gateway of the Spring Petclinic Microservice variant

matthias-huber opened this issue · 9 comments

Describe the bug
I used the Spring Pet Clinic (Microservice variant, commit 8818cc9) as a basis and added the OpenTelemetry agent 1.2.0 to all microservices (api-gateway, vets-, visits-, customer-service). Traces are exported to an OTel collector, and everything runs with docker-compose. I also switched from the internal DB to a MySQL setup, but I don't think that matters here, since the api-gateway doesn't use the database.
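For context, the agent is attached to each service through docker-compose, roughly along these lines (the image name, jar path, and collector endpoint below are illustrative assumptions, not the exact compose file):

```yaml
# Sketch: attaching the OpenTelemetry javaagent to one petclinic service.
services:
  api-gateway:
    image: springcommunity/spring-petclinic-api-gateway
    environment:
      # Attach the agent without changing the image entrypoint.
      JAVA_TOOL_OPTIONS: "-javaagent:/otel/opentelemetry-javaagent-all.jar"
      OTEL_RESOURCE_ATTRIBUTES: "service.name=api-gateway"
      # OTLP/gRPC endpoint of the collector running in the same compose network.
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"
    volumes:
      # Agent jar downloaded from the GitHub releases page and mounted into the container.
      - ./opentelemetry-javaagent-all.jar:/otel/opentelemetry-javaagent-all.jar
    ports:
      - "8080:8080"
```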

I ran a small load test at around 10 req/s and the api-gateway ran out of memory:

2021-05-18 12:58:53.080 WARN 19 --- [or-http-epoll-5] i.n.c.AbstractChannelHandlerContext : An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
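A minimal load generator along these lines reproduces the traffic pattern; the endpoint and port are assumptions based on the default docker-compose setup, not the exact script used for the test:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Fires roughly 10 req/s at the api-gateway; stop with Ctrl-C after ~10 minutes.
public class GatewayLoad {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/api/gateway/owners/1")) // assumed route
                .GET()
                .build();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // One request every 100 ms; responses are discarded, only the traffic matters.
        scheduler.scheduleAtFixedRate(
                () -> client.sendAsync(request, HttpResponse.BodyHandlers.discarding()),
                0, 100, TimeUnit.MILLISECONDS);
    }
}
```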

VisualVM monitoring of the api-gateway shows the typical behaviour of a memory leak: less and less memory can be freed after each GC:

[screenshot: api-gateway-cpu]

I took a heap dump of the api-gateway after the OutOfMemoryError occurred.
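For reference, a dump like this can be captured with jmap -dump:live,format=b,file=heap.hprof <pid>, or by starting the JVM with -XX:+HeapDumpOnOutOfMemoryError. The sketch below is the in-process equivalent (the file name is arbitrary):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Programmatic heap dump via the HotSpot diagnostic MXBean.
// live = true triggers a full GC first, so only reachable objects end up in the dump.
public class HeapDump {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap("api-gateway-heap.hprof", true);
    }
}
```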

The Leak Suspects report shows:
[screenshot: petclinic_leak_suspect]

The histogram shows a lot of io.opentelemetry.javaagent.shaded.instrumentation.reactor.TracingSubscriber instances:

[screenshot: TracingSubscriber]

Merge Shortest Paths to GC Roots (with all references) shows the leak suspect "io.opentelemetry.javaagent.bootstrap.instrumentation.context.FieldBackedProvider$ContextStore$io$netty$util$concurrent$GenericFutureListener$io$netty$util$concurrent$GenericFutureListener":

[screenshot: TracingSubscriber_merge_shortest_root_gc]

Steps to reproduce
If necessary, I can try to provide a sample, or at least the important changes to the repo (via Slack).

What did you expect to see?
That the api-gateway doesn't run into the OutOfMemoryError.

What did you see instead?
java.lang.OutOfMemoryError: Java heap space

What version are you using?
1.2.0

Environment
Java:
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)

OS:
5.8.0-53-generic #60~20.04.1-Ubuntu

Additional context

@matthias-huber Thank you for the bug report and analysis. Can you try the same load test with previous agent versions as well, 1.1 and 1.0? It's not required, but it may help us narrow down the change that introduced this bug.

@iNikem probably #2851. I guess InstrumentationContext doesn't like a value referencing its key.
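To illustrate the "value referencing key" problem in isolation (a plain WeakHashMap sketch, not the agent's actual ContextStore code): when the stored value holds a strong reference back to its key, the weakly referenced key stays strongly reachable through the value, so the entry is never evicted and the map grows without bound.

```java
import java.util.Map;
import java.util.WeakHashMap;

public class ValueReferencesKeyLeak {

    // Stand-ins for the instrumented object and the context attached to it.
    static class Listener {}
    static class Context {
        final Listener key; // strong reference back to the map key
        Context(Listener key) { this.key = key; }
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Listener, Context> store = new WeakHashMap<>();
        for (int i = 0; i < 100_000; i++) {
            Listener listener = new Listener();
            // The value references the key, so the key can never be collected.
            store.put(listener, new Context(listener));
        }
        System.gc();
        Thread.sleep(500);
        // Prints 100000 instead of shrinking toward 0 after GC.
        System.out.println("entries still in the store: " + store.size());
    }
}
```

The WeakHashMap javadoc explicitly warns against this pattern; a weak-map-backed context store has the same constraint.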

@iNikem I ran the same test (10 minutes at ~10 req/s) with the 1.1.0 agent and it doesn't show the same memory-leak behaviour:

Heap usage (the second drop at ~10:09 was caused by manually triggering a GC):

[screenshot: api-gateway-cpu-1.1.0]

I also checked the live objects after the GC, and as expected only a few io.opentelemetry.javaagent.shaded.instrumentation.reactor.TracingSubscriber objects are alive:

[screenshot: tracing-subscriber-1.1.0]

So this was definitely introduced by version 1.2.0.

@breedx-splk does the iguanodon pick this regression up?

> @breedx-splk does the iguanodon pick this regression up?

It doesn't seem to, no. It uses spring-petclinic-rest, which I think doesn't trigger the reactor instrumentation.

@trask Does it make sense to call out this issue in the 1.2.0 documentation if no backport of the fix is available? It would be helpful for new adopters.

trask commented

hey @RashmiRam! do you have a suggestion on where to call out this issue? do you think new adopters are downloading older versions? thx!

AFAIK, people download the penultimate version for safety. I was thinking it could be called out in the README itself.

trask commented

What do you think about having the README strongly recommend reviewing the release notes of newer versions if you choose not to download the latest version?