Hypertrace config is being loaded twice
The config file is being loaded twice. The config should live in the bootstrap classloader and be loaded only once.
~/projects/demo-apps/vertx-examples/web-examples(4.x) » JAVA_TOOL_OPTIONS=-javaagent:/Users/ploffay/Downloads/hypertrace-agent-all.jar HT_SERVICE_NAME=vertx-rest-demo vertx run src/main/java/io/vertx/example/web/rest/SimpleREST.java
Picked up JAVA_TOOL_OPTIONS: -javaagent:/Users/ploffay/Downloads/hypertrace-agent-all.jar
[main] INFO org.hypertrace.agent.core.config.HypertraceConfig - Config loaded: {"serviceName":"vertx-rest-demo","reporting":{"endpoint":"http://localhost:9411/api/v2/spans","opa":{"endpoint":"http://opa.traceableai:8181/","pollPeriodSeconds":30}},"dataCapture":{"httpHeaders":{"request":true,"response":true},"httpBody":{"request":true,"response":true},"rpcMetadata":{"request":true,"response":true},"rpcBody":{"request":true,"response":true},"bodyMaxSizeBytes":131072},"propagationFormats":["TRACECONTEXT"]}
[opentelemetry.auto.trace 2021-01-11 11:07:42:881 +0100] [main] INFO org.hypertrace.agent.core.config.HypertraceConfig - Config loaded: {"serviceName":"vertx-rest-demo","reporting":{"endpoint":"http://localhost:9411/api/v2/spans","opa":{"endpoint":"http://opa.traceableai:8181/","pollPeriodSeconds":30}},"dataCapture":{"httpHeaders":{"request":true,"response":true},"httpBody":{"request":true,"response":true},"rpcMetadata":{"request":true,"response":true},"rpcBody":{"request":true,"response":true},"bodyMaxSizeBytes":131072},"propagationFormats":["TRACECONTEXT"]}
Hey @pavolloffay 👋 If you're looking for someone to take on this issue, I'd be happy to help!
It looks to me like the bug is in the implementation of double-checked locking in the HypertraceConfig.get() method. The implementation here reminded me of an example described in this blog post, The "Double-Checked Locking Is Broken" Declaration. The example given in that article is very similar to the HypertraceConfig.get() implementation:
```java
// (Still) Broken multithreaded version
// "Double-Checked Locking" idiom
class Foo {
    private Helper helper = null;

    public Helper getHelper() {
        if (helper == null) {
            synchronized (this) {
                if (helper == null) {
                    helper = new Helper();
                }
            }
        }
        return helper;
    }
    // other functions and members...
}
```
The above is not a valid implementation of double-checked locking because:

...the reason it doesn't work is that the writes that initialize the Helper object and the write to the helper field can be done or perceived out of order. Thus, a thread which invokes getHelper() could see a non-null reference to a helper object, but see the default values for fields of the helper object, rather than the values set in the constructor.
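For reference, the fix that the same article describes for the JDK 5+ memory model is simply to declare the field volatile. A minimal sketch (Helper here is just a placeholder class, not anything from this codebase):

```java
// Working double-checked locking under the JDK 5+ memory model: declaring
// the field volatile prevents a reader from observing a non-null reference
// to a partially constructed Helper.
class Helper {
    final int value = 42; // placeholder state set during construction
}

class Foo {
    private volatile Helper helper = null;

    public Helper getHelper() {
        Helper result = helper; // one volatile read on the fast path
        if (result == null) {
            synchronized (this) {
                result = helper;
                if (result == null) {
                    helper = result = new Helper();
                }
            }
        }
        return result;
    }
}
```

The local `result` variable is a common refinement (popularized by Effective Java) that limits the fast path to a single volatile read.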
Solutions

There are a number of solutions to this problem; I'm not positive which would best fit the goals for the internal architecture of this project:

1. Avoid the problem entirely by explicitly initializing the AgentConfig once and using dependency injection to provide it to the current callers of HypertraceConfig.get(). This is probably the most "noisy" solution, and maybe not desirable depending on your long-term goals for the project.
2. Create a separate class, e.g. HypertraceConfigSingleton, modeled on the HelperSingleton from the double-checked locking blog post.
3. Leverage the volatile keyword. Given the read-heavy nature of config objects, making the static AgentConfig instance itself a volatile field is probably not a great solution. However, we can make the static field behave as if it were volatile by updating a volatile boolean field whenever we write the static field, as described here.
4. Leverage an existing implementation of double-checked locking, like Guava's NonSerializableMemoizingSupplier. Note that Guava takes the approach described in solution 3, where a volatile boolean field is updated after the non-volatile value field is written to; this guarantees that readers and writers agree on whether or not value is null. Guava is currently used in only one other place in javaagent-core, the VisibleForTesting annotation, which is retained by the compiler at the class level but not by the VM at runtime. Using Guava in the fashion I'm describing here would require making the class available at runtime, which would increase the size of the shaded JAR produced by this project.
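To make solution 3 concrete, here's a rough sketch of the volatile-boolean trick. The field and method names (agentConfig, initialized, load()) are my own guesses, not the actual HypertraceConfig internals:

```java
// Sketch of solution 3: the non-volatile agentConfig field is only ever read
// after a volatile read of `initialized` returns true. The write to
// agentConfig happens-before the volatile write to `initialized`, so any
// thread that sees initialized == true also sees the fully loaded config.
public class HypertraceConfig {
    // Placeholder for the real config type.
    public static class AgentConfig {
        public final String serviceName;
        AgentConfig(String serviceName) { this.serviceName = serviceName; }
    }

    private static AgentConfig agentConfig;      // non-volatile: cheap reads
    private static volatile boolean initialized; // publication flag

    public static AgentConfig get() {
        if (!initialized) {                      // volatile read (fast path)
            synchronized (HypertraceConfig.class) {
                if (!initialized) {
                    agentConfig = load();        // ordinary write...
                    initialized = true;          // ...published by volatile write
                }
            }
        }
        return agentConfig;
    }

    // Stand-in for the real file/environment-based loading logic.
    private static AgentConfig load() {
        return new AgentConfig("vertx-rest-demo");
    }
}
```

This is the same publication pattern Guava's memoizing supplier uses internally, minus the Supplier indirection.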
Based on my understanding of the solutions and this project's constraints, I think the best option is to fix the double-checked locking implementation with the volatile keyword as described in solution 3, but I'm curious to hear your thoughts! If you're not looking for contributions on this issue right now, I hope the analysis is helpful to whoever takes it on in the future!
I was curious to learn more about the internals of the Hypertrace javaagent, so I forked this repository and tried out a few of the approaches. I'll hold off on opening a PR until a consensus is reached on the desired solution, but if it's helpful, I have a branch leveraging the volatile approach described in solution 3: sample diff
In our case, I think the main issue was that the config object was first loaded in the app/system classloader; then the classes were added to the bootstrap classloader and the config was loaded again from the agent classloader.
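That explanation matches how JVM class identity works: a class is identified by its name plus its defining classloader, so static state (including a "singleton" config) exists once per loader. A standalone sketch (not agent code; assumes it is compiled and run normally, so the code source location is resolvable) illustrating why the same config class can initialize twice:

```java
import java.net.URL;
import java.net.URLClassLoader;

public class TwoLoadersDemo {
    // Hypothetical stand-in for a config holder with static state.
    public static class Config {
        static {
            System.out.println("Config loaded");
        }
    }

    // Loads Config through two independent classloaders and reports whether
    // the JVM treats the results as distinct classes.
    public static boolean loadedTwice() throws Exception {
        URL[] cp = { TwoLoadersDemo.class.getProtectionDomain().getCodeSource().getLocation() };
        // parent = null: neither loader delegates to the application
        // classloader, so each defines (and statically initializes) its own
        // copy of Config -- "Config loaded" prints twice.
        try (URLClassLoader a = new URLClassLoader(cp, null);
             URLClassLoader b = new URLClassLoader(cp, null)) {
            Class<?> ca = Class.forName(TwoLoadersDemo.class.getName() + "$Config", true, a);
            Class<?> cb = Class.forName(TwoLoadersDemo.class.getName() + "$Config", true, b);
            return ca != cb; // true: two distinct runtime classes
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("loaded twice? " + loadedTwice());
    }
}
```

Moving the config class into the bootstrap classloader avoids this because every child loader delegates to bootstrap first and therefore shares the one copy.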
The issue seems to be fixed now; at least I am not able to reproduce it anymore:
JAVA_TOOL_OPTIONS=-javaagent:/Users/ploffay/Downloads/hypertrace-agent-all.jar HT_SERVICE_NAME=vertx-rest-demo vertx run src/main/java/io/vertx/example/web/rest/SimpleREST.java
Picked up JAVA_TOOL_OPTIONS: -javaagent:/Users/ploffay/Downloads/hypertrace-agent-all.jar
[opentelemetry.auto.trace 2021-03-22 08:23:14:630 +0100] [main] INFO org.hypertrace.agent.core.config.HypertraceConfig - Config loaded: {"serviceName":"vertx-rest-demo","reporting":{"endpoint":"http://localhost:9411/api/v2/spans","opa":{"endpoint":"http://opa.traceableai:8181/","pollPeriodSeconds":30}},"dataCapture":{"httpHeaders":{"request":true,"response":true},"httpBody":{"request":true,"response":true},"rpcMetadata":{"request":true,"response":true},"rpcBody":{"request":true,"response":true},"bodyMaxSizeBytes":131072},"propagationFormats":["TRACECONTEXT"],"enabled":true,"javaagent":{}}
[opentelemetry.auto.trace 2021-03-22 08:23:14:650 +0100] [main] INFO io.opentelemetry.javaagent.tooling.VersionLogger - opentelemetry-javaagent - version: 0.10.3
[opentelemetry.auto.trace 2021-03-22 08:23:14:682 +0100] [main] WARN org.hypertrace.agent.otel.extensions.HypertraceResourceProvider - Failed to read container id, cgroups file does not exist: /proc/self/cgroup (No such file or directory)
[opentelemetry.auto.trace 2021-03-22 08:23:14:862 +0100] [main] WARN org.hypertrace.agent.otel.extensions.HypertraceResourceProvider - Failed to read container id, cgroups file does not exist: /proc/self/cgroup (No such file or directory)
Hypertrace agent started, version: 0.10.3
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:/Users/ploffay/.sdkman/candidates/vertx/3.6.0/lib/groovy-all-2.4.15.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
warning: Implicitly compiled files were not subject to annotation processing.
Use -proc:none to disable annotation processing or -implicit to specify a policy for implicit compilation.
Succeeded in deploying verticle
^C[opentelemetry.auto.trace 2021-03-22 08:23:38:106 +0100] [opentelemetry-exec-shutdown-hook] WARN io.opentelemetry.javaagent.tooling.CommonTaskExecutor - Periodic task scheduler is shutdown. Will not run: cleaner for {}
But I agree that we should use volatile. Would you like to submit a PR?
I also tried to reproduce it with the Vert.x sample app and wasn't able to; I wasn't sure if it was just a rare race condition or if it was architecture dependent. Either way, there's a good chance this was fixed upstream somewhere! Happy to open a PR for the small thread-safety issue 😄