JMX agent metrics from Wildfly/Undertow broken since v0.8
Starefossen opened this issue · 23 comments
I recently started investigating how to use the jmx_exporter as a -javaagent with Wildfly 10.1.0.Final and Undertow 3.1. However, none of the Wildfly/Undertow-specific metrics were picked up by jmx_exporter when using version 0.8 or the latest, 0.9.
With some help from @cfrantsen in #87 we discovered that jmx_exporter v0.7 works correctly, suggesting that the breaking change was introduced somewhere between v0.7 and v0.8.
I have made a minimal proof of concept for this at Starefossen/docker-wildfly-prometheus if anyone is interested in debugging this issue.
Ping @brian-brazil. Is there any more information regarding this that I can give you? Unfortunately I am not experienced enough with Wildfly/Undertow/JMX to debug this any further myself. As mentioned, we are using v0.7 in production at this point, but we would like to keep up with the newest versions of jmx_exporter.
Have you tried 0.9?
Have you tried it without a config file?
If the JMX exporter is returning metrics at all, that indicates an issue at the JMX level rather than the jmx exporter level.
Sorry for the late response, Brian. Yes, same result without a config file. As I mentioned, downgrading prometheus/jmx_exporter to v0.7 without changing the Wildfly application at all solves this issue.
For reference only, the same issue exists for me in Wildfly 10.0.0.Final.
I traced the issue down to the upgrade of the simpleclient dependencies to 0.0.21: with them set back to 0.0.16 (simpleclient) and 0.0.8 (simpleclient_hotspot, simpleclient_servlet), the current code works and produces the jboss.as tree again.
So something is wrong in one of the underlying simpleclient libraries.
@brian-brazil With Wildfly 10.0.0.Final, the jboss.as tree is detected with simpleclient versions up to and including 0.0.18, but not with 0.0.19. So the break was introduced somewhere in 0.0.19.
Everything is shaded, so there should be no interactions beyond JMX. Could you git bisect to see what's going on?
@brian-brazil: that's indirectly what I did. I checked out master, changed the simpleclient dependencies, rebuilt and tested with Wildfly 10.0.0.Final.
I started with simpleclient at 0.0.16 and simpleclient_hotspot, simpleclient_servlet at 0.0.8, as they were for the 0.0.7 release of jmx_exporter.
Then I tried all simpleclient dependencies at 0.0.16 - successful.
Simpleclient dependencies at 0.0.19 - failed
Simpleclient dependencies at 0.0.18 - success
In jmx_exporter 0.7 -> 0.8 you bumped only the simpleclient dependency to 0.0.21 (hotspot and servlet stayed at 0.0.8).
In 0.8 -> 0.9 you aligned all the simpleclient dependencies to 0.0.21.
The jboss.as tree broke in 0.7 -> 0.8.
There are two years' worth of changes in there; I'd suggest switching the dependency to SNAPSHOT and building against the various versions to see which exact PR is the issue.
@brian-brazil for reference I built jmx_exporter (0.0.10-SNAPSHOT) against:
simpleclient 0.0.18 - success
simpleclient 0.0.19 - failed
simpleclient 0.0.22-SNAPSHOT - failed
Testing each as a javaagent on stock Wildfly 10.0.0.Final with no jmx config.
Note that it only fails to access the jboss.as MBeans when running as a JMX agent; when connecting via the admin URL it works (but is dog slow and requires jboss-cli.jar on the classpath).
@brian-brazil: I can confirm this is due to initialisation order of the Platform MBean Server.
Wildfly uses a custom MBeanServerBuilder (and with it a custom MBeanServer implementation) by setting the system property javax.management.builder.initial.
However, when jmx_exporter runs as a Java agent, it initialises its MBeanServer before the custom builder implementation has been loaded (javax.management.builder.initial is still null at that point).
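A minimal JDK-only sketch of the ordering problem, assuming nothing about Wildfly's own classes: `TaggedBuilder` is a hypothetical stand-in for the container's builder and `InitOrderDemo` is an illustrative name. Once `ManagementFactory.getPlatformMBeanServer()` has been called, setting `javax.management.builder.initial` afterwards does not change the platform server; only servers created later through `MBeanServerFactory` pick up the new builder.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.MBeanServerBuilder;
import javax.management.MBeanServerDelegate;
import javax.management.MBeanServerFactory;

public class InitOrderDemo {
    static final String PROP = "javax.management.builder.initial";

    // Hypothetical stand-in for a container's custom builder; it only counts what it builds.
    public static class TaggedBuilder extends MBeanServerBuilder {
        public static int created = 0;

        public TaggedBuilder() { }

        @Override
        public MBeanServer newMBeanServer(String defaultDomain, MBeanServer outer,
                                          MBeanServerDelegate delegate) {
            created++;
            return super.newMBeanServer(defaultDomain, outer, delegate);
        }
    }

    /** Returns {platform server unchanged?, custom builder used for a new server?}. */
    static boolean[] demo() {
        String previous = System.getProperty(PROP);
        // 1. The "agent" asks for the platform server before any custom builder is named.
        MBeanServer early = ManagementFactory.getPlatformMBeanServer();
        // 2. The "container" names its builder afterwards.
        System.setProperty(PROP, TaggedBuilder.class.getName());
        try {
            MBeanServer later = ManagementFactory.getPlatformMBeanServer();
            int before = TaggedBuilder.created;
            // MBeanServerFactory reads the property at call time, so this one uses TaggedBuilder.
            MBeanServerFactory.newMBeanServer();
            return new boolean[] { early == later, TaggedBuilder.created > before };
        } finally {
            // Put the property back the way we found it.
            if (previous == null) {
                System.clearProperty(PROP);
            } else {
                System.setProperty(PROP, previous);
            }
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("platform server unchanged: " + result[0]);
        System.out.println("custom builder used for new servers: " + result[1]);
    }
}
```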
Some potential ways to alleviate this:
- Introduce a configurable delay before the creation of the MBeanServer as per #97
- Add an appropriate flag to the config and, in doScrape, check whether the system property javax.management.builder.initial is set: if so, continue as normal; if it is null, return an error (e.g. 503) and retry on the next request.
I also worry the latter solution has the potential for a race condition with Wildfly.
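A sketch of the second option. The class and flag names (`BuilderGuard`, `waitForCustomBuilder`) and port 9404 are hypothetical, not jmx_exporter's actual config or code, and the JDK's built-in `com.sun.net.httpserver` is used only to keep the example self-contained. The point is that the platform MBeanServer is never touched while the property is unset, so the first real lookup happens after the container has named its builder.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class BuilderGuard {
    static final String PROP = "javax.management.builder.initial";

    // Hypothetical config flag: only scrape once the container has named its MBeanServerBuilder.
    static boolean waitForCustomBuilder = true;

    /** HTTP status a scrape should return right now: 503 until the property is set. */
    static int scrapeStatus() {
        if (waitForCustomBuilder && System.getProperty(PROP) == null) {
            return 503;
        }
        return 200;
    }

    static void handle(HttpExchange exchange) throws IOException {
        int status = scrapeStatus();
        String body;
        if (status == 200) {
            // The first platform-server lookup happens only now, after the property is set.
            int count = ManagementFactory.getPlatformMBeanServer().getMBeanCount();
            body = "mbean_count " + count + "\n";
        } else {
            // Leave the platform MBeanServer untouched; the scraper retries next interval.
            body = "MBeanServer builder not configured yet\n";
        }
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(status, bytes.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(9404), 0);
        server.createContext("/metrics", BuilderGuard::handle);
        server.start();
    }
}
```

This cannot rule out the race condition mentioned above: a non-null property only means a builder class has been named, not necessarily that it is already loadable, so a scrape in that window could still fail.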
I did attempt to release the early stock MBeanServer via MBeanServerFactory.releaseMBeanServer, but subsequent ManagementFactory.getPlatformMBeanServer calls still returned the stock MBeanServer rather than Wildfly's custom implementation.
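A small JDK-only check of that release attempt (`ReleaseDemo` is an illustrative name). `MBeanServerFactory.releaseMBeanServer` only drops the factory's own reference, or throws `IllegalArgumentException` if the server is not in its list; `ManagementFactory` holds a separate reference, so the next `getPlatformMBeanServer()` call hands back the same object either way.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;

public class ReleaseDemo {
    /** Returns true if getPlatformMBeanServer() still returns the same object after a release attempt. */
    static boolean platformServerSurvivesRelease() {
        MBeanServer first = ManagementFactory.getPlatformMBeanServer();
        try {
            // Removes the server from MBeanServerFactory's list, if it is there ...
            MBeanServerFactory.releaseMBeanServer(first);
        } catch (IllegalArgumentException notInList) {
            // ... otherwise the factory refuses because it never tracked this server.
            System.out.println("release refused: " + notInList.getMessage());
        }
        // ManagementFactory keeps its own cached reference, untouched by the factory.
        MBeanServer second = ManagementFactory.getPlatformMBeanServer();
        return first == second;
    }

    public static void main(String[] args) {
        System.out.println("same instance after release: " + platformServerSurvivesRelease());
    }
}
```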
Why does Wildfly use a custom implementation? What's different?
@brian-brazil: At a guess:
- integrate domain security, when running in a domain
- provide remote+http transport mechanism
- provide an MBeanServer that can accept simple expressions
What's a domain in this context?
provide remote+http transport mechanism
That doesn't require a different mbean implementation, the jmx exporter does this for example.
@brian-brazil:
domain is https://docs.jboss.org/author/display/WFLY10/Domain+Setup
And re remote+http transport, I did say "guess". I'm just a user of Wildfly, nothing more
That looks like a wildfly-internal idea. Usually when someone says "Join a domain" they mean a Windows domain.
@brian-brazil understood.
I have been experimenting with a couple of solutions. I have one where I "release" the MBeanServer based on a change in the javax.management.builder.initial property. It doesn't work at the moment because I think references to the MBeanServer are stored when calling things like ManagementFactory.getClassLoadingMXBean, but that is supposition. Implementing a reset capability would require modification of the simpleclient libraries too. However, the advantage is that this should work with any container/application that uses a non-standard MBeanServer and requires no extra configuration from the user.
Alternatively, I have a working solution which delays the web server start by some configurable amount as per #97. I had to implement the Describable interface (returning an empty list) to prevent the registry from calling the collect method too early. This is only so that everything can be configured at startup but made active only after the timeout. I'll push this as a potential solution to #97.
Another option is simply to delay construction of everything until the timeout passes and leave Describable as is (in that case it is called from register).
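A rough sketch of the delayed-start idea from #97 using only `java.util.concurrent`. `DelayedStart`, `scheduleStart` and reading `agentArgs` as a bare number of seconds are all hypothetical, not jmx_exporter's actual argument format. Everything that touches JMX (building collectors, `ManagementFactory.getPlatformMBeanServer()`, starting the HTTP server) goes inside the scheduled task, which matches the "delay construction of everything" variant.

```java
import java.lang.instrument.Instrumentation;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DelayedStart {
    /** Runs startExporter once after delaySeconds on a daemon thread, then stops the timer. */
    static ScheduledExecutorService scheduleStart(long delaySeconds, Runnable startExporter) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "exporter-delayed-start");
            t.setDaemon(true); // a pending start must not keep the host JVM alive
            return t;
        });
        timer.schedule(() -> {
            try {
                startExporter.run();
            } finally {
                timer.shutdown(); // one-shot: release the timer thread afterwards
            }
        }, delaySeconds, TimeUnit.SECONDS);
        return timer;
    }

    // Hypothetical agent entry point: agentArgs is read as just the delay in seconds.
    public static void premain(String agentArgs, Instrumentation inst) {
        long delay = (agentArgs == null || agentArgs.trim().isEmpty())
                ? 0L : Long.parseLong(agentArgs.trim());
        scheduleStart(delay, () -> {
            // Build collectors, look up the platform MBeanServer and start the HTTP server here,
            // so nothing touches JMX before the delay has elapsed.
            System.out.println("exporter started after " + delay + "s");
        });
    }
}
```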
Do you have any thoughts on the above?
The code in this repo calls ManagementFactory.getPlatformMBeanServer() at every scrape. So changes to the initial builder shouldn't matter, if I understand correctly.
application that uses non-standard MBeanServer and requires no extra configuration from the user.
We've thus far taken the stance that any application using anything other than the standard mbeanserver isn't supported, as we have no idea how to find it or how to deal with multiple such objects inside one JVM.
I'd still like to know why wildfly is doing something non-standard here.
Implementing a reset capability would require modification of the simpleclient libraries too.
That shouldn't be required, that'd be internal to the collector.
Alternately I have a working solution which delays the web server start by some configurable amount
That's a workaround for broken servers that can't deal with JMX requests during initialisation, often crashing when it happens. This is a different issue.
@brian-brazil: I did a bit of research and came across this:
https://developer.jboss.org/thread/240260
As an aside, I added code to deliberately release the MBeanServer at the end of the scrape by setting it to null. This didn't help: each time, the class of the MBeanServer was the same, despite javax.management.builder.initial being set by Wildfly at some point during its startup. It seems that once it is initialised via getPlatformMBeanServer it can never be released. I also attempted to release it using MBeanServerFactory.releaseMBeanServer, but it raises an exception stating it's not in the list.
Here is the initialisation code in Wildfly: https://github.com/jboss-modules/jboss-modules/blob/master/src/main/java/org/jboss/modules/Main.java#L484
@brian-brazil: I looked at OpenJDK's implementation of getPlatformMBeanServer. The documentation is a little disingenuous, as there is no way to 'reset' the platform MBeanServer once it has been created:
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/4c95cacb8ec7/src/share/classes/java/lang/management/ManagementFactory.java#l461
They use a private static to store it.
The only way I've been able to get this to work at all is either by:
- delaying the call to ManagementFactory.getPlatformMBeanServer as per #143 OR
- setting -Djavax.management.builder.initial=InvalidClass: this has the effect that the initial call(s) to getPlatformMBeanServer fail with a ClassNotFound exception, but this is corrected once Wildfly overwrites the javax.management.builder.initial value.
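A JDK-only sketch of why the second workaround recovers. `InvalidBuilderTrick` is an illustrative name, and `javax.management.MBeanServerBuilder` stands in for Wildfly's real builder class. While the property names a class that cannot be loaded, creating the platform server fails; as far as I can tell from the JDK source, this surfaces as a `JMRuntimeException` whose cause is the `ClassNotFoundException`. Because `ManagementFactory` only stores the server after creation succeeds, the first call after the property is corrected builds a fresh server with the new builder.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMRuntimeException;
import javax.management.MBeanServer;

public class InvalidBuilderTrick {
    static final String PROP = "javax.management.builder.initial";

    /** Tries to obtain the platform server; returns null if creating it failed. */
    static MBeanServer tryPlatformServer() {
        try {
            return ManagementFactory.getPlatformMBeanServer();
        } catch (JMRuntimeException e) {
            // The builder named by PROP could not be loaded, so nothing was cached
            // and the next call will attempt creation again.
            return null;
        }
    }

    public static void main(String[] args) {
        // Same effect as starting the JVM with -Djavax.management.builder.initial=InvalidClass.
        System.setProperty(PROP, "InvalidClass");
        System.out.println("early call succeeded: " + (tryPlatformServer() != null));

        // The container later overwrites the property; the stock builder stands in for it here.
        System.setProperty(PROP, "javax.management.MBeanServerBuilder");
        System.out.println("later call succeeded: " + (tryPlatformServer() != null));
    }
}
```

This only behaves as described in a JVM where nothing has created the platform server yet; once a server is cached, the property is never consulted again.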
So how does Wildfly propose that those running java agents handle this? I imagine this affects all monitoring agents, not just us.
The solution I came up with for their end would be to offer an agent that just sets the builder to their one, which you could then load before any other agents.
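A sketch of that suggested pre-agent. `BuilderPreAgent` is hypothetical, and the builder class name is passed in as the agent argument rather than hard-coded, since the thread doesn't name Wildfly's builder class. With several `-javaagent` options the JVM runs their `premain` methods in command-line order, so this one would be listed first, e.g. `-javaagent:builder-pre-agent.jar=<builder class> -javaagent:<jmx_exporter agent jar>=<its usual arguments>`, with `Premain-Class: BuilderPreAgent` in the jar manifest (the jar name is illustrative).

```java
import java.lang.instrument.Instrumentation;

/** Hypothetical agent that only names the container's MBeanServerBuilder, then gets out of the way. */
public class BuilderPreAgent {
    static final String PROP = "javax.management.builder.initial";

    // agentArgs: fully-qualified MBeanServerBuilder class name supplied on the command line.
    public static void premain(String agentArgs, Instrumentation inst) {
        if (agentArgs == null || agentArgs.trim().isEmpty()) {
            return; // nothing to configure
        }
        // Respect a value already given explicitly with -Djavax.management.builder.initial=...
        if (System.getProperty(PROP) == null) {
            System.setProperty(PROP, agentArgs.trim());
        }
    }
}
```

Setting the property is only half of it: the named class must also be loadable when the platform server is first created, and this sketch does nothing about that.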
@brian-brazil: they don't have a proposal that I am aware of. We'd have to ask on their mailing list. I speculate they'd suggest switching to a jboss-module instead of a Java agent and simply exposing the metrics via the admin port.