Enable taking node temporarily offline due to specific machine issue in Adoptium
Adding the SLACK_CHANNEL parameter to the configuration of https://ci.adoptium.net/view/Test_grinder/job/Test_Job_Auto_Gen/ allows a node to be taken offline when a specific machine issue is detected.
This issue was opened to track any problems that arise with this enabled.
- Need permission to use new java.util.ArrayList https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux_testList_1/19/console.
16:36:43 Test_openjdk21_hs_sanity.external_x86-64_linux #36 result is FAILURE. Checking console log for specific errors...
Scripts not permitted to use new java.util.ArrayList. Administrators can decide whether to approve or reject this signature.
- error not included
Exception: hudson.AbortException: Failed to run ssh-agent: mkdtemp: private socket dir: No space left on device
https://ci.adoptium.net/job/Test_openjdk21_hs_special.system_x86-64_linux/28/console
- Open an infra issue correspondingly if this works fine.
- The current process for Jenkins auto-offlining machines that are low on space: if it happens, it is flagged in Nagios and in the infrastructure-bot channel, which the team monitors regularly, so action is taken once it gets our attention that way. At the moment the process isn't strict: whoever in the infra team picks it up can decide whether to raise an issue on it. https://adoptium.slack.com/archives/C53GHCXL4/p1730735251541479?thread_ts=1730227347.646299&cid=C53GHCXL4
test-azure-ubuntu2404-x64-1 was hit twice by the "No space left on device" error. It was not marked as offline as "No space left on device" was on the error lists #5731.
https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux_testList_1/19/console
15:44:39 Exception: hudson.AbortException: Failed to run ssh-agent: mkdtemp: private socket dir: No space left on device
15:44:39
[Pipeline] timeout
https://ci.adoptium.net/job/Test_openjdk21_hs_special.system_x86-64_linux/28/console
[Pipeline] echo
15:37:04 Exception: hudson.AbortException: Failed to run ssh-agent: mkdtemp: private socket dir: No space left on device
15:37:04
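The check discussed above, scanning a failed job's console log for known machine-specific errors before deciding whether to take the node offline, can be sketched roughly as follows. This is a minimal standalone Java illustration, not the actual pipeline code: the class name, method name, and the contents of the error list are hypothetical, and the only error string used is the one quoted in the logs in this thread.

```java
import java.util.List;
import java.util.Optional;

public class MachineErrorCheck {
    // Hypothetical list of machine-specific error strings.
    // Only this one entry is taken from the console logs quoted in this issue.
    public static final List<String> MACHINE_ERRORS = List.of("No space left on device");

    // Returns the first known machine-specific error found in the console log, if any.
    public static Optional<String> findMachineError(String consoleLog) {
        return MACHINE_ERRORS.stream()
                .filter(consoleLog::contains)
                .findFirst();
    }

    public static void main(String[] args) {
        String log = "15:44:39 Exception: hudson.AbortException: Failed to run ssh-agent: "
                + "mkdtemp: private socket dir: No space left on device";
        findMachineError(log).ifPresentOrElse(
                err -> System.out.println("Machine issue detected: " + err),
                () -> System.out.println("No machine-specific error found"));
    }
}
```

In a sketch like this, a match would be the trigger for the offline attempt, and a log with no match would leave the node untouched.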
Currently test-azure-ubuntu2404-x64-1 is marked offline. I believe it was marked offline by the Jenkins auto-offlining of machines that are low on space? @sxa, was it marked offline by infra's scheduled task? How would infra process this case?
Heya @sophia-guo
It looks like the auto-offline logic isn't working at the moment.
In short, jobs like this one fail due to lack of space, and our attempt to take the machine offline fails with this error:
Scripts not permitted to use staticMethod hudson.model.User current. Administrators can decide whether to approve or reject this signature.
Which I presume is being caused by this code.
Re #5730 (comment): this is not a code issue. As the error states, an Adoptium Jenkins admin needs to permit the use of staticMethod hudson.model.User current on the Adoptium Jenkins.
The problem is that the code is attempting to use a method it is not authorized to use.
You are correct in that one solution is to get Jenkins admins to authorise use of that static method.
Your solution also looks like the best one when I compared it to alternatives (such as using SimpleOfflineCause instead of UserCause, which is less optimal because I'm not seeing a trivial way to create instances of the Localisable class).
P.S. I also discovered that the setTemporarilyOffline method is deprecated in favour of setTemporaryOfflineCause. I'm noting that here in case this setTemporarilyOffline is removed in a future update.
I have permitted it, but have also done so for a different method previously. Not sure what other ones will pop up. May be worth bringing that machine back online and sending a job to it to see if we get past any other approvals needed.
That machine is back online now, though it now has much more free space (I raised an issue for it), and is unlikely to see this issue again any time soon.
Will keep an eye open for automatic machine disabling in future triage (whether it works or not).
https://ci.adoptium.net/view/Test_grinder/job/Grinder/12049/ hit "No space left on device" again on the same agent, test-azure-ubuntu2404-x64-1. I had only just added the SLACK_CHANNEL parameter to Grinder, so the permission issue happened again: https://ci.adoptium.net/view/Test_grinder/job/Grinder/12050/
Also: org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 1e6db258-b702-4687-9fc1-2b21e188b8f8
org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: Scripts not permitted to use staticMethod hudson.model.User current
at PluginClassLoader for script-security//org.jenkinsci.plugins.scriptsecurity.sandbox.whitelists.StaticWhitelist.rejectStaticMethod(StaticWhitelist.java:258)
@smlambert maybe you can check whether a permission approval message pops up? Once this is fixed, if we don't want this feature available for Grinder, I can remove the parameter.
@adamfarley if the machine is back with a fix, it's weird that it ran out of space in such a short time.
Agreed. Here's the issue: adoptium/infrastructure#3843
Note that this was the second time in a month that this machine has run out of space and been resurrected as "fixed", so perhaps a lack of overall storage space isn't the problem.
Useful link: API for "setTemporarilyOffline": https://javadoc.jenkins.io/hudson/model/Computer.html#setTemporarilyOffline(boolean,hudson.slaves.OfflineCause)
If we do go for a code fix instead of enabling us to use the currently-banned static method, I suggest this:
currentNode.setTemporaryOfflineCause(true, new hudson.slaves.OfflineCause.ByCLI("${message}"))