Enable taking node temporarily offline due to specific machine issue in Adoptium #5730

Open
2 of 3 tasks
sophia-guo opened this issue Oct 31, 2024 · 10 comments

@sophia-guo
Contributor

sophia-guo commented Oct 31, 2024

Adding the SLACK_CHANNEL parameter to the configuration of https://ci.adoptium.net/view/Test_grinder/job/Test_Job_Auto_Gen/ allows a node to be taken offline when a specific machine issue is detected.

This issue was opened to track any problems that arise with this feature enabled.

16:36:43  Test_openjdk21_hs_sanity.external_x86-64_linux #36 result is FAILURE. Checking console log for specific errors...
Scripts not permitted to use new java.util.ArrayList. Administrators can decide whether to approve or reject this signature.

@sophia-guo
Contributor Author

test-azure-ubuntu2404-x64-1 failed twice with No space left on device. It was not marked offline, as No space left on device was on the error lists #5731.

https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux_testList_1/19/console

15:44:39  Exception: hudson.AbortException: Failed to run ssh-agent: mkdtemp: private socket dir: No space left on device
15:44:39  
[Pipeline] timeout

https://ci.adoptium.net/job/Test_openjdk21_hs_special.system_x86-64_linux/28/console

[Pipeline] echo
15:37:04  Exception: hudson.AbortException: Failed to run ssh-agent: mkdtemp: private socket dir: No space left on device
15:37:04  

Currently test-azure-ubuntu2404-x64-1 is marked offline. I believe it was marked offline by the Jenkins mechanism that auto-offlines machines that are low on space? @sxa, was it marked offline by infra's scheduled task? How would infra handle this case?
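
For readers following along, the console-log check mentioned above ("Checking console log for specific errors...") can be pictured roughly as below. This is only a minimal, self-contained sketch and not the actual pipeline code: the class name MachineIssueDetector, the method findMachineError, and the contents of MACHINE_ERRORS are all hypothetical. Only the "No space left on device" string comes from this thread.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: decides whether a failed job's console log points at a machine problem. */
class MachineIssueDetector {
    // Assumed error list for illustration; only this one entry is taken from the thread.
    static final List<String> MACHINE_ERRORS = List.of(
        "No space left on device"
    );

    /** Returns the first known machine-specific error found in the log, if any. */
    static Optional<String> findMachineError(String consoleLog) {
        return consoleLog.lines()
            .flatMap(line -> MACHINE_ERRORS.stream().filter(line::contains))
            .findFirst();
    }

    public static void main(String[] args) {
        String log = String.join("\n",
            "[Pipeline] echo",
            "15:37:04  Exception: hudson.AbortException: Failed to run ssh-agent: "
                + "mkdtemp: private socket dir: No space left on device");
        findMachineError(log).ifPresentOrElse(
            err -> System.out.println("Machine issue detected: " + err),
            () -> System.out.println("No machine-specific error found"));
    }
}
```

A node would only be considered for offlining when findMachineError returns a value. A failure whose message is missing from the list never triggers the offline step.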

@adamfarley
Contributor

adamfarley commented Dec 10, 2024

Heya @sophia-guo

It looks like the auto-offline logic isn't working at the moment.

In short, jobs like this one fail due to lack of space, and our attempt to take the machine offline fails with this error:

Scripts not permitted to use staticMethod hudson.model.User current. Administrators can decide whether to approve or reject this signature.

Which I presume is being caused by this code.

@llxia
Contributor

llxia commented Dec 10, 2024

Re #5730 (comment): this is not a code issue. As the error states, an Adoptium Jenkins admin needs to approve the use of staticMethod hudson.model.User current on Adoptium Jenkins.

@adamfarley
Contributor

adamfarley commented Dec 10, 2024

The problem is that the code is attempting to use a method it is not authorised to use.

You are correct in that one solution is to get Jenkins admins to authorise use of that static method.

Your solution also looks like the best one when I compared it to alternatives (such as using SimpleOfflineCause instead of UserCause, which is less optimal because I'm not seeing a trivial way to create instances of the Localizable class).

P.S. I also discovered that the setTemporarilyOffline method is deprecated in favour of setTemporaryOfflineCause. I'm noting that here in case this setTemporarilyOffline is removed in a future update.

@smlambert
Contributor

smlambert commented Dec 10, 2024

I have permitted it, but have also done so for a different method previously. Not sure what other ones will pop up. May be worth bringing that machine back online and sending a job to it to see if we get past any other approvals needed.
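
To get ahead of "what other ones will pop up", one option is to scan a console log for every sandbox rejection and list the distinct signatures that still need approval. The following is a minimal sketch under assumptions: the class name ApprovalScanner and the method pendingSignatures are hypothetical, and the regular expression only assumes the message wording seen in this thread ("Scripts not permitted to use ...").

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical triage helper: collects signatures rejected by the script sandbox. */
class ApprovalScanner {
    // Captures the signature after the fixed prefix, stopping at the
    // ". Administrators can decide..." suffix when present, else at end of line.
    private static final Pattern REJECTED = Pattern.compile(
        "Scripts not permitted to use (.+?)(?:\\. Administrators can decide.*)?$",
        Pattern.MULTILINE);

    /** Returns the distinct rejected signatures in order of first appearance. */
    static Set<String> pendingSignatures(String consoleLog) {
        Set<String> signatures = new LinkedHashSet<>();
        Matcher m = REJECTED.matcher(consoleLog);
        while (m.find()) {
            signatures.add(m.group(1).trim());
        }
        return signatures;
    }

    public static void main(String[] args) {
        String log = String.join("\n",
            "Scripts not permitted to use new java.util.ArrayList. "
                + "Administrators can decide whether to approve or reject this signature.",
            "Scripts not permitted to use staticMethod hudson.model.User current. "
                + "Administrators can decide whether to approve or reject this signature.");
        pendingSignatures(log).forEach(System.out::println);
    }
}
```

This only reports what has already been rejected in a given log. Because the sandbox stops at the first rejection, later calls in the same script would still surface one run at a time.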

@adamfarley
Contributor

That machine is back online now, though it now has much more free space (I raised an issue for it), and is unlikely to see this issue again any time soon.

@adamfarley
Contributor

Will keep an eye open for automatic machine disabling in future triage (whether it works or not).

@sophia-guo
Contributor Author

sophia-guo commented Dec 11, 2024

https://ci.adoptium.net/view/Test_grinder/job/Grinder/12049/ hit no space left again on the same agent, test-azure-ubuntu2404-x64-1. I had just added the SLACK_CHANNEL parameter, so the permission issue happened again: https://ci.adoptium.net/view/Test_grinder/job/Grinder/12050/

Also:   org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 1e6db258-b702-4687-9fc1-2b21e188b8f8
org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: Scripts not permitted to use staticMethod hudson.model.User current
	at PluginClassLoader for script-security//org.jenkinsci.plugins.scriptsecurity.sandbox.whitelists.StaticWhitelist.rejectStaticMethod(StaticWhitelist.java:258)

@smlambert, could you check whether a permission request has popped up? Once this is fixed, if we don't want this feature available for Grinder, I can remove the parameter.

@adamfarley, if the machine came back with a fix, it's odd that it ran out of space in such a short time.

@adamfarley
Contributor

@adamfarley, if the machine came back with a fix, it's odd that it ran out of space in such a short time.

Agreed. Here's the issue: adoptium/infrastructure#3843

Note that this was the second time in a month that this machine has run out of space and been resurrected as "fixed", so perhaps a lack of overall storage space isn't the problem.

@adamfarley
Contributor

adamfarley commented Dec 12, 2024

Useful link: API for "setTemporarilyOffline": https://javadoc.jenkins.io/hudson/model/Computer.html#setTemporarilyOffline(boolean,hudson.slaves.OfflineCause)

If we do go for a code fix instead of enabling us to use the currently-banned static method, I suggest this:

currentNode.setTemporaryOfflineCause(true, new hudson.slaves.OfflineCause.ByCLI("${message}"))
