SSHLauncher tries to launch Windows agent for more than 13 hours
Opened this issue · 3 comments
Jenkins and plugins versions report
Environment
Jenkins: 2.361.2
OS: Linux - 4.18.0-305.65.1.el8_4.x86_64
---
Parameterized-Remote-Trigger:3.1.6.3
ace-editor:1.1
ansicolor:1.0.2
antisamy-markup-formatter:2.7
apache-httpcomponents-client-4-api:4.5.13-138.v4e7d9a_7b_a_e61
authentication-tokens:1.4
authorize-project:1.4.0
aws-credentials:191.vcb_f183ce58b_9
aws-java-sdk-ec2:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-minimal:1.12.287-357.vf82d85a_6eefd
badge:1.9.1
bootstrap4-api:4.6.0-5
bootstrap5-api:5.2.1-2
bouncycastle-api:2.26
branch-api:2.1046.v0ca_37783ecc5
build-timeout:1.24
caffeine-api:2.9.3-65.v6a_47d0f4d1fe
checks-api:1.7.5
cloud-stats:0.27
cloudbees-disk-usage-simple:170.va_fd5b_4ee6858
cloudbees-folder:6.758.vfd75d09eea_a_1
command-launcher:90.v669d7ccb_7c31
conditional-buildstep:1.4.2
config-file-provider:3.11.1
configuration-as-code:1512.vb_79d418d5fc8
credentials:1143.vb_e8b_b_ceee347
credentials-binding:523.vd859a_4b_122e6
description-setter:1.10
disk-usage:0.28
display-url-api:2.3.6
dumpling:2.2
durable-task:500.v8927d9fd99d8
ec2:2.0.2
echarts-api:5.3.3-1
email-ext:2.91
envinject:2.875.v9b_9e962da_a_ec
envinject-api:1.199.v3ce31253ed13
extended-read-permission:3.2
favorite:2.4.1
flexible-publish:0.16.1
font-awesome-api:6.2.0-2
ghprb:1.42.2
git:4.12.1
git-client:3.12.0
git-server:99.va_0826a_b_cdfa_d
github:1.35.0
github-api:1.303-400.v35c2d8258028
gitlab-plugin:1.5.35
greenballs:1.15.1
groovy:442.v817e6d937d6c
groovy-postbuild:2.5
h2-api:1.4.199
handlebars:3.0.8
htmlpublisher:1.31
instance-identity:116.vf8f487400980
ionicons-api:24.va_745a_b_a_97fc3
jackson2-api:2.13.3-285.vc03c0256d517
jakarta-activation-api:2.0.1-1
jakarta-mail-api:2.0.1-1
javadoc:226.v71211feb_e7e9
javax-activation-api:1.2.0-4
javax-mail-api:1.6.2-5
jaxb:2.3.6-1
jdk-tool:55.v1b_32b_6ca_f9ca
jersey2-api:2.37-1
jms-messaging:1.1.27
job-dsl:1.81
jobConfigHistory:1176.v1b_4290db_41a_5
jquery3-api:3.6.1-1
jsch:0.1.55.61.va_e9ee26616e7
junit:1119.1121.vc43d0fc45561
kerberos-sso:1.10
kubernetes:3706.vdfb_d599579f3
kubernetes-client-api:5.12.2-193.v26a_6078f65a_9
kubernetes-credentials:0.9.0
ldap:2.12
lockable-resources:2.16
mailer:438.v02c7f0a_12fa_4
matrix-auth:3.1.5
matrix-project:785.v06b_7f47b_c631
matrix-reloaded:1.1.3
metrics:4.2.10-389.v93143621b_050
mina-sshd-api-common:2.9.1-44.v476733c11f82
mina-sshd-api-core:2.9.1-44.v476733c11f82
momentjs:1.1.1
monitoring:1.91.0
naginator:1.18.1
node-iterator-api:49.v58a_8b_35f8363
okhttp-api:4.9.3-108.v0feda04578cf
openstack-cloud:2.63
parameterized-scheduler:1.1
parameterized-trigger:2.45
pipeline-build-step:2.18
pipeline-github-lib:38.v445716ea_edda_
pipeline-graph-analysis:195.v5812d95a_a_2f9
pipeline-groovy-lib:612.v84da_9c54906d
pipeline-input-step:451.vf1a_a_4f405289
pipeline-maven:1205.vceea_7b_972817
pipeline-milestone-step:101.vd572fef9d926
pipeline-model-api:2.2114.v2654ca_721309
pipeline-model-definition:2.2114.v2654ca_721309
pipeline-model-extensions:2.2114.v2654ca_721309
pipeline-rest-api:2.24
pipeline-stage-step:296.v5f6908f017a_5
pipeline-stage-tags-metadata:2.2114.v2654ca_721309
pipeline-stage-view:2.24
pipeline-utility-steps:2.13.0
plain-credentials:139.ved2b_9cf7587b
plugin-util-api:2.17.0
popper-api:1.16.1-3
popper2-api:2.11.6-1
postbuildscript:3.2.0-460.va_fda_0fa_26720
proc-cleaner-plugin:2.4
prometheus:2.0.11
rebuild:1.34
redhat-ci-plugin:1.5.16
resource-disposer:0.20
role-strategy:561.v9846c7351a_41
run-condition:1.5
scm-api:621.vda_a_b_055e58f7
script-security:1183.v774b_0b_0a_a_451
short-workspace-path:0.3
snakeyaml-api:1.31-84.ve43da_fb_49d0b
ssh-credentials:305.v8f4381501156
ssh-slaves:2.846.v1b_70190624f5
sshd:3.249.v2dc2ea_416e33
structs:324.va_f5d6774f3a_d
swarm:3.34
test-stability:2.3
testng-plugin:555.va0d5f66521e3
text-finder:1.20
timestamper:1.20
token-macro:308.v4f2b_ed62b_b_16
trilead-api:2.72.v2a_3236754f73
update-sites-manager:2.0.0
variant:59.vf075fe829ccb
windows-slaves:1.8.1
workflow-aggregator:590.v6a_d052e5a_a_b_5
workflow-api:1200.v8005c684b_a_c6
workflow-basic-steps:994.vd57e3ca_46d24
workflow-cps:2759.v87459c4eea_ca_
workflow-cps-global-lib:588.v576c103a_ff86
workflow-durable-task-step:1199.v02b_9244f8064
workflow-job:1236.vc3a_d1602f439
workflow-multibranch:716.vc692a_e52371b_
workflow-scm-step:400.v6b_89a_1317c9a_
workflow-step-api:639.v6eca_cd8c04a_a_
workflow-support:838.va_3a_087b_4055b
ws-cleanup:0.43
zmq-event-publisher:0.0.5
What Operating System are you using (both controller, and any agents involved in the problem)?
agent - Windows Server 2019 (CYGWIN_NT-10.0-17763)
Reproduction steps
Provision Windows nodes again and again until you hit the problem.
Expected Results
One of:
SSHLauncher
should automatically re-launchJCloudsLauncher
on Windows (re-try mechanism)- there should be timeout to drop launching (maybe re-launch)
- log error message
Actual Results
Sometimes ("randomly") provisioning fails and node is stuck in state "launch", in cloud stats I can see 13 hr and counting (then I terminated node manually), however there is no issue with the agent. I can access node from my computer with ssh
and there are no logs in /remoting
(the directory does not exist, but remoting.jar is there). Using "relaunch agent" button does not help.
Here is edited log:
SSHLauncher{host='xx.xx.xx.xx', port=22, credentialsId='xxxxxx', jvmOptions='', javaPath='', prefixStartSlaveCmd='', suffixStartSlaveCmd='', launchTimeoutSeconds=600, maxNumRetries=5, retryWaitTime=15, sshHostKeyVerificationStrategy=hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy, tcpNoDelay=true, trackCredentials=true}
[01/26/23 20:59:31] [SSH] Opening SSH connection to xx.xx.xx.xx:22.
[01/26/23 20:59:39] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
[01/26/23 20:59:39] [SSH] Authentication successful.
[01/26/23 20:59:39] [SSH] The remote user's environment is:
..
Checking Java version in the PATH
openjdk version "11.0.18" 2023-01-17
OpenJDK Runtime Environment GraalVM CE 22.3.1 (build 11.0.18+10-jvmci-22.3-b13)
OpenJDK 64-Bit Server VM GraalVM CE 22.3.1 (build 11.0.18+10-jvmci-22.3-b13, mixed mode, sharing)
[01/26/23 20:59:42] [SSH] Checking java version of /<<some-path>>/jdk/bin/java
Couldn't figure out the Java version of /<<some-path>>/jdk/bin/java
bash: /<<some-path>>/jdk/bin/java: No such file or directory
[01/26/23 20:59:42] [SSH] Checking java version of java
[01/26/23 20:59:43] [SSH] java -version returned 11.0.18.
[01/26/23 20:59:43] [SSH] Starting sftp client.
[01/26/23 20:59:44] [SSH] Copying latest remoting.jar...
[01/26/23 20:59:45] [SSH] Copied 1,339,992 bytes.
Expanded the channel window size to 4MB
[01/26/23 20:59:45] [SSH] Starting agent process: cd "<<some_hudson_workspace_path>>" && java -jar remoting.jar -workDir <<some_hudson_workspace_path>> -jar-cache <<some_hudson_workspace_path>>/jarCache
led 26, 2023 8:59:46 ODP. org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using <<some_hudson_workspace_path>>\remoting as a remoting work directory
led 26, 2023 8:59:46 ODP. org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to <<some_hudson_workspace_path>>\remoting
<===[JENKINS REMOTING CAPACITY]===>channel started
Remoting version: 3044.vb_940a_a_e4f72e
Launcher: JCloudsLauncher
Communication Protocol: Standard in/out
This is a Windows agent
Anything else?
No response
Thanks for the report! I am not sure if this is a problem of openstack-cloud
plugin in failing to bring the node online, or cloud-stats-plugin
in failing to represent node status correctly.
- Does the node report to be online in Jenkins, or not? Is the node accepting builds?
- Do you have enough samples to be confident this appears on Windows machines only?
- Can you extract Jenkins service logs and grep it for the node name?
@olivergondza thanks for getting back to me so quickly
Thanks for the report! I am not sure if this is a problem of
openstack-cloud
plugin in failing to bring the node online, orcloud-stats-plugin
in failing to represent node status correctly.
- Does the node report to be online in Jenkins, or not? Is the node accepting builds?
No, node is N/A (highlighted in red) and a job run is cancelled after timeout, but node is never terminated. I have a reason why I believe it's an openstack-cloud
issue - if you inspect logging above, you can see there is never Agent successfully connected and online
among other logging events.
Also I'd like to stress out this happen quite "randomly" (meaning I can't related it to the cause) as sometimes there is 20 nodes provisioned correctly in row, sometimes it's couple of this in row. On average, the "launch" phase is done within 14 seconds.
- Do you have enough samples to be confident this appears on Windows machines only?
I can't disclose details, but we have quite a volume of linux-based runs and I can't see it there.
- Can you extract Jenkins service logs and grep it for the node name?
I'll try that next time I see the problem and get back to you here. In short - ATM no, but I'll be watching it.
@olivergondza, me and @pematous have encountered this issue recently. We decided to workaround it by not running Jenkins Agent on the Windows machines. Instead, we spawn Windows machines using openstackMachine step, which works reliably, and then connect to the machine by its IP using SSH/Ansible, as needed.
It is less convenient than having the Jenkins agent available, but it works!