Coverity builds broken
richardlau opened this issue · 19 comments
The two most recent node-daily-coverity builds have failed. There's an error about the agent going offline, but no other obvious error.
e.g. https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3010/console
make[1]: *** Deleting file '/home/iojs/build/workspace/node-daily-coverity/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/graph.o'
make[1]: *** Deleting file '/home/iojs/build/workspace/node-daily-coverity/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/graph-visualizer.o'
make[1]: *** Deleting file '/home/iojs/build/workspace/node-daily-coverity/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/graph-reducer.o'
make[1]: *** Deleting file '/home/iojs/build/workspace/node-daily-coverity/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/frame-states.o'
FATAL: Unable to delete script file /tmp/jenkins5048327983852336012.sh
java.nio.channels.ClosedChannelException
at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:155)
at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:143)
at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@67405022:JNLP4-connect connection from 147.75.72.255/147.75.72.255:58626": Remote call on JNLP4-connect connection from 147.75.72.255/147.75.72.255:58626 failed. The channel is closing down or has closed down
at hudson.remoting.Channel.call(Channel.java:996)
at hudson.FilePath.act(FilePath.java:1230)
at hudson.FilePath.act(FilePath.java:1219)
at hudson.FilePath.delete(FilePath.java:1766)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:163)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
at hudson.model.Build$BuildExecution.build(Build.java:199)
at hudson.model.Build$BuildExecution.doRun(Build.java:164)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:526)
at hudson.model.Run.execute(Run.java:1895)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
at hudson.model.ResourceController.execute(ResourceController.java:101)
at hudson.model.Executor.run(Executor.java:442)
[Agent went offline during the build](https://ci.nodejs.org/computer/test%2Dequinix%2Dubuntu2204%2Dx64%2D1/log)
ERROR: Connection was broken
When I installed the new Jenkins workspace, I downloaded a more recent version of Coverity that I deployed on the existing machines too (and I updated the Jenkins job).
This is probably the compilation running out of memory. I saw the same symptoms on Fedora hosts: the kernel kills the entire process tree when it happens, including the Jenkins agent.
The Equinix machines (where the job was running) have more CPU/RAM (#3597 (comment)) than the IBM machine (#3597 (comment)) which I put back online a few hours ago.
The job is running
V=1 cov-build --dir cov-int make -j $(getconf _NPROCESSORS_ONLN)
which for the Equinix machine was 16 (which would appear to be two threads on each of the 8 cores). We could possibly set `server_jobs` in the inventory to a lower number, which should set the `JOBS` environment variable, and then change the job to use `JOBS`. I'm a bit wary of touching test-equinix-ubuntu2204-x64-1 at the moment as it's the machine that also runs the binary temp git repository used in the fanned jobs, and the other Equinix workspace machine is currently down (#3721).
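A rough sketch of that approach (the variable names follow the discussion above; this is not the actual job config):

```shell
# Sketch: cap build parallelism via a JOBS environment variable (which the
# Ansible inventory's server_jobs setting would populate), falling back to
# the full online-CPU count when JOBS is unset.
JOBS="${JOBS:-$(getconf _NPROCESSORS_ONLN)}"
printf 'make -j %s\n' "$JOBS"

# The Jenkins job step would then become something like:
#   V=1 cov-build --dir cov-int make -j "$JOBS"
```

With `server_jobs` unset, behaviour is unchanged (full CPU count); setting it in the inventory lowers the parallelism without touching the job definition.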
The IBM machine is 2 vCPUs/4 GB RAM, which is more like the regular test machines -- maybe adding 2 GB of swap like we did for the test machines would be sufficient, although the job tends to prefer running on test-equinix-ubuntu2204-x64-1.
Or maybe we can be more drastic and shift the job to the Hetzner benchmark machines? I forget if there's a reason these had to run on the jenkins-workspace machines other than having to have the Coverity build tool installed, which I've now automated in #3722.
Here's a run with a hardcoded `-j 6`: https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3013/
That build passed. I suggest keeping the hardcoded value until a better solution is implemented.
+1 on using the Hetzner machines.
I've updated the job to run on the benchmark machines instead of jenkins-workspace (after running #3752 against the benchmark machines to install the Coverity Scan build tool). I've undone the workaround to hardcode `-j 6` (it now uses `${JOBS}`, which we can control via the Ansible inventory).
https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3037/ looks okay apart from an expected failure to upload/submit since we're limited to one upload per day. The next scheduled daily run would be expected to pass.
hmm. The scheduled build failed to upload:
https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3038/console
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 183 0 0 100 183 0 139 0:00:01 0:00:01 --:--:-- 139
100 183 0 0 100 183 0 79 0:00:02 0:00:02 --:--:-- 79
100 199 100 16 100 183 4 56 0:00:04 0:00:03 0:00:01 61
100 199 100 16 100 183 4 56 0:00:04 0:00:03 0:00:01 61
error code: 1016
parse error: Invalid numeric literal at line 1, column 6
Rerunning: https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3039/console
Do we use `jq` in the script? This error message seems to come from it.
It could be useful to print the response in case we're unable to parse it.
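One possible shape for that (the response content and variable names here are illustrative, not the job's actual script):

```shell
# Hypothetical step-1 response from the Coverity API; in the real job this
# would come from the first curl call.
response='{"url":"https://upload.example/tmp/abc","build_id":619487}'

# Try to extract the upload URL; if jq can't parse the body (e.g. an HTML
# error page or "error code: 1016"), print the raw response for debugging.
# jq's -e flag makes the exit status reflect whether a result was produced.
if ! url=$(printf '%s' "$response" | jq -er '.url'); then
  echo "Could not parse upload URL; raw response was:" >&2
  printf '%s\n' "$response" >&2
  exit 1
fi
echo "uploading to $url"
```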
Yes, we use `jq` -- the upload is a two-step process where the first step is an API call to get a JSON response that contains the temporary URL to upload to.
We are already printing the response -- in this case
error code: 1016
I just logged into test-hetzner-ubuntu2204-x64-1 and checked the response file in the workspace, which has that content.
https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3040/ failed:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 365 0 182 100 183 303 304 --:--:-- --:--:-- --:--:-- 608
500 Internal Server Error
If you are the administrator of this website, then please read this web application's log file and/or the web server's log file to find out what went wrong.
jq: error (at response:1): Cannot index number with string "url"
parse error: Invalid numeric literal at line 1, column 13
i.e. the first call to the Coverity Scan API returned
500 Internal Server Error
If you are the administrator of this website, then please read this web application's log file and/or the web server's log file to find out what went wrong.
I guess we'll need to monitor this for a while.
FWIW we only have a small sample size, but the successful run was on test-hetzner-ubuntu2204-x64-2 while the two failing runs were on test-hetzner-ubuntu2204-x64-1.
https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3041/ succeeded.
https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3042/ looks like it succeeded at first glance, but the second stage of the upload failed:
100 44 100 16 100 28 4 8 0:00:04 0:00:03 0:00:01 13
error code: 1016
https://scan.coverity.com/projects/node-js?tab=overview is currently showing "Version: v23.0.0-pre-50695e5de1" which is from 3041, but the page also says "Last Build Status: In-queue. Your build is currently being analyzed."
Both builds ran on test-hetzner-ubuntu2204-x64-1.
https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3043/console failed to upload:
Your build is already in the queue for analysis. Please wait till analysis finishes before uploading another build.
parse error: Invalid numeric literal at line 1, column 5
https://scan.coverity.com/projects/node-js?tab=overview:
I wonder if the failed to upload build from 3042 is now blocking further uploads. I've clicked "Terminate build", which responded:
The build has been scheduled for termination. There may be a delay before a new build can be resubmitted.
Retrying: https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3044/
100 44 100 16 100 28 4 8 0:00:04 0:00:03 0:00:01 13
error code: 1016
I logged into test-hetzner-ubuntu2204-x64-1 and manually ran the curl
command line to enqueue the build (the one in the job config to the URL ending /enqueue
). The first time I tried I got the same error:
iojs@test-hetzner-ubuntu2204-x64-1:~/build/workspace/node-daily-coverity$ curl --fail-with-body -X PUT -d token=<redacted> https://scan.coverity.com/projects/<redacted>/enqueue
curl: (22) The requested URL returned error: 530
error code: 1016
I immediately ran it again and it succeeded:
iojs@test-hetzner-ubuntu2204-x64-1:~/build/workspace/node-daily-coverity$ curl --fail-with-body -X PUT -d token=<redacted> https://scan.coverity.com/projects/<redacted>/enqueue
{"project_id":6507,"id":619487}
(I've added `--fail-with-body` to the command in the job in the hope that it will make the failure actually fail the build.)
This has changed https://scan.coverity.com/projects/node-js from saying the build is queued to
Last Build Status: Running. Your build is currently being analyzed
> Yes, we use `jq` -- the upload is a two-step process where the first step is an API call to get a JSON response that contains the temporary URL to upload to.
Small correction: the upload is a three-step process:
1. POST to the Coverity `/init` endpoint to get back a JSON response containing the temporary upload URL and build ID.
2. POST to the temporary upload URL with the build ID and the artifacts from the build.
3. PUT to the `/enqueue` endpoint (the manual runs above show this call made with the project token).
Of the observed failures so far:
- build #3040 failed at step 1.
- build #3042 and build #3044 failed at step 3.
- build #3043 failed because step 3 wasn't completed for the previous build #3042.
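The three steps above might look roughly like this in shell. Endpoint parameters, the `build_id` field name, and the archive name are assumptions based on the description in this thread, not the job's exact script; only `.url` and the `PUT .../enqueue` call are confirmed above.

```shell
# Sketch of the three-step Coverity Scan upload; defines functions only.
# TOKEN, PROJECT and VERSION are assumed to come from the environment.

coverity_init() {
  # Step 1: POST to /init to get a temporary upload URL and a build ID.
  curl --fail-with-body -s -X POST \
    -d "token=${TOKEN}&project=${PROJECT}&version=${VERSION}" \
    https://scan.coverity.com/projects/init > response
  upload_url=$(jq -er '.url' response)
  build_id=$(jq -er '.build_id' response)   # field name assumed
}

coverity_upload() {
  # Step 2: POST the build artifacts and build ID to the temporary URL.
  curl --fail-with-body -s -X POST "$upload_url" \
    -F "build_id=${build_id}" -F file=@cov-int.tgz
}

coverity_enqueue() {
  # Step 3: queue the uploaded build for analysis (PUT, as in the
  # manual runs earlier in this thread).
  curl --fail-with-body -s -X PUT -d "token=${TOKEN}" \
    "https://scan.coverity.com/projects/${PROJECT}/enqueue"
}
```

The observed failures map onto these functions: a 500/530 from step 1 leaves nothing queued, while a failure in step 3 leaves an uploaded-but-unqueued build that blocks the next day's run.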
https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3045/console failed at step 1:
100 199 100 16 100 183 4 53 0:00:04 0:00:03 0:00:01 57
07:49:03 curl: (22) The requested URL returned error: 530
07:49:03 error code: 1016
https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3046 failed at step 3:
100 44 100 16 100 28 4 8 0:00:04 0:00:03 0:00:01 13
07:53:23 curl: (22) The requested URL returned error: 530
07:53:23 error code: 1016
I've manually run step 3 on the machine to unstick the analysis queue in Coverity. I'll put a loop around the first and third steps so it retries a few times (with a pause between attempts).
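A minimal version of that retry wrapper (the attempt count and sleep interval here are placeholders, not the values used in the job):

```shell
# retry N cmd args... : run cmd up to N times, pausing between attempts;
# returns the last exit code if all attempts fail.
retry() {
  attempts=$1; shift
  i=1
  while :; do
    "$@" && return 0
    rc=$?
    [ "$i" -ge "$attempts" ] && return "$rc"
    echo "attempt $i failed (exit $rc); retrying..." >&2
    sleep 1   # pause between attempts
    i=$((i + 1))
  done
}

# In the job this would wrap the step 1 and step 3 curl calls, e.g.:
#   retry 3 curl --fail-with-body -s -X PUT -d "token=${TOKEN}" "${ENQUEUE_URL}"
retry 3 true && echo "succeeded"
```

Because `--fail-with-body` makes curl exit non-zero on HTTP errors, the wrapper sees transient 500/530 responses as failures and retries them.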
The three most recent Coverity builds since I added the retry loops have all succeeded without having to retry.
Since I put in the retry loop, we've only had one build failure, which occurred during the build (possibly a resource issue or agent failure): https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3056/
All other builds have succeeded and were able to submit the results to Coverity without needing to go through the retry loop, so we have no validation that the loop works or makes things better. Since the builds are succeeding at the moment and we're getting the static analysis run daily, I'm going to close this issue.